Recovering the phase and amplitude of X-ray FEL pulses using neural networks and differentiable models

Dynamics experiments are an important use-case for X-ray free-electron lasers (XFELs), but time-domain measurements of the X-ray pulses themselves remain a challenge. Shot-by-shot X-ray diagnostics could enable a new class of simpler and potentially higher-resolution pump-probe experiments. Here, we report training neural networks to combine low-resolution measurements in both the time and frequency domains to recover X-ray pulses at high-resolution. Critically, we also recover the phase, opening the door to coherent-control experiments with XFELs. The model-based generative neural-network architecture can be trained directly on unlabeled experimental data and is fast enough for real-time analysis on the new generation of MHz XFELs.

observing the pattern of energy loss in the electron beam that generated the X-rays using an X-band transverse cavity (XTCAV) [17][18][19]. The XTCAV is the dominant method used for pulse measurement at LCLS, but the resolution is not sufficient to observe individual SASE spikes except at the longest wavelengths, and the XTCAV measurement contains no phase information. On the other hand, frequency-domain measurements are relatively straightforward [20,21], and there is a long history in optics of applying iterative phase retrieval to spectral measurements to reconstruct ultrafast pulses (see e.g., FROG, SPIDER, and more recent variations [22][23][24][25][26][27][28]). The same concept can be applied to XFELs, combining corrupted measurements of the X-ray power in both the time and frequency domains to improve the resolution of the time-domain measurement [29,30]. In this paper we extend the work of [30] to reconstruct the full field, giving not only higher resolution of the time-domain power, but also revealing the relative phase between spikes. We do so with an efficient machine-learning implementation that can run at the full beam-rate of future MHz XFELs.

Pulse reconstruction task and proposed method
Formally, our pulse reconstruction scheme seeks to recover the X-ray field, $f(t) \in \mathbb{C}^{p}$, given the time-domain power, $P(t) \in \mathbb{R}^{m_t}$, and the spectral power, $S(\nu) \in \mathbb{R}^{m_\nu}$, where the measurements are divided into $m_t$ and $m_\nu$ discrete time and frequency points respectively and the reconstructed field has $p$ discrete points. In practice, we are only given a corrupted measurement of both domains, which we will denote $\tilde{P}(t)$, $\tilde{S}(\nu)$, and our goal is to find a mapping

$$\tilde{P}(t), \tilde{S}(\nu) \to f(t). \tag{1}$$

Note that recovery of a unique phase is not strictly guaranteed for one-dimensional functions. However, the correct solution can often be found given sufficient constraints from the time and frequency properties of the function, and it has been shown [31] that a one-dimensional function is nearly uniquely determined from its modulus and the modulus of its Fourier transform. For the case of pulse reconstruction, we must accept certain unavoidable ambiguities, for example the overall phase of the field or time-reversal of the phase for time-symmetric pulses. In the special case of a multi-spike SASE FEL pulse, which can be modeled as a sum of finite, independent Gaussian pulses, we will see that recovery is often feasible and unique. Equation (1) is an inverse problem: while the mapping from power to field is difficult to learn, we can write down a 'forward' mapping $\Phi : f(t) \to \tilde{P}(t), \tilde{S}(\nu)$ that, if given the true field, can predict the associated corrupted measurements:

$$\tilde{P}(t) = R_t \star |f(t)|^2 + N_t, \qquad \tilde{S}(\nu) = R_\nu \star \left|\mathrm{FT}\{f(t)\}\right|^2 + N_\nu, \tag{2}$$

where $R_t$ and $R_\nu$ are the measurement resolution functions in the time and frequency domains respectively, $N_t$ and $N_\nu$ are Gaussian measurement noise in the time and frequency domains respectively, FT denotes a Fourier transform, and '$\star$' denotes a convolution. Inverting Eq. (2) is difficult due to the loss of phase information. The existing state of the art for X-ray FEL pulse recovery [30] works by guessing a solution for $f(t)$, and iteratively updating that guess to improve consistency with Eq. (2). The problem is that the iterative task must be repeated for every pulse; with the LCLS-II upgrade operating at up to a MHz (generating tens of billions of pulses for a single experiment) and each iterative solution taking seconds to converge (see Appendix, Section C), recovering the field for every pulse in an experiment is unrealistic.
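The forward mapping of Eq. (2) is straightforward to express in a differentiable framework. Below is a minimal PyTorch sketch; the function names and the treatment of the resolution kernels are illustrative assumptions, not taken from the paper's code:

```python
import torch


def blur(x, kernel):
    """1D convolution with a normalized resolution kernel ('same' padding)."""
    k = (kernel / kernel.sum()).view(1, 1, -1)
    return torch.nn.functional.conv1d(
        x.unsqueeze(1), k, padding=k.shape[-1] // 2
    ).squeeze(1)


def forward_model(field, R_t, R_nu, sigma_t=0.0, sigma_nu=0.0):
    """Differentiable sketch of the forward mapping Phi: f(t) -> (P~, S~).

    field: complex tensor of shape (batch, p), the candidate field f(t)
    R_t, R_nu: real 1D tensors holding the time/frequency resolution kernels
    sigma_t, sigma_nu: rms Gaussian measurement noise levels
    (All names here are illustrative, not from the paper's code.)
    """
    # Time-domain power, blurred by the diagnostic resolution function
    P = torch.abs(field) ** 2
    P_meas = blur(P, R_t) + sigma_t * torch.randn_like(P)

    # Spectral power from the Fourier transform, blurred likewise
    S = torch.abs(torch.fft.fft(field, dim=-1)) ** 2
    S_meas = blur(S, R_nu) + sigma_nu * torch.randn_like(S)
    return P_meas, S_meas
```

Because every operation here is differentiable, gradients can flow through the measurement model back to whatever produced the field, which is the property the PINN training below relies on.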
Instead, we will train a neural network (NN) inverse mapping, an approach that has found success for inverse problems in other areas of physics [32], including recent works for two-dimensional phase retrieval [33][34][35][36] and accelerators [37]. Though NN training is computationally expensive, inference requires only a single forward pass through the network, potentially many orders of magnitude faster than the existing iterative solution. Neural networks can also leverage prior information contained in the training set (e.g., the FEL coherence length), which could both recover a higher-resolution amplitude and also reveal the phase of the field. In our case, the NN's input features are the power measurements, $x: \tilde{P}(t), \tilde{S}(\nu)$, and the labels are the complex field, $y: f(t)$. The complex field labels, $y$, consist of either the field's real and imaginary components, $\Re\{f(t)\}, \Im\{f(t)\}$, or, by writing the field as $f(t) = A(t)\exp(i\phi)$, the amplitude and phase, $A, \phi$; in the first case we form $y$ by concatenating $\Re\{f(t)\}$ and $\Im\{f(t)\}$, and in the second case by concatenating $A$ and $\phi$, in either representation making a vector of $2p$ labels for each example. We then train the network by minimizing the difference between the predicted labels, $y_\mathrm{pred}$, and the ground-truth labels, $y_\mathrm{GT}$, with a mean absolute error loss function

$$L_\mathrm{mae} = \frac{1}{2pn}\sum_{i=1}^{n}\sum_{j=1}^{2p}\left| y_{\mathrm{pred},ij} - y_{\mathrm{GT},ij} \right|, \tag{3}$$

where the sum is over the $n$ training examples and the $2p$ labels in each example. For the amplitude/phase representation we again use mean absolute error for the amplitudes, but use a modified loss function on the phase portion of the labels,

$$L_\mathrm{sin} = \frac{1}{pn}\sum_{i=1}^{n}\sum_{j=1}^{p} A_{\mathrm{GT},ij}\left| \sin\!\left(\frac{\phi_{\mathrm{pred},ij} - \phi_{\mathrm{GT},ij}}{2}\right) \right|. \tag{4}$$

The sine loss was designed so that a phase error of π has maximal penalty, and an offset of 2π is not penalized. Including $A_\mathrm{GT}$ ensures that phase errors are ignored in regions with zero amplitude where the phase is undefined. In the end, we will train with the real/imaginary representation, but the phase representation is useful for evaluating the performance separately on amplitude and phase components.
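The amplitude-weighted sine loss on the phase labels is compact to implement; the sketch below is a hedged illustration (the mean reduction is an assumption):

```python
import torch


def phase_sine_loss(phi_pred, phi_gt, amp_gt):
    """Amplitude-weighted sine loss on the phase labels (a sketch).

    A phase error of pi receives the maximal penalty, an offset of 2*pi is
    not penalized, and the ground-truth amplitude weight ignores phase
    errors where the field vanishes and the phase is undefined.
    """
    return torch.mean(amp_gt * torch.abs(torch.sin((phi_pred - phi_gt) / 2)))
```

For example, a uniform phase offset of 2π produces zero loss, while an offset of π saturates the penalty at the amplitude weight.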
The difficulty in training a NN with Eq. (3) again lies in the phase recovery: because the absolute phase cannot be measured, each pulse corresponds to a family of solutions

$$f(t)\, e^{i\xi}, \tag{5}$$

with $\xi \in [0, 2\pi)$ a manifold of solutions that are all equally valid. During training, there is no way for the network to know which particular solution from the manifold was chosen as the label, and the loss between the given label and another equally valid solution on the same manifold could be as large as that of an incorrect solution on a different manifold. Figures 1(c) and 1(d) illustrate an example: given the solid lines as labels, Eq. (3) would strongly penalize a NN for producing the equally valid dashed lines.
In this paper, we employ two solutions to handle the arbitrary absolute phase. First, we select a single solution on the phase manifold so that each example corresponds to a well-defined label. We chose to define $\phi = 0$ (or equivalently a purely real field) in the center of the pulse, and then train using Eq. (3). Though the definition collapses the phase manifold to a single point, if the center of the pulse is a waist between spikes, the phase may be changing rapidly, which will amplify errors. We will call this first approach the '$L_\mathrm{mae}$ NN.' Second, rather than restricting the labels to a single point on the manifold, we instead map the entire manifold back onto the well-defined input space, and define the loss function on the input space. The loss function then becomes

$$L_\Phi = \frac{1}{n}\sum_{i=1}^{n}\left| \Phi(y_{\mathrm{pred},i}) - x_i \right|, \tag{6}$$

where $\Phi$ is the forward model defined by Eq. (2). This second approach is functionally equivalent to the physics-informed neural network (PINN), which was previously introduced to solve inverse problems in the context of partial-differential equations [38]. We will use the PINN terminology going forward. We can understand the PINN approach through the concept of generative NNs: rather than trying to predict a specific label, the NN learns to draw samples from the correct manifold. We start by considering a generative adversarial network (GAN) [39], which consists of two networks: a generator that creates new examples given a random 'latent' variable, and a discriminator that classifies examples as from the training set or from the generator; the training process rewards the discriminator for correctly classifying examples, and rewards the generator when it tricks the discriminator. In our case, the first network would need to be a "conditional" generator [40], i.e. it would take power measurements $\tilde{P}(t)$, $\tilde{S}(\nu)$ as inputs, and then generate corresponding example fields $\Re\{f(t)\}$, $\Im\{f(t)\}$ as outputs. The generator's task is to learn to produce outputs drawn from the manifold corresponding to the inputs.
In a typical GAN, the discriminator's role is to identify examples drawn from an incorrect manifold. In our setup we can replace the learned NN discriminator with the known, physical forward model, Eq. (2), which maps the generator's output back onto the measured input power space. This forward model loss function plays the same role as the discriminator, penalizing solutions from the generator that are drawn from the wrong manifold. Crucially, by writing the forward model in a differentiable language (in this case PyTorch), we can directly apply the forward model to the NN output and train with standard NN platforms.
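As a sketch of this idea, the forward model can be applied directly to the network's output inside a custom loss, so gradients flow through the physics back to the NN parameters. All names below are illustrative; the forward model is passed in as any differentiable callable returning predicted power and spectrum:

```python
import torch


def pinn_loss(net, P_meas, S_meas, forward_model):
    """Unlabeled, discriminator-free loss: map the predicted field back
    through the physics forward model and compare against the measured
    inputs. No ground-truth fields are needed."""
    x = torch.cat([P_meas, S_meas], dim=-1)       # network input features
    y = net(x)                                     # (batch, 2p): Re then Im
    p = y.shape[-1] // 2
    field = torch.complex(y[..., :p], y[..., p:])  # reassemble complex field
    P_pred, S_pred = forward_model(field)          # predicted measurements
    return (torch.mean(torch.abs(P_pred - P_meas)) +
            torch.mean(torch.abs(S_pred - S_meas)))
```

Because the loss is built entirely from differentiable tensor operations, `loss.backward()` propagates gradients through the physics model into the generator's weights, so standard optimizers apply unchanged.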
Before continuing, we note that similar forward-model approaches have been implemented over the last few years. For example, automatic differentiation has been used to incorporate physical models into inverse problems [34,[41][42][43], and is expected to play a growing role in machine learning for the sciences broadly [44]. We also note similarity of our approach to the recently published phaseGAN [36], which is in turn derived from the cycleGAN architecture [45]; both make use of untrained models to build loss functions without labels. In a similar spirit, recent works have used another type of generative model known as a variational autoencoder to resolve ambiguities in signal reconstruction [46,47]. Finally, we note that in our example we do not want to map out the full manifold, so there is no need to input a random latent variable. As such our architecture is not strictly-speaking a generative network, and instead becomes equivalent to a PINN.
We present both training architectures in Fig. 2. We highlight that Eq. (6) does not include $y_\mathrm{GT}$, meaning that the PINN is trained without labeled data; the only constraint is that the predictions must be consistent with the measurements, as defined by $\Phi$. We can further extend the PINN concept to also force the predictions to obey statistical properties. For example, FEL pulses have a correlation length given by the Pierce ρ-parameter, which can be observed through the first-order correlation function [48]

$$g_1(\tau) = \frac{\left\langle f^*(t)\, f(t+\tau) \right\rangle}{\sqrt{\left\langle |f(t)|^2 \right\rangle \left\langle |f(t+\tau)|^2 \right\rangle}}, \tag{7}$$

where the expectation values are taken over time, $t$, and $\tau$ is the time-domain delay. Just as Eq. (6) forces the solution to be consistent with the measurements, the $L_{g_1}$ loss penalizes solutions for not matching the expected $g_1(\tau)$ function:

$$L_{g_1} = \frac{1}{d}\sum_{\tau=1}^{d}\left| g_{1,\mathrm{pred}}(\tau) - g_{1,\mathrm{GT}}(\tau) \right|, \tag{8}$$

where $d$ is the number of delays in the comparison. The ground-truth correlation function, $g_{1,\mathrm{GT}}$, can be approximated analytically from the measured FEL gain length (see e.g., Eq. 6.67 in [48]) or calculated numerically from a small number of simulations, so as in Eq. (6) there is no need for a full labeled dataset. Note that $g_1$ is a statistical property, and the expectation value in Eq. (7) is taken over all examples in each training batch, so there is no explicit summation over examples in Eq. (8).

Fig. 2. Schematic of the training flow. At left, the input to the NN inverse model is the power and spectrum from either measured or simulated pulses. The NN outputs are the real and imaginary field components (center, green). If using simulations, a standard labeled loss function (e.g., Eq. (3)) updates the NN parameters to minimize the difference between predicted and simulated fields. Alternatively, a physics forward model, $\Phi$, maps the predicted outputs into predicted inputs (right, green), and an unlabeled loss function (Eq. (6)) updates the NN parameters by comparison to the true inputs. Additional loss terms can be imposed directly on the statistical properties of the predictions, for example the first-order correlation, $g_1$ (Eq. (7)). Combinations of all three loss functions are also possible.
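The batch-averaged first-order correlation and the corresponding penalty can be sketched as follows; the discrete estimator below, with integer time-step delays, is an assumption consistent with Eq. (7):

```python
import torch


def g1(field, delays):
    """First-order correlation g1(tau) estimated over a batch of fields.

    The expectation is over time and over all examples in the batch, as in
    the text; 'delays' is an iterable of integer time-step offsets (sketch).
    """
    vals = []
    for tau in delays:
        a = field[:, : field.shape[1] - tau]
        b = field[:, tau:]
        num = torch.mean(torch.conj(a) * b)
        den = torch.sqrt(torch.mean(torch.abs(a) ** 2) *
                         torch.mean(torch.abs(b) ** 2))
        vals.append(torch.abs(num) / den)
    return torch.stack(vals)


def g1_loss(field, g1_gt, delays):
    """L_g1-style penalty: mean absolute deviation of the batch g1 from
    the expected ground-truth correlation function."""
    return torch.mean(torch.abs(g1(field, delays) - g1_gt))
```

By construction $g_1(0) = 1$, and for SASE-like fields with random spike phases the estimated correlation decays toward zero at delays longer than the coherence length, which is exactly the statistical structure the loss enforces.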

Results and discussion
We have implemented NNs using both a standard labeled loss function with $L_\mathrm{mae}$, as well as a PINN trained with $L_\Phi$. The simulation parameters are taken from a typical short-pulse, soft X-ray mode at LCLS, with 2.5 nm wavelength, 8-12 fs pulse length, and 2-5 spikes per pulse. The training set consists of 120k 1D FEL simulations [49], with a split of 100k/10k/10k for training, validation, and testing. The $L_\mathrm{mae}$ NN was trained with both real/imaginary and amplitude/phase labels and both had similar performance, so we only present the real/imaginary-trained NN here. We run simulations with various diagnostic resolutions $R_t$ and $R_\nu$, and apply measurement noise of 2% during training and testing. Because the forward model may not be known perfectly, particularly for the measured $R_t$ resolution function of the time-domain measurements, we introduce errors during training to make the network less susceptible to $R_t$ errors during inference.
In the examples given, we train with an average $R_t$ 10% higher than the value used in the test set. We then randomly vary the value of $R_t$ by 15% rms during creation of the training dataset for both methods, and additionally in the loss function (Eq. (2)) while training the PINN. Note that robustness to systematic errors has not been studied and requires more assessment. We start by assuming FWHM resolutions of $R_t = 3$ fs for the XTCAV and $R_\nu = 80$ meV for the spectrometer, with results shown in Figs. 3 and 4. Table 1 gives summary statistics for other resolution values. To give a sense for the quality of predictions, we show additional pulse examples in the appendix, Figs. 5 and 6. We find that training with $L_\mathrm{mae}$ gives the best performance. However, some users may still prefer the PINN because it takes only unlabeled inputs for training, and thus is trainable directly on experimental data; there is no need to simulate a new training dataset matching the exact experimental conditions, and the parameters of the training set are guaranteed to match those of the test data. The PINN is also more resistant to over-training, with equal losses for training and test sets. A combined training, using both $L_\Phi$ and $L_{g_1}$ loss functions and with $L_\mathrm{mae}$ applied to the amplitude portion of the outputs, had the best performance predicting field amplitudes (see Appendix, Section A). The higher accuracy of the labeled NNs compared to the PINN is expected; because the labeled NNs are trained on pulses drawn from the same distribution as the test data, these models effectively have a strong prior influencing the field reconstruction. The poor temporal resolution of the power measurement increases the impact of the $L_\mathrm{mae}$ NN's prior. While using the $L_{g_1}$ loss gives the PINN weak prior information, additional physical constraints may be needed for the PINN's performance to equal that of the $L_\mathrm{mae}$ NN.
Fig. 5. Example plots of reconstructions using the $L_\mathrm{mae}$ NN, with field errors of 0.11 for row 1 (median score for 5% cut on forward model error), 0.31 for row 2 (worst case for 5% cut on forward model error), and 0.55 for row 3 (99th percentile worst score over all). All plots use $R_t = 3$ fs and $R_\nu = 80$ meV.

Fig. 6. Example pulse reconstructions using the PINN, with field errors of 0.14 for row 1 (median score for $R_t = R_\nu = 0$), 0.28 for row 2 (median score for 5% cut on forward model error with $R_t = 3$ fs, $R_\nu = 80$ meV), and 1.03 for row 3 (90th percentile worst score over all with $R_t = 3$ fs, $R_\nu = 80$ meV).
The PINN's predicted fields have larger errors than the $L_\mathrm{mae}$ NN's, but in many examples the corresponding forward-model errors are approximately the same. This observation could indicate inherent phase ambiguity, which may be present for fields with a high degree of symmetry. However, for SASE FEL pulses with at least a few spikes, the phase errors appear dominated by the poor time-resolution of the XTCAV diagnostic. As a check, we trained a PINN under the assumption of perfect XTCAV resolution (see the final line of Table 1). We find field errors decrease by a factor of 3.7 and phase errors decrease by a factor of 3.3 compared to the standard 3 fs case, confirming that the phase reconstruction errors in Figs. 3 and 4 are dominated by diagnostic resolution rather than inherent ambiguities in the phase. (See Figs. 6(a) and (b) for an example pulse with median error.) Indeed, in this idealized case, the PINN nearly matches the $L_\mathrm{mae}$ NN performance. Even though the PINN is not able to fully resolve features at the time-scale of single spikes with realistic diagnostics, the predictions still provide more information about the amplitude and phase than is currently available to users.
With any machine learning algorithm, it is important to understand when the method fails. The PINN has the advantage that it is trained by the same metric as used in [30], so a user can easily compare the results through the residuals of the predicted inputs. On the other hand, in Fig. 4 the $L_\mathrm{mae}$ NN's field errors are correlated with $L_\Phi$, which can be calculated even for test data. We can use this correlation to throw away points that are more likely to have large field errors. For LCLS-II, when an experiment can collect a million shots per second, it may be feasible to throw away questionable data aggressively. For example, a cut retaining just 5% of the data reduces the $L_\mathrm{mae}$ NN's median field label error by 30% and the maximum error by more than a factor of three (see red points in Fig. 4). Similarly for the PINN, a 5% cut reduces the median field label error by 40% and the maximum field error by a factor of 2. Future work could also estimate the uncertainty of the networks themselves on a shot-by-shot basis, for example using Bayesian dropout during inference [50].
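A quality cut of this kind reduces to sorting shots by their forward-model residual and keeping the lowest fraction. A minimal sketch, with illustrative names:

```python
import numpy as np


def keep_best_fraction(field_preds, forward_errors, frac=0.05):
    """Retain only the fraction of shots with the lowest forward-model
    error, as a shot-by-shot quality cut (names illustrative)."""
    k = max(1, int(len(forward_errors) * frac))
    idx = np.argsort(forward_errors)[:k]   # indices of the most trusted shots
    return field_preds[idx], idx
```

Returning the surviving indices as well lets the same cut be applied to any per-shot quantity (pump-probe signal, pulse energy, etc.) so the retained data stay aligned.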
While the iterative method used in Refs. [29,30] does not provide estimates of the phases, we can compare the NN and iterative performance on the field amplitudes. The NN approaches achieve higher accuracy than the iterative method, with the L mae NN's median amplitude errors more than a factor of 3 smaller, and the PINN's errors 20% smaller than the iterative method. As expected, the NN is also many orders of magnitude faster, capable of processing millions of shots in a second (see Appendix, Section C). It may also be possible to combine the NN and iterative methods, using the NN as a starting guess for the iterative process; this should allow faster convergence than starting from a random guess, while providing strictly equal or better performance compared to the NN. Similar ideas have been implemented in other phase retrieval problems [35,51].

Conclusion and future outlook
We have presented a method to recover the full field, phase and amplitude, shot-by-shot for XFEL pulses, opening a path towards coherent-control experiments. We describe two possible architectures, one trained on standard labeled data, and one trained solely using a physical model, allowing that version to be learned directly from unlabeled experimental data. Both methods recover the amplitude with higher accuracy than the existing iterative method (by a factor of 3 for the labeled NN). With sufficient diagnostic resolution, the proposed NNs also have the ability to recover the phase, which has yet to be demonstrated for multi-spike SASE XFEL pulses by any method. Finally, the NNs have the potential to run in real time for MHz repetition rate machines.
Additional applications of the concept may be found for advanced operating modes, for example to resolve microbunching-instability sidebands in seeded FELs [52][53][54][55][56], which can then be accounted for during data analysis [57]. The examples given in this work are for soft X-ray parameters, and though there is no hard limit, the performance is expected to degrade as the number of modes increases and the XTCAV resolution decreases at shorter X-ray wavelengths.
On the other hand, seeded FELs have long coherence lengths, and an XTCAV likely would have sufficient resolution for application to hard X-ray self-seeding [58][59][60][61].
One advantage of the NN approach is the computational efficiency, which makes it well-suited to high repetition-rate XFELs. However, LCLS time-domain measurements currently rely on an XTCAV which runs at only 120 Hz, limited in rate both by the need for strong X-band streaking and the damage threshold of the electron YAG screen. Taking full advantage of the NN speed will require development of new diagnostics. For example, replacing the X-band cavity with a passive dechirper would avoid the need for a high-rate, high-voltage X-band source [62]. Alternatively, bending radiation can reveal microbunching imprinted on the beam by the FEL process [63], avoiding the need for an electron screen. As described in the introduction, direct measurement of X-rays is possible for short pulses [12][13][14][15][16]. Nearly all of these methods could also benefit from including the diagnostics analysis itself as part of the machine learning pipeline; for example, the NN could take XTCAV image data directly as input and treat the XTCAV analysis as part of the forward model [19], or for attosecond pulses the NN could take as inputs the X-ray diagnostics [15,16]. Ultimately, verification of any XFEL phase prediction will require development of phase-sensitive experiments.

Appendix A. Additional examples of reconstructions
To give a sense of the quality of reconstructions associated with various levels of field errors, we show additional examples using both models. Figure 5 shows three additional examples using the $L_\mathrm{mae}$ NN: Row 1 (field error = 0.11) and row 2 (error = 0.31) give the median and worst-case values respectively for the 5% of shots with the lowest forward model error [i.e., the red points in Fig. 4(a)]. Row 2 is also equivalent to the 90th percentile worst case for all pulses. Row 3 (error = 0.6) is the 99th percentile worst case for all pulses. Figure 6 shows three examples using the PINN: Row 1 (error = 0.14) is the median value for a model trained with perfect diagnostic resolution ($R_t = R_\nu = 0$). Row 2 (error = 0.3) is the median error for the top 5% of shots with the lowest forward model error in Fig. 4(a). Row 3 (error = 1.0) is the 90th percentile worst case for all pulses. The high-quality predictions for the model trained with perfect diagnostics [Table 1 and Fig. 6(b)] suggest that the phase errors in models with realistic diagnostic resolution [e.g., Fig. 6(f)] are driven by the resolution and not inherent phase ambiguity.

Appendix B. Neural network details
The models presented in the paper use the same inputs and outputs, as well as the same architecture, with the only differences in the loss function and regularization. Each input consists of the time-domain power ($n_p = 180$ points) concatenated with the spectral power ($n_s = 200$ points), for a total length of 380 points per example. The outputs consist of the real and imaginary components of the field in the time domain, with a total of 200 points ($p = 100$) for each example. To improve the spectral resolution we zero-pad the simulated time-domain field to make the time window 5-fold larger. The networks contain four convolutional layers, each with 20 filters and a stride of two, and nine fully-connected layers, each with 200 neurons. All activations are rectified linear units (ReLU) except for the output layer, which has linear activation. The losses described in the main text (including the forward model) are implemented as custom loss functions written in PyTorch. See Table 2 for architecture details.
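A PyTorch sketch of this architecture is below; the kernel size and the flattening between the convolutional and fully-connected stages are assumptions not specified in the text:

```python
import torch
import torch.nn as nn


class PulseNet(nn.Module):
    """Sketch of the described architecture: four stride-2 convolutional
    layers with 20 filters each, nine 200-neuron fully-connected layers,
    ReLU activations throughout, and a linear output layer. Kernel size
    and layer ordering are assumptions, not taken from the paper."""

    def __init__(self, n_in=380, n_out=200, kernel=5):
        super().__init__()
        convs, ch = [], 1
        for _ in range(4):
            convs += [nn.Conv1d(ch, 20, kernel, stride=2, padding=kernel // 2),
                      nn.ReLU()]
            ch = 20
        self.convs = nn.Sequential(*convs)

        n_flat = 20 * ((n_in + 15) // 16)  # length after four stride-2 layers
        fcs, width = [], n_flat
        for _ in range(9):
            fcs += [nn.Linear(width, 200), nn.ReLU()]
            width = 200
        self.fcs = nn.Sequential(*fcs)
        self.out = nn.Linear(200, n_out)   # linear activation on the output

    def forward(self, x):
        h = self.convs(x.unsqueeze(1))     # (batch, 1, n_in) -> conv features
        h = self.fcs(h.flatten(1))
        return self.out(h)
```

With 'same' padding, each stride-2 layer halves the sequence length (rounding up), so the 380-point input is reduced to 24 points across 20 channels before the fully-connected stack.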
To reduce over-fitting and match typical experimental noise, both inputs include a Gaussian noise layer of 2% rms noise. The standard NN also uses dropout of 0.05 on the first fully-connected layer. The PINN did not benefit from additional regularization. The training consisted of 3000 epochs with a mini-batch size of 1000 and learning rate of $3 \times 10^{-4}$, followed by 3000 epochs with a mini-batch size of 5000 and learning rate of $1 \times 10^{-4}$. All training used the Adam optimizer with the default PyTorch beta settings of 0.9 and 0.999. We did only rough optimization of hyper-parameters (regularization, network width and depth, mini-batch size), and improved performance should be possible from additional optimization. The PINN loss improved by 10% after doubling the number of training samples from 50k to 100k, suggesting additional gains are possible from a larger training set. For the PINN, we apply additional constraints on the coherence length ($g_1$) and the smoothness of the solutions. The smoothness constraint penalizes the gradient of the prediction,

$$L_s = \frac{1}{n}\sum_{i=1}^{n}\sum_{t}\left| y_{\mathrm{pred},i}(t+1) - y_{\mathrm{pred},i}(t) \right|,$$

where $t$ is the time index and the sum is over all $n$ examples in the minibatch. The total loss function for the PINN is then

$$L_\mathrm{PINN} = L_\Phi + \lambda_{g_1} L_{g_1} + \lambda_s L_s.$$

The PINN results in the paper use $\lambda_{g_1} = 0.1$ and $\lambda_s = 1$ for the first 3000 epochs, and either $\lambda_s = 0.1$ for the final 3000 epochs or $\lambda_s = 0.01$ (for the perfect resolution case). As with other hyperparameters, we expect performance gains are possible with careful optimization. For the combined model we use

$$L_\mathrm{combined} = L_\mathrm{PINN} + \lambda_\mathrm{mae} L_\mathrm{mae}.$$

Using $\lambda_\mathrm{mae} = 0.5$, we find the combined model has smaller errors on amplitudes (1.0 compared to 1.3 amplitude error), but worse performance on phase (2.1 compared to 0.8 sine error). The combined model outperforms the PINN on both metrics, but loses the benefit of unlabeled training.
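The smoothness penalty and the weighted combination of loss terms can be sketched as follows; the additive weighting mirrors the description above, and the function names are illustrative:

```python
import torch


def smoothness_loss(y_pred):
    """Penalize the gradient of the prediction along the time axis
    (a sketch of the smoothness term L_s described in the text)."""
    return torch.mean(torch.abs(y_pred[:, 1:] - y_pred[:, :-1]))


def total_pinn_loss(L_phi, L_g1, L_s, lam_g1=0.1, lam_s=1.0):
    """Weighted total PINN loss, using the reported weights as defaults;
    the simple additive combination is an assumption."""
    return L_phi + lam_g1 * L_g1 + lam_s * L_s
```

In practice the weights can be scheduled over training, as done above with $\lambda_s$, simply by passing different values per epoch.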