Rapid Spectral Parameter Prediction for Black Hole X-Ray Binaries using Physicalised Autoencoders

Black hole X-ray binaries (BHBs) offer insights into extreme gravitational environments and the testing of general relativity. The X-ray spectrum collected by NICER offers valuable information on the properties and behaviour of BHBs through spectral fitting. However, traditional spectral fitting methods are slow and scale poorly with model complexity. This paper presents a new semi-supervised autoencoder neural network for parameter prediction and spectral reconstruction of BHBs, showing an improvement of up to a factor of 2,700 in speed while maintaining comparable accuracy. The approach maps the spectral features from the numerous outbursts catalogued by NICER and generalises them to new systems for efficient and accurate spectral fitting. The effectiveness of this approach is demonstrated in the spectral fitting of BHBs and holds promise for use in other areas of astronomy and physics for categorising large datasets.


INTRODUCTION
Black hole X-ray binaries (BHBs) consist of a black hole (BH) that accretes matter from a nearby companion star, releasing substantial amounts of energy as X-ray photons through gravitational potential energy loss (Novikov & Thorne 1973).
Two properties, mass (M) and dimensionless spin (a*), uniquely define a BH in general relativity via the Kerr metric. The mass of the BH can be determined empirically using the mass function of a binary system and the radial velocity (Casares & Jonker 2014). The value of a* can range in magnitude from zero, representing a non-spinning Schwarzschild hole, to one, representing a maximally rotating Kerr hole (Remillard & McClintock 2006), with negative values representing a BH counter-rotating relative to the accretion disk (Shapiro & Lightman 1976). However, there is a practical maximum limit of |a*| ≃ 0.998 caused by a counteracting torque from the photons captured by the BH, as photons with negative angular momentum have a larger capture cross-section than photons with positive angular momentum (Thorne 1974).
BHs have two characteristic radii: the event horizon, where the radial escape velocity equals the speed of light, making all information inside inaccessible to outside observers, and the innermost stable circular orbit (ISCO), which is the smallest stable orbit for orbiting matter, usually in the form of an accretion disk (Remillard & McClintock 2006; Heinicke & Hehl 2015). As a* becomes larger and more positive, the event horizon and the ISCO shrink, and the ISCO radius approaches the event horizon radius.
★ E-mail: et1g19@soton.ac.uk
The gas accreted from the companion star circularises and, via viscous drag forces, dissipates energy while exchanging angular momentum, forming an accretion disk which, during soft states (section 1.1), is observed to extend to the ISCO, or very close to it (Cui et al. 1997; Davis et al. 2006).
X-ray spectral analysis provides important constraints on the properties and evolving characteristics of the accretion flow as well as the BH's properties, such as the mass accretion rate ṁ and spin a* (e.g., Fabian et al. 2014; Zhao et al. 2021).

Black Hole X-Ray Binary Spectra
As seen during outburst episodes, in which the inflow of matter onto the BH evolves over time, BHBs show three primary X-ray spectral continuum components. Figure 1 shows an energy unfolded spectrum (green plus) fitted with the model TBabs(simplcut⊗ezdiskbb) (red line) from the PyXspec software (Arnaud 1996; Gordon & Arnaud 2021); see section 2. (1) The thermal component (ezdiskbb, yellow dashed) is modelled by a ∼1 keV-peaked multi-temperature disk model (MTD) produced by a geometrically thin, optically thick accretion disk. (2) The nonthermal component (simplcut, blue dots) is a power law produced by thermal photons Compton upscattering off hot electrons (kT_e ∼ 100 keV) in an optically thin, or marginally thick, corona. (3) The power-law component (blue dotted) extends to high energies and may exhibit a break or an exponential cutoff. (4) The spectrum may also contain an additional nonthermal component referred to as reflection (not shown in Figure 1), which is produced when Comptonised coronal photons illuminate the accretion disk and are reprocessed at its surface, generating a spectral bump between 20 and 30 keV (Esin et al. 1997; Guilbert & Rees 1988; Remillard & McClintock 2006; Done et al. 2007; Gilfanov 2010; Ingram & Motta 2019). The reflection of high-energy photons produces fluorescent excitation of elements in the accretion disk, with the Fe-K emission line at ∼6.4 keV being the most prominent. Relativistic effects from the spin of the BH are imprinted on the skewed profile of the resulting Fe-K emission line (Fabian et al. 1989; Bambi 2013).
BHBs can demonstrate significant variations throughout an outburst event (typically lasting many months from start to finish), and the prominence of the different components can change dramatically depending on the system's state (Remillard & McClintock 2006). BHBs spend most of their time in quiescence at exceedingly low luminosities; however, during outbursts, they undergo rapid and dramatic changes in brightness and accretion state (Remillard & McClintock 2006; Dunn et al. 2010). The accretion states follow several possible nomenclatures; broadly, five states are commonly identified. There is a hard low-luminosity state (LS) where the corona dominates the spectrum, a high soft state (HS) where the thermal blackbody is dominant, a very high state (VHS) which exhibits both thermal emission and a strong steep power law, intermediate states (INT) between LS and VHS/HS, and finally the aforementioned quiescent state (Esin et al. 1997; Done et al. 2007; Remillard & McClintock 2006).

Spectral Fitting
The current spectral fitting approach involves convolving models with instrumental responses and brute-force curve fitting to achieve an acceptable fit with data (Arnaud 1996).However, analysis with this approach will become increasingly challenging as next-generation microcalorimeters (Tashiro et al. 2020), survey instruments (Kraft et al. 2022), and fast timing instruments (Ray et al. 2019) produce high volumes of spectral data and unprecedented spectral resolution, requiring increased modelling complexity.
de Plaa et al. (2019) identify challenges and improvements of current methods. The main challenge they identified is that increased resolution reveals more spectral lines, requiring more complex atomic models that are difficult to accelerate without a loss in accuracy. Another challenge is the storage and loading of response files, which can readily become limiting, as high-resolution spectra require significantly larger response matrices to encode the instrumental performance. They implemented new file structures and binning methods to reduce file sizes, improve parallel performance, and reduce execution times; however, in certain domains, machine learning could remove the need for response files in spectral fitting altogether, outside of training.
In a pathfinder effort, we investigate using an autoencoder architecture inspired by the success of neural networks in image and signal processing (Krizhevsky et al. 2017; Oord et al. 2016; Baldi 2012). Our approach aims to predict spectral parameters more efficiently.
The previous work most similar to ours is that of Parker et al. (2022), who implemented a simple neural network to predict active galactic nuclei spectral parameters from simulated spectra of the X-ray Athena survey instrument (Meidinger 2018). Their results show that neural networks can significantly reduce computational time while obtaining similar accuracies. They also show that neural networks can avoid the local minima in which traditional spectral fitting can become stuck.

Machine Learning
Machine learning is a broad set of optimisation algorithms that minimise a score, known as a loss function, which measures the discrepancy between the algorithm's outputs and the ground truth (Wang et al. 2022). The three main learning methods are supervised, unsupervised and semi-supervised learning.
Supervised learning involves classifying or predicting values and requires prior knowledge of the ground truth labels for training. While supervised learning can provide physical outputs, generating labels can be computationally expensive and require human input.
Unsupervised learning helps find patterns, group data, or learn low-dimensional data representations. It can take advantage of large quantities of unlabelled data; however, the outputs from the network can be hard to interpret.
Semi-supervised learning aims either to improve supervised learning by extending the labelled dataset with labels inferred from unlabelled data or, for unsupervised learning, to inform the network of pre-existing groups, improving the accuracy and physical interpretation of the network (Van Engelen & Hoos 2020; Zhu 2005).

Neural Networks
Neural networks are a type of machine learning algorithm formed from differentiable functions with free weights, called neurons. A group of neurons forms a layer, which can be of different types, such as linear, convolutional and recurrent. The weights within a neural network are tuned by propagating the loss back through the network using differentiation to minimise the loss (Rumelhart et al. 1986).
Linear layers are composed of neurons whose output a_j is the weighted sum of the outputs of the neurons in the previous layer, a_i, as shown in equation 1, where b_j is the bias of the j-th neuron and w_ij is the weight of a_i for the j-th neuron (Svozil et al. 1997):

a_j = b_j + Σ_i w_ij a_i. (1)
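As an illustration of this weighted sum, the following minimal sketch (with arbitrary toy sizes, unrelated to the actual network) computes a linear layer's output both as a matrix product and as the explicit per-neuron sum:

```python
import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.normal(size=5)      # outputs of the previous layer, a_i
W = rng.normal(size=(5, 3))      # W[i, j]: weight of input i for neuron j
b = rng.normal(size=3)           # bias b_j of each of the 3 neurons

# Vectorised form of the weighted sum
a = a_prev @ W + b

# Equivalent explicit per-neuron sum
a_explicit = np.array(
    [b[j] + sum(W[i, j] * a_prev[i] for i in range(5)) for j in range(3)]
)
assert np.allclose(a, a_explicit)
```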
Convolutional layers learn spatial features by sliding a kernel of weights across an input. Figure 2 shows the convolution of the input (blue) with a 3 × 3 kernel of weights (orange) to produce the output (pink) by calculating the weighted sum of the inputs within the receptive field of the kernel at each step (Li et al. 2022). The stride defines the number of values the kernel moves between each step, so Figure 2 shows a stride of one. A convolutional layer can include several kernels, each optimising for a different feature set; therefore, the output can contain several channels of different feature maps.
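A one-dimensional analogue of this operation (simplified to a single channel, with an arbitrary kernel) shows how the stride controls the output length:

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid cross-correlation: weighted sum over the kernel's receptive
    field at each step, moving `stride` values between steps."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n_out)])

x = np.arange(8, dtype=float)
kernel = np.array([-1.0, 0.0, 1.0])   # arbitrary difference-like kernel

out1 = conv1d(x, kernel)              # stride 1: output length 6
out2 = conv1d(x, kernel, stride=2)    # stride 2: output length halves to 3
assert len(out1) == 6 and len(out2) == 3
```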
Recurrent layers were introduced by Hopfield (1982) to learn sequential relationships from an input x, where the i-th neuron in the layer produces an output, h_i, from the weighted sum of one value from the input, x_i, and the output, h_{i−1}, of the (i−1)-th neuron, which processed the previous value of the input sequence, x_{i−1}. However, recurrent layers can suffer from vanishing gradients, in which the gradients go to zero, or conversely exploding gradients, which diverge to large values. Cho et al. (2014) proposed gated recurrent units, a modified recurrent network which efficiently improves gradient flow back to the input.
Stacking multiple layers allows neural networks to learn complex non-linear problems; however, this requires activation functions to introduce non-linearity between layers; otherwise, multilayer networks would collapse to a single layer (Rumelhart et al. 1986). The rectified linear unit activation function, f(x) = max(0, x), and its derivatives have become the most commonly used activation functions due to their success in deep neural networks (Glorot et al. 2011); however, due to the zero gradient for negative values, rectified linear units can become dead, preventing the loss from propagating to the neurons closer to the input. Exponential linear units modify the rectified linear unit to allow negative values, shown in equation 2 with α > 0, preventing dead rectified linear units and improving training times (Clevert et al. 2015):

f(x) = x if x > 0; f(x) = α(e^x − 1) if x ≤ 0. (2)

Scaled exponential linear units further modify the exponential linear unit by setting α ≈ 1.6733 and scaling the output by λ ≈ 1.0507 to self-normalise the outputs and improve the stability of training (Klambauer et al. 2017).
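These activation functions can be written directly from their definitions (using the α and λ values quoted above):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    """Exponential linear unit: negative inputs map to alpha*(exp(x) - 1),
    avoiding the dead-unit problem of the ReLU."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6733, lam=1.0507):
    """Scaled ELU: this alpha and lambda choice self-normalises the outputs."""
    return lam * elu(x, alpha)

assert relu(-3.0) == 0.0 and relu(2.0) == 2.0
assert abs(float(elu(-10.0)) + 1.0) < 1e-3   # saturates at -alpha
assert abs(float(selu(1.0)) - 1.0507) < 1e-9
```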

Autoencoders
Figure 3. The general form of an autoencoder takes an input x, encodes it (blue) to a latent space z (pink), and reconstructs the input x̂ using a decoder (orange). The weights within the autoencoder are trained to minimise the difference between x and x̂; the autoencoder therefore learns to encode in the latent space the most useful information required to reconstruct the input.

Autoencoders are a type of unsupervised method that compress high-dimensional data to a lower-dimensional latent space using an encoder and then reconstruct the input from the latent space using a decoder, as illustrated in Figure 3. The autoencoder is forced to encode the most important information into the latent space, and if the autoencoder is a single layer and the latent variables are orthogonal, then this is equivalent to principal component analysis (Ladjal et al. 2019). One drawback of autoencoders is that their latent space is generally non-physical; therefore, obtaining useful information from it can be challenging (Baldi 2012).
Several techniques have been proposed for extracting useful information from the latent space. For example, one approach is to use a portion of the latent space as variables that can be passed into a physical model, forcing those parameters to be physically grounded, as suggested by Boone (2021). Another technique is to train a classifier on the latent space using semi-supervised learning, as utilised by Gogna & Majumdar (2016); Zemouri (2020). Alternatively, Ntampaka & Vikhlinin (2022) designed their architecture to construct a physical latent space implicitly through physical reasoning about the latent structure and found that the latent variable correlated with the expected physical parameter.

METHOD
In this study, we demonstrate a proof-of-concept neural network for predicting the parameters of BHB X-ray spectra, using a simple model as a starting point to test the feasibility of the network. Specifically, we use the model TBabs(simplcut⊗ezdiskbb) (Wilms et al. 2000; Steiner et al. 2017; Zimmerman et al. 2005) in the PyXspec software.
The ezdiskbb model is an MTD radiation model that describes the thermal emission of accretion disks, subject to the constraint of zero torque at the disk's inner edge (Zimmerman et al. 2005). The model has two parameters: the maximum accretion disk temperature (T_disk) and its normalisation (N), the latter of which constrains the disk inner radius when several properties of the system, such as the disk inclination, source distance, and spectral hardness, are accounted for (along with relativistic corrections). We keep T_disk and N free for the network to learn.
simplcut is a convolution kernel that Compton upscatters a proportion of the MTD thermal photons in the hot corona. Reprocessed coronal photons, which are reflected off the accretion disk, can be accounted for self-consistently, conserving photons (Steiner et al. 2009). The model has four parameters: the photon index (Γ); the fraction of thermal photons scattered in the corona (f_sc); the proportion of Compton-scattered photons that are reprocessed in the accretion disk (f_refl) versus reaching the observer directly; and the temperature of electrons in the corona (kT_e). We set f_refl = 1 and kT_e = 100 keV as fixed parameters, allowing Γ and f_sc to be free parameters.
The TBabs model describes X-ray absorption along the line of sight in the interstellar medium. It has a single free parameter, the hydrogen column density (N_H) (Wilms et al. 2000).
Our study aims to develop an encoder to generate spectral parameters from input spectra. We enforce physical constraints on the latent space, overcoming the limitations of an autoencoder's non-physical latent space, by training the autoencoder in two steps. The first is to train the decoder on labelled synthetic data using supervised learning, where the input is spectral parameters and the output is reconstructed spectra. Then, we train the encoder with unlabelled observations and the trained decoder, which conditions the latent space on the physical parameters.
Incorporating this physical constraint on the latent space gives our model two advantages. Firstly, the encoder can learn to encode the physical parameters directly from the input spectra without first learning abstract representations. Secondly, the decoder can rapidly create synthetic spectra given a set of physical parameters, which could be used to interpolate grids generated by computationally expensive models, such as XSTAR (Bautista & Kallman 2001), bypassing the model by emulating its output to accelerate fitting times.
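The two-step scheme can be illustrated on a deliberately toy linear problem (all sizes and data here are synthetic stand-ins, not the networks used in this work): first fit a decoder from labelled parameters to spectra, then train an encoder through the frozen decoder using only a reconstruction loss, which conditions the latent space on the physical parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: 3 "physical parameters" map linearly to 10-bin "spectra".
true_map = rng.normal(size=(3, 10)) * 0.3
params = rng.normal(size=(200, 3))           # labelled synthetic parameters
spectra = params @ true_map                  # corresponding synthetic spectra
obs = rng.normal(size=(50, 3)) @ true_map    # unlabelled "observations"

# Step 1: supervised decoder, parameters -> spectra.
D = np.zeros((3, 10))
for _ in range(500):
    D -= 0.1 * params.T @ (params @ D - spectra) / len(params)

# Step 2: encoder trained through the frozen decoder on unlabelled spectra;
# minimising the reconstruction error forces the latent variables to act
# like the physical parameters the decoder expects.
E = np.zeros((10, 3))
for _ in range(5000):
    resid = obs @ E @ D - obs
    E -= 0.05 * obs.T @ (resid @ D.T) / len(obs)

mse = np.mean((obs @ E @ D - obs) ** 2)
assert mse < 0.05 * np.mean(obs ** 2)        # near-perfect reconstruction
```

The decoder's weights stay fixed in step 2, so the only way the encoder can reduce the reconstruction loss is to output values the decoder interprets as the correct physical parameters.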

Data and Preprocessing
We obtained energy spectra for approximately 25 BHB systems from the Neutron Star Interior Composition Explorer (NICER). NICER is a soft X-ray timing spectroscopic mission whose main instrument consists of 52 active detectors, covering an energy range of 0.2 < E < 12 keV with an energy resolution of ∼100 eV (Gendreau et al. 2012). NICER's high temporal resolution and large collecting area enable the collection of useful spectra for BHBs (often 10^4–10^7 counts) in short observations of ∼1000 s, without being susceptible to pile-up effects, allowing for flexible scheduling to collect large quantities of spectra.
We selected the entire catalogue of NICER BHB data as of August 2022, with an initial screening to remove any data overly contaminated by elevated background counts. We use each remaining continuous stare to produce a spectrum, yielding a total dataset of 10,800 spectra. We used a custom pipeline built on many components of the NICER data and analysis software suite (NICERDAS) to process the raw data, which provided us with the counts per channel for the spectrum and background and the corresponding redistribution matrix and auxiliary response matrix. Additionally, we performed the following steps to standardise the data to ensure the neural network could learn the spectral features effectively:
(i) Exposure normalisation – Normalise the counts and background by the exposure time to obtain the detection rate per channel.
(ii) Background correction – Correct the spectrum by subtracting the background rate. The 3C50 background model was used throughout (Remillard et al. 2022).
(iii) Detector normalisation – Normalise the spectrum by the number of active detectors during the observation.
(iv) Binning – Because the channel energy resolution is higher than the detector energy resolution, we summed the data using the following bin sizes: 30 eV bins at 0.2 keV, 40 eV bins at 2.48 keV, 50 eV bins at 6 keV, and 60 eV bins up to 12 keV. We then standardise the data to spectral density by dividing the spectrum by the width of each bin.
(v) Energy range – Limit the energy range to 0.3 < E < 10 keV, over which NICER is most sensitive and which corresponds to a typical spectral analysis range.
We adopt a fixed binning to ensure a consistent data size of 240 bins for each spectrum.
However, the intensity of an individual spectrum and the differences between spectra can span many orders of magnitude, leading to challenges with network training given the dynamic ranges at hand. When using backpropagation to update the weights in the network, large inputs may result in the exploding gradient problem.
To address this, we transformed the spectra into logarithmic space and normalised all spectra between zero and one based on the amplitude range of the synthetic spectra. We also take the logarithm of the spectral parameters, except for the photon index Γ, and normalise them to follow a normal distribution based on the synthetic parameters (Sola & Sevilla 1997).
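The standardisation steps above can be sketched as follows (all numbers are illustrative placeholders; the real pipeline uses NICERDAS products and the synthetic-set amplitude range for normalisation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative stand-ins for one observation's pipeline products.
counts = rng.poisson(5000.0, size=240).astype(float)
background = rng.poisson(500.0, size=240).astype(float)
exposure, n_detectors = 1000.0, 50            # seconds, active detectors
bin_widths = np.full(240, 0.04)               # keV per bin (simplified)

# (i)-(iii): exposure normalisation, background subtraction,
# and detector normalisation.
rate = (counts - background) / exposure / n_detectors
# (iv): convert to spectral density by dividing by the bin widths.
spectrum = rate / bin_widths

# Log-transform, then normalise to [0, 1]; here the spectrum's own range
# stands in for the amplitude range of the synthetic set.
log_spec = np.log10(np.clip(spectrum, 1e-10, None))
lo, hi = log_spec.min(), log_spec.max()
normed = (log_spec - lo) / (hi - lo)
assert normed.min() >= 0.0 and normed.max() <= 1.0
```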

Synthesising Data
We used the fakeit function in PyXspec to create synthetic spectra for training, based on the model TBabs(simplcut⊗ezdiskbb), with a fiducial exposure time of 1 ks. For each set of 100 synthetic spectra, we randomly select a spectrum from the real dataset and use its background, redistribution matrix, and auxiliary response files as the settings for the synthetic spectra. We then randomly assigned values for each parameter in the model for each synthetic spectrum from either a logarithmic or a uniform distribution. The parameters sampled from a logarithmic distribution were 0.025 < T_disk [keV] < 4, 10^−2 < N < 10^10, 10^−3 < f_sc < 1, and 0.005 < N_H [10^22 cm^−2] < 75, while we sampled 1.3 < Γ < 4 from a uniform distribution. We generate two synthetic spectra datasets, with and without noise, as the decoder does not need to learn to reproduce noise; however, training the encoder with Poisson noise can help improve its robustness.
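The parameter draws can be reproduced with a simple log-uniform sampler (a sketch of the sampling scheme only; the anti-bias weighting described below is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_uniform(low, high, size):
    """Sample uniformly in log10 space between low and high."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size)

n = 1000
t_disk = log_uniform(0.025, 4.0, n)    # maximum disk temperature [keV]
norm = log_uniform(1e-2, 1e10, n)      # ezdiskbb normalisation
f_sc = log_uniform(1e-3, 1.0, n)       # scattered fraction
n_h = log_uniform(0.005, 75.0, n)      # column density [10^22 cm^-2]
gamma = rng.uniform(1.3, 4.0, n)       # photon index, sampled uniformly

assert t_disk.min() >= 0.025 and t_disk.max() <= 4.0
assert gamma.min() >= 1.3 and gamma.max() <= 4.0
```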
After generating each spectrum, we checked that the sum of the counts was above a threshold value of 10^4 and that there were no integer overflows due to the maximum count exceeding the maximum value of a signed 32-bit integer. However, this minimum threshold leads to a bias towards parameters that produce higher counts. To minimise this bias, we weighted the parameter probability distribution to favour undersampled values. We show in Figure 4 the distribution (diagonal histograms) and correlation (scatter plots) of synthetic parameters without anti-bias weighting (orange) and with anti-bias weighting (blue). The figure shows that the anti-bias weighting does help increase the undersampled values, particularly for T_disk, which sees an increase in counts of up to 50%, or 42% for the smallest values, and N, which sees an increase of up to 80%, or 74% for the smallest values. N_H also benefits from the anti-bias weighting, with the highest values showing an increase in counts of 9.8%. However, the correlation plot between N and T_disk shows that part of the parameter space is inaccessible under our imposed constraints. We repeated this process until we had generated 100,000 spectra and then applied the same preprocessing techniques mentioned in section 2.1.

Network Architecture
Figure 5 shows the architecture developed for this project. The encoder (blue dashed box, left) learns the mapping from spectra to spectral parameters, while the decoder (orange box, right) learns the inverse operation. Each column in the figure represents a layer in the network. We show linear layers as pink circles, convolutional layers as blue rectangles and recurrent layers as orange diamonds. Black arrows with a plus sign represent shortcut connections, where we add the outputs from the connected layers (He et al. 2016). The convolutional layers alternate between strided convolution, where the stride is two to downscale the input length by a factor of two, and regular convolution, where the input length is preserved (Springenberg et al. 2014). The recurrent layer is a two-layer bidirectional gated recurrent unit (Schuster & Paliwal 1997), with arrows depicting the flow direction; the recurrent layer's output is the average of the two directions.
We use scaled exponential linear unit activations in each linear layer and exponential linear units for convolutional and recurrent layers to introduce non-linearity. These activations help to prevent gradient instabilities during backpropagation by pushing the outputs of layers towards a mean of zero for exponential linear units (Clevert et al. 2015) or a normal distribution for scaled exponential linear units (Klambauer et al. 2017).
The first five layers of the decoder are linear because they enable non-linear mappings from parameter space to spectral space and expand the low-dimensional input to the high-dimensional output. The first two linear layers have 120 neurons, while the remaining three have 240 neurons each, balancing network accuracy and computational time, as layers closest to the input have less information to predict from, while deeper layers can learn more complex patterns. The shortcut connection across the three 240-neuron layers improves gradient flow during backpropagation and allows the network to minimise the importance of those layers if the extra complexity is unnecessary. The linear layer after the recurrent layer improves the network's accuracy by allowing it to weight the output from the recurrent features.
We use a two-layer bidirectional recurrent layer to learn the sequential relationships of spectra, as the bidirectionality helps the network link the behaviour between high and low energies instead of only providing unidirectional constraints. We use a gated recurrent unit over traditional recurrent layers, which suffer from vanishing gradients, and over long short-term memory (Hochreiter & Schmidhuber 1997), which is more computationally expensive and offers negligible performance improvements for small input sizes.
Similar reasoning applies to the design of the encoder; however, we use convolutional rather than recurrent layers to identify spatial relationships in spectra. Each convolutional layer has a kernel size of three and uses 16 filters, and the strided convolutions allow the convolution kernels to have different receptive fields. We further explain our network design choices in section 3.1.
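A PyTorch sketch of the recurrent decoder, following the layer sizes described above (the shortcut connection and other details are simplified, so this is an approximation of the architecture, not a reproduction of it):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the recurrent decoder: 5 spectral parameters -> 240 bins.

    Two 120-neuron and three 240-neuron linear layers with SELU
    activations, a two-layer bidirectional GRU whose two directions are
    averaged, and a final linear layer.
    """
    def __init__(self, n_params=5, n_bins=240):
        super().__init__()
        self.linear = nn.Sequential(
            nn.Linear(n_params, 120), nn.SELU(),
            nn.Linear(120, 120), nn.SELU(),
            nn.Linear(120, n_bins), nn.SELU(),
            nn.Linear(n_bins, n_bins), nn.SELU(),
            nn.Linear(n_bins, n_bins), nn.SELU(),
        )
        self.gru = nn.GRU(1, 1, num_layers=2, bidirectional=True,
                          batch_first=True)
        self.out = nn.Linear(n_bins, n_bins)

    def forward(self, p):
        x = self.linear(p).unsqueeze(-1)   # (batch, 240, 1) sequence
        y, _ = self.gru(x)                 # (batch, 240, 2 directions)
        y = y.mean(dim=-1)                 # average the two directions
        return self.out(y)

decoder = Decoder()
spectrum = decoder(torch.randn(8, 5))      # batch of 8 parameter sets
assert spectrum.shape == (8, 240)
```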
We used mean squared error (MSE) loss functions with the Adam optimiser (Kingma & Ba 2014) for stochastic gradient descent, with a 5 × 10^−4 learning rate and a weight decay of 10^−5; we found these produced the most stable training for the decoder and encoder. We initially considered using PyXspec fit statistics as the loss function to judge the encoder's performance; however, this approach proved inefficient due to the time cost of loading spectra with response files into PyXspec and the inability to accelerate the computation using graphics processing units (GPUs).
To further improve training convergence, we implemented a scheduler that reduces the learning rate when the loss plateaus. Specifically, if the validation loss did not decrease by more than 10% within ten epochs, the learning rate would decrease by a factor of two. This scheduler allows a higher initial learning rate to converge quickly towards the optimal solution, avoiding local minima, before reducing the learning rate to converge on the minimum.
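In PyTorch terms, this optimiser and scheduler combination looks roughly like the following (the model here is a placeholder, not our encoder):

```python
import torch

model = torch.nn.Linear(240, 5)        # placeholder stand-in network
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
# Halve the learning rate if the loss fails to improve by >10% (relative)
# within ten epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimiser, factor=0.5, patience=10, threshold=0.1, threshold_mode="rel")

x, target = torch.randn(32, 240), torch.randn(32, 5)
for epoch in range(5):
    optimiser.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimiser.step()
    scheduler.step(loss.item())        # scheduler watches the epoch loss
```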

Training
We split the supervised and unsupervised datasets into training and validation sets in a 9:1 ratio to prevent the network from overfitting the data. During training, we updated the network's parameters using backpropagation on the training data while assessing its performance on the validation data and tuning hyperparameters to improve its generalisation to new data.
We trained both halves of the autoencoder for 200 epochs. We then evaluated the accuracy of the trained encoder on the validation dataset using the reduced "PGStat" in PyXspec (a fit statistic for Poisson-distributed data with a Gaussian-distributed background). We use the 3C50 background model, which has systematic uncertainties that we assume are Gaussian; on top of this, as described in section 2.2, we randomise the background when training the decoder, introducing further stochasticity. Goodness-of-fit tests therefore have limited utility due to the shortcomings and imperfect calibration of the background model. Assuming a perfect model and data, the reduced PGStat should be ∼1; however, we do not expect to achieve this ideal value due to our choice of a simplistic model.
We compared the encoder's predictions under five different configurations: (i) the encoder's predictions alone, (ii) the encoder's predictions followed by ten iterations of PyXspec curve fitting, (iii) previously optimised fit parameters derived from brute-force PyXspec spectral fits, (iv) fiducial parameter settings, and (v) ten iterations of PyXspec curve fitting starting from the fiducial parameter settings. The fiducial parameters are the default parameter values in PyXspec; they are unrelated to the data, so they typically match it very poorly; accordingly, this evaluation sets a ceiling benchmark reduced PGStat as a reference point.

Network Interpretability
Neural networks are infamous for being black box functions; therefore, we utilise layer visualisation and saliency maps to understand how the network makes its predictions.
Layer visualisation helps us identify what the network is learning and what features each layer finds meaningful. During the design of the network's architecture, we visualised the output from each layer to identify its importance and how different layer types interact. For the fully connected layers closest to the latent space, the hidden outputs will not bear much physical resemblance to the data; however, we can view the mappings of hidden inputs to hidden outputs through the weights used. We can directly plot the outputs from the convolutional and recurrent layers, which should resemble higher-level spectral features.
We used saliency maps (Simonyan et al. 2013) to identify which inputs the network focuses on most. Saliency maps are the gradients of the loss function with respect to the inputs and are calculated using backpropagation, the same method used to update the network's weights; the magnitude of the gradients indicates the sensitivity of the loss to a particular input. We calculated saliency maps for both the decoder, where the inputs are parameters, and the full autoencoder, where the inputs are spectra.
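A saliency map can be computed with a single backward pass; here a stand-in network and a random spectrum illustrate the mechanics:

```python
import torch

# Stand-in encoder: 240 spectral bins -> 5 parameters (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(240, 64), torch.nn.ELU(), torch.nn.Linear(64, 5))

spectrum = torch.randn(1, 240, requires_grad=True)
target = torch.randn(1, 5)

# Backpropagate the loss to the *input* rather than the weights.
loss = torch.nn.functional.mse_loss(model(spectrum), target)
loss.backward()
saliency = spectrum.grad.abs().squeeze()   # sensitivity per spectral bin
assert saliency.shape == (240,)
```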

Model Development
We investigated several architectural designs and present the four main decoder architectures: (i) a six-layer linear network, (ii) a nine-layer convolutional network, (iii) a seven-layer recurrent network, and (iv) a ten-layer recurrent-convolutional network. After finding the optimal architecture, we performed further training and optimisation to obtain the best performance for the decoder.
Figure 6 shows the architectures. Following the design style of Figure 5, linear layers are pink circles, convolutional layers are blue rectangles, recurrent layers are orange diamonds, and black arrows with a plus sign represent shortcut connections. All layers within the pink dotted boxes comprise the linear network. In addition to the pink dotted box, layers within the blue dashed box comprise the convolutional network, and layers within the orange box comprise the recurrent network. The overall architecture shown comprises the recurrent-convolutional network. The linear and recurrent networks use a dropout (Srivastava et al. 2014) probability of 1%, and the convolutional and recurrent-convolutional networks use a dropout probability of 2%.
In constructing the linear network, we found that using several layers with a number of neurons equal to the number of spectral bins worked better than gradually increasing the number of neurons up to the spectral dimensionality; however, more neurons increase the chance of overfitting. We also observed that increasing the number of layers improved accuracy and training stability, but the accuracy gains plateaued while processing time and overfitting increased. We found that two 120-neuron and four 240-neuron linear layers provided the most accurate results on the validation dataset, balancing training accuracy and overfitting.
To reduce overfitting, we added dropout with a 1% probability to all layers except the first and last. Because those layers are critical for initially inferring information from the input and reconstructing the spectrum, they are particularly sensitive to dropout; therefore, we opted against its use there. Setting the dropout probability too high caused the network to struggle to learn and plateau too early, leading to poor validation accuracy and instability. Setting the probability too low led to overfitting, with high training accuracy but poor validation accuracy.
To construct the convolutional network, we utilised three convolutional layers with varying kernel sizes of 9, 5, and 3 to capture spatial features on different scales, with each layer having 16 filters to balance accuracy and computational time. These convolutional layers were placed before the final linear layer to improve the network's accuracy by allowing the linear layer to weight the output from the convolutional layers. Batch normalisation (Ioffe & Szegedy 2015) was applied only to the first convolutional layer to renormalise the hidden outputs and prevent vanishing or exploding gradients; however, we found this introduces training instabilities.
The shortcut connection between the convolutional and recurrent layers in the recurrent convolutional network allows the network to balance the two methods.
Table 1 shows the average MSE of five runs trained for 100 epochs on real data, and the training time for one epoch, for each decoder. Our best-performing network was the recurrent network, with the highest accuracy, joint second-best training time, and the most stable training. The linear network has the fastest training time but the lowest accuracy and the greatest variability in accuracy, and it often suffers from overfitting. The convolutional and recurrent convolutional networks obtain similar accuracies; however, the recurrent convolutional network is the slowest, while the convolutional network has the most consistent final loss but the largest fluctuations in loss per epoch.
The loss functions for one training run per network can be seen in Appendix A. We adopt the recurrent network trained for 200 epochs on synthetic data as our benchmark approach for the rest of the report. However, as the synthetic data has greater diversity and no noise, the network can generalise better; therefore, we reduce the dropout probability to 0.1% and remove dropout from one of the linear layers. The recurrent network has a final MSE of 1.06 × 10⁻⁵.
Next, we investigate six scenarios in which we compare supervised, unsupervised, and semi-supervised training methods for the encoder. We also compare the network's performance on synthetic and real spectra with fitted parameters. The three training scenarios are:
• Encoder trained with an MSE loss on the parameters, for a supervised approach.
• Encoder trained using the reconstruction loss from a previously trained decoder, for an unsupervised approach.
• A combination of the previous two, where the parameter loss has a relative weighting of 0.01, for a semi-supervised approach.
In the unsupervised and semi-supervised approaches, the previously trained decoder is used, which helps enforce the physical latent space; therefore, the overall autoencoder is still trained using semi-supervised learning. We test each training scenario with the encoder trained on synthetic and real spectra, and evaluate the performance by testing the fits with PyXspec to obtain the reduced PGStat. Because variations between real spectra can drastically change the reduced PGStat without reflecting the network's accuracy, we fixed the training and validation datasets for all scenarios. The variations between the datasets and the learning methods can cause the encoder to overfit or underfit; therefore, for real spectra, we use a dropout of 10% and a learning rate of 5 × 10⁻⁴, and for synthetic spectra, we use a dropout of 1% and a learning rate of 10⁻⁴. We apply dropout to all convolutional and recurrent layers; however, we only use dropout on one linear layer for real supervised and semi-supervised learning, and none for the rest.
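The semi-supervised objective above can be sketched as a weighted sum of the reconstruction loss (through the frozen, pre-trained decoder) and the supervised parameter loss, using the 0.01 relative weighting quoted in the text. The function and variable names below are illustrative, not the authors' code:

```python
import numpy as np

def semi_supervised_loss(spectrum, params_true, encoder, decoder,
                         param_weight=0.01):
    """Unsupervised reconstruction MSE through a frozen decoder,
    plus a down-weighted supervised MSE on the physical parameters."""
    params_pred = encoder(spectrum)
    recon = decoder(params_pred)            # frozen, pre-trained decoder
    recon_loss = np.mean((recon - spectrum) ** 2)
    param_loss = np.mean((params_pred - params_true) ** 2)
    return recon_loss + param_weight * param_loss

# Toy stand-ins for the two networks (a real encoder/decoder would be
# neural networks; these identities just exercise the loss):
encoder = lambda s: s[:5]
decoder = lambda p: np.tile(p, 24)[:120]
spec = np.ones(120)
loss = semi_supervised_loss(spec, np.ones(5), encoder, decoder)
```

Setting `param_weight=0` recovers the unsupervised scenario, and dropping the reconstruction term recovers the supervised one, so all three training scenarios share this form.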
Table 2 shows the median reduced PGStat of five runs trained for 100 epochs. The best performance for training the encoder is consistently achieved using real spectra, with an improvement factor of 1.2-5.3 over results obtained with synthetic spectra. For synthetic spectra, unsupervised learning performs the best, whereas supervised learning performs significantly worse, by a factor of 6.7. When using real spectra, the three approaches are more consistent, with semi-supervised learning achieving the best reduced PGStat and supervised learning performing the worst, by a factor of 1.9.
For the rest of the report, we use the encoder trained for 200 epochs on real data using semi-supervised learning. The final MSE for the reconstruction and latent space is 1.45 × 10⁻³, and the reduced PGStat is 62.7.

Model Evaluation and Analysis
We evaluated the saliency of the decoder and autoencoder on the validation dataset. The decoder's saliency produces five values per spectrum, indicating the gradient of the loss function with respect to each parameter of the spectral model. We take the average (μ) and standard deviation (σ) over the dataset and normalise them by the smallest μ, as only the relative weights are important, as shown in Table 3. Γ and f_sc show the smallest saliencies; therefore, these parameters have the lowest impact on the network's prediction.
In contrast, T_disk and N show the greatest saliencies, influencing the network's prediction the most; however, they also show the greatest variance. The autoencoder's saliency has the same dimension as the spectra, so we show four examples. We binned the encoder's saliency into 30 regions to visualise the overall trend more easily, as shown in Figure 7. The figure shows the reconstructed spectrum (orange) and saliency (green line) alongside the original spectrum (blue) and residuals (blue) compared to a zero line (black). The x-axis represents the energy in keV; the y-axis represents the scaled logarithmic counts for the spectra and residuals, and is arbitrary for the saliency. The horizontal line artefacts in the spectra of the bottom two plots arise because zero count values are assigned the lowest positive value before the logarithmic transform.
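Saliency here means the magnitude of the gradient of the loss with respect to each input (for the autoencoder) or each latent parameter (for the decoder). In practice this is computed with automatic differentiation in a single backward pass; the sketch below approximates the same quantity by central finite differences on a toy loss, purely to make the definition concrete:

```python
import numpy as np

def saliency(loss_fn, x, eps=1e-5):
    """Approximate |dL/dx_i| by central finite differences.
    A deep-learning framework would compute this with autograd."""
    grads = np.empty_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        grads[i] = (loss_fn(xp) - loss_fn(xm)) / (2 * eps)
    return np.abs(grads)

# Toy quadratic loss: the saliency of x_i is |2 * w_i * x_i|,
# so the second input dominates, as T_disk and N do in Table 3.
w = np.array([1.0, 10.0, 0.1])
s = saliency(lambda v: np.sum(w * v ** 2), np.ones(3))
```

Averaging these per-spectrum gradients over the validation set and normalising by the smallest mean gives the relative weights reported in Table 3.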
The autoencoder saliencies generally place the most importance on the MTD peak around 0-2 keV. For the softer spectra (left column and bottom right), the network also focused on the point where the power law begins to dominate the MTD flux, around 2 keV (top left and bottom right), or at higher energies in the 4-7 keV range (bottom left). Above 7 keV, the network generally places less significance on the values. For the harder spectrum (top right), the saliency is more evenly distributed throughout the energy range.
Next, we compared the parameter predictions obtained from the encoder with those obtained using traditional methods with PyXspec, as shown in Figure 8. The black line shows perfect agreement between the two methods, and the red points mark spectra for which at least one parameter is pegged at a minimum or maximum value, as these are generally less well constrained.
We calculate the reduced chi-squared (χ²_ν) between the encoder predictions and the parameters with uncertainties obtained from traditional fitting to compare the agreement for the five parameters. The χ²_ν values are 0.721 for N_H, 0.909 for Γ, 1.71 for f_sc, 29.4 for T_disk, and 21.9 for N. We use bootstrapping to calculate the 4σ lower limit of the χ²_ν null hypothesis, giving values of 124 for N_H, 38.4 for Γ, 48.4 for f_sc, 205 for T_disk, and 208 for N, all of which are significantly larger than the previously calculated χ²_ν; therefore, the null hypothesis can be rejected. The comparisons show that the MTD parameters were the least well modelled, while Γ, f_sc, and N_H have similar values. The power law parameters and T_disk show significant deviations due to hard limits at the upper and lower boundaries.
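The agreement statistic above can be sketched as follows. This is one plausible reading of the procedure, not the authors' code: χ²_ν measures the mean squared deviation between predictions and fitted values in units of the fit uncertainties, and a bootstrap null distribution (here built by shuffling the pairing between predictions and fits) provides the reference against which the observed χ²_ν is tested:

```python
import numpy as np

def reduced_chi2(pred, fit, fit_err):
    """Reduced chi-squared between encoder predictions and
    traditionally fitted values with uncertainties."""
    return np.mean(((pred - fit) / fit_err) ** 2)

def bootstrap_null(fit, fit_err, n_boot=1000, rng=None):
    """Null distribution of chi2_nu obtained by breaking the
    pairing between predictions and fits (illustrative scheme)."""
    rng = rng or np.random.default_rng(0)
    stats = [reduced_chi2(rng.permutation(fit), fit, fit_err)
             for _ in range(n_boot)]
    return np.asarray(stats)

fit = np.arange(1.0, 6.0)       # toy fitted parameter values
err = np.ones_like(fit)         # toy uncertainties
null = bootstrap_null(fit, err, n_boot=100)
```

If the observed χ²_ν lies far below the lower tail of the null distribution, the agreement between the two methods is unlikely to be coincidental.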
Figure 9 shows the distribution (diagonal histograms) and correlation (scatter plots) of the precalculated parameters (blue) and encoder predictions (orange). The encoder correctly predicts the parameter distributions, except for T_disk and N, for which it struggles to capture the distribution range at either extremum. The correlation plots reveal no significant correlations between the parameters, except for a strong negative correlation in logarithmic space between T_disk and N. The upper and lower limits of Γ and f_sc and the upper limit of T_disk produce spikes in the precalculated parameter distribution due to the hard limits imposed by the model. However, the encoder's predictions display tails instead of reflecting these hard limits.
The final part of this section compares different fitting methods using PyXspec to calculate the computational time and reduced PGStat for five scenarios: the encoder's predictions, the precalculated parameters, fiducial parameters for the model, the encoder's predictions with ten iterations of PyXspec fitting, and fiducial parameters with ten iterations of PyXspec fitting. Fitting using PyXspec is multithreaded using eight threads of an Intel i7-10750H central processing unit (CPU) (Intel 2020). The encoder's parameter prediction uses an NVIDIA GeForce GTX 1650 Ti mobile GPU (NVIDIA 2020). Table 4 shows the average time from five runs to generate the parameter predictions and the median reduced PGStat for the 1,080 spectra in the validation dataset for each fitting method. All optimisation methods achieve reduced PGStat values much lower than the fiducial threshold. The precalculated parameters have the lowest reduced PGStat but take the most computational time, having been fitted by brute force. On the other hand, the encoder's predictions have a significantly lower computational time, at the cost of a higher reduced PGStat. The encoder achieves an accuracy similar to ten fitting iterations; however, it improves the computational time by a factor of 2,700. If we instead run the encoder on the CPU, the computational time becomes 0.273 ± 0.008 s, 7.6 times slower than on the GPU but 360 times faster than PyXspec for similar accuracy. Using the encoder with ten fitting iterations brings the reduced PGStat close to the precalculated values. Achieving a similar reduced PGStat from the fiducial values requires 60 fitting iterations and takes 288 seconds, so the encoder provides a three-fold improvement in computational time.

Model Development
Table 1 shows that all decoders achieve similar accuracies, which could be due to their similar architectures, leading to the conclusion that linear layers are crucial in effectively mapping spectral parameters into corresponding spectra. Convolutional and recurrent layers provide accuracy improvements ranging from 10-30%; however, other factors such as training time, stability, and consistency must be considered when choosing an architecture.
The recurrent convolutional network architecture combines the recurrent and convolutional layers, which leads to improved accuracy over the linear network. However, this approach performs worse than either the recurrent or convolutional networks while being more computationally expensive. The recurrent convolutional network also has the joint highest variance in the final loss because it sometimes overfits, resulting in a significantly higher loss than on other runs. The degradation in performance may be due to the different layer types disrupting the features that other layers are learning, thereby breaking the spatial relationships learned by the convolutional layers and the sequential relationships learned by the recurrent layers.
The convolutional network provides a 12% improvement over the linear network and has the most consistent final loss of the four networks, but it also exhibits the most unstable training. We attribute this instability to the batch normalisation process, which increases the stochasticity during training. Meanwhile, the recurrent network architecture achieves the best performance, with a 32% improvement over the linear network, while having a competitive computational cost and stable training. Thus, we choose the recurrent network as our adopted optimal architecture.
In the next test, we evaluated different training methods. We observed that training on synthetic data resulted in the worst performance, while adding the unsupervised loss improved the encoder's accuracy. The poor performance of training on synthetic data, even though that dataset is larger and more uniform, could arise because the synthetic data is produced by a model that only approximates the spectrum of BHBs. The approximation is less accurate when BHBs are in harder states with more pronounced spectral reflection components, whose features and interplay are not accurately captured by our model, making the network less robust to real data. This poor robustness is more evident in the reduced PGStat for supervised learning on synthetic data: the encoder does well when predicting synthetic parameters but is markedly worse on real data. Unsupervised learning, on the other hand, did not perform as poorly, likely because the reconstruction requirement is a more challenging problem that acts as a regularisation term, forcing the network to generalise and preventing overfitting. The overfitting of supervised learning on synthetic data can also explain why semi-supervised learning performs worse than unsupervised learning for synthetic spectra but not for real spectra.
For real spectra, we observed that semi-supervised learning performed the best, which could be because the supervised loss reinforces the structure of the latent space while the unsupervised loss enforces that the parameters can still successfully reconstruct the spectrum. We found that, during unsupervised learning, the encoder predicted parameters outside the training data domain. We speculate that the encoder can exploit regions outside the training domain to achieve good reconstructions, but at the cost of losing the correspondence between the latent space and the physical parameters. In contrast, we observed that supervised learning could reproduce the target parameters well, but the reconstructions would be worse; for example, if T_disk and N were slightly underpredicted, the parameter accuracy could still be high, but the resulting spectrum's scale would be significantly underpredicted.

Model Evaluation and Analysis
The decoder's saliencies indicate that it mainly focuses on the MTD parameters, possibly due to biases in the model that better represent soft states where the MTD dominates. Alternatively, the decoder might find it more important to focus on low energies as they often contain more information (typically, the lower energy channels contain the most X-ray counts), or these parameters may simply be more challenging to predict.
Regarding the MTD, the autoencoder saliency shows that for soft spectra, the network focuses on the MTD peak around 0-2 keV, with a greater focus on the initial rise in counts at the lowest energies and on the fall where the spectrum transitions from MTD-dominant to power-law-dominant above the thermal disk's Wien tail, typically around 2 keV. Therefore, the network is capturing the shape and broadness of the spectrum; however, it places little emphasis on the peak's magnitude, suggesting that this does not provide additional information that cannot be inferred from the surrounding data. Soft spectra also show additional peaks in saliency along the power-law component. These peaks are usually in the 4-7 keV range, where knowing the amplitude may help the network infer the gradient and, therefore, the photon index; because the gradient is almost constant, the network does not need to focus strongly on this region.
For harder spectra, the saliency is more uniform, with peaks roughly every 2 keV, indicating that the network emphasises capturing the power-law component's shape. The Fe-K emission line at 6.4 keV is an intriguing feature: the decoder was trained solely on synthetic data that did not contain these lines, and the latent space contains no explicit information about their presence. We see that the network ignores this feature, suggesting that it does not substantially affect the parameters.
Comparing the predictions against the precalculated parameters in Figure 8, we observe that the predictions closely agree with the precalculated parameters, with N_H and the power law parameters in greater agreement than the MTD parameters. The large χ²_ν for N is likely due to the network's difficulty in predicting values spanning several orders of magnitude. As N and T_disk are strongly correlated, the network's difficulty with N could explain why T_disk also has a large χ²_ν. The worse performance for the MTD parameters also correlates with the larger decoder saliencies, suggesting that these parameters are harder for the network to learn. The power law parameters and T_disk show vertical boundary conditions: f_sc has hard boundaries at 0 and 1, while the boundaries for Γ and T_disk are more artificial, resulting in the twin peaks seen in Figure 9. These peaks, shown by red data points in Figure 8, create artificial degeneracies that could hinder the network's ability to learn the mapping. Alternatively, these peaks could be artefacts of the PyXspec fitting algorithm, with the fit getting stuck in local minima and locking these parameters to the boundaries instead of finding the optimal value. As a result, the network may be conflicted between predicting an accurate parameter far from the precalculated value or minimising the distance to the precalculated value at the cost of the reconstruction, hence the spread in predictions at these boundaries. The discrepancy between the predictions and precalculated parameters correlates with the relative magnitude of the peaks, further indicating an issue with the PyXspec fitting algorithm. If the problem lies within the PyXspec fitting algorithm, using an autoencoder could help avoid these local minima, potentially improving the accuracy of the fits, similar to what Parker et al. (2022) found.
We compared our network with several different fitting methods in the final test. Our encoder reduces the computational time by a factor of 2,700 compared to PyXspec, as it avoids the computational cost of loading the responses and takes advantage of GPU acceleration. Even when we use a CPU for the encoder, we still see a factor of 360 improvement for similar accuracy, demonstrating the advantage of this approach. However, our current encoder's accuracy is not yet comparable to the best fit from the precalculated parameters. To address this, we combined the two methods to see if our initial parameter predictions could reduce the computational time while maintaining similar performance. With ten iterations of PyXspec, we achieved similar accuracy to the precalculated parameters. Fitting from fiducial values required 60 iterations to achieve the same accuracy, a three-fold increase in computational time, even though ten iterations from the fiducial values achieved accuracy comparable to the encoder's raw predictions. We would expect the additional fitting required from fiducial parameters to be similar to that from the encoder's predictions; the difference could be due to PyXspec getting stuck in local minima that the encoder helps avoid. This suggests an alternative use case for the encoder: helping PyXspec start from a more suitable location in parameter space, reducing the number of iterations required to achieve high accuracy.

SUMMARY AND FUTURE IMPROVEMENTS
We have presented a novel approach to fitting spectra using semi-supervised autoencoders that reduces the computational time by a factor of 2,700 for establishing initial predictions and provides a three-fold speedup in achieving the best fit. The network can rapidly predict spectral parameters for large datasets to provide starting points for further analysis, reducing computational time and the chance of traditional fitting methods getting stuck in local minima. We also produce a decoder as a byproduct that can rapidly generate synthetic spectra for given spectral parameters. Notably, the current network is rather simplistic and still needs the assistance of traditional fitting to reach a reliable accuracy. For future work, we will test the network on new BHB data collected by NICER since the summer of 2022 to compare the performance on unseen BHB systems.
One of the main challenges we face is the difficulty of scaling, as spectra and parameters can vary over several orders of magnitude, leading to unstable learning. To address this, we currently take the logarithm of these values, which can reduce accuracy at high orders of magnitude. A potential solution is to normalise each spectrum by its minimum and maximum in linear space and have two decoder networks predict the spectrum's shape and its minimum and maximum values, respectively. Passing the scale parameters into a linear layer can ensure that the encoder considers scale when predicting parameters. This approach keeps the spectra in linear space, allowing the shape decoder to focus more on the high-count parts of the spectrum and improving robustness in the noisy low-count regions. However, there could still be significant uncertainty in the scale, as the scale decoder would still make its predictions in logarithmic space.
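The proposed normalisation could look like the following sketch: scale each spectrum to [0, 1] in linear space, keep its (min, max) as the separate scale targets a second decoder head would predict, and invert the transform afterwards. The function names are illustrative, not part of the described pipeline:

```python
import numpy as np

def normalise(spectrum):
    """Per-spectrum min-max scaling. Returns the shape in [0, 1]
    plus the (lo, hi) scale a second decoder head would predict."""
    lo, hi = spectrum.min(), spectrum.max()
    return (spectrum - lo) / (hi - lo), (lo, hi)

def denormalise(shape, scale):
    """Invert the min-max scaling to recover the linear-space spectrum."""
    lo, hi = scale
    return shape * (hi - lo) + lo

spec = np.array([3.0, 7.0, 11.0, 5.0])
shape, scale = normalise(spec)
restored = denormalise(shape, scale)
```

The shape decoder then only ever sees values in [0, 1], while the orders-of-magnitude variation is isolated in the two scale targets.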
Alternatively, we could use the Poisson uncertainty of the spectra in a weighted mean squared error loss function, with the data normalised between the global maximum and minimum in linear space. The weighted MSE could give the network leeway to match the uncertainty of the data points and improve training stability across several orders of magnitude, as the uncertainty would scale with the counts. However, the large scale could still prove too challenging to train with low-count spectra, whose values may be too small for the network to predict.
We would also like to explore the network's performance for different spectral models. We expect a negligible increase in computational time as model complexity increases, assuming we do not need to make the network deeper. Even if the network depth must increase for more complex models, we expect any corresponding increase in computational time to be significantly less than for traditional fitting. Therefore, we expect that for more complex spectral models, the speedups compared to brute-force fitting would be even more pronounced than reported here.
Another way to improve accuracy would be to train the model to fit either a single spectrum that the user wants a good fit for, or a dataset that the user wants parameters for, as opposed to training the network once on a general dataset and evaluating performance on the desired spectrum or dataset. This method would improve accuracy at the cost of computational time and reusability of the network. However, instead of training the network from scratch, we can use transfer learning (Bozinovski & Fulgosi 1976), where a generalised pre-trained network is fine-tuned on the desired dataset.
Further improvements to the network would be to predict uncertainties in the parameters or the probability distribution for each parameter. A simple method for obtaining the underlying parameter distribution would be to use Hamiltonian Monte Carlo (Duane et al. 1987), a faster version of the Markov chain Monte Carlo (Metropolis et al. 1953) method for differentiable functions (including neural networks). Another simple method would be to train multiple networks and assess the variance of their predictions; however, this would measure the precision of the network and not necessarily its accuracy. A further approach would be to use variational autoencoders (Kingma & Welling 2013), which predict the mean and variance of the parameters in the latent space; however, this requires an assumption about the form of the probability distribution.
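The ensemble idea can be sketched as follows: train several independently initialised encoders and report the spread of their predictions per parameter (which, as noted above, captures the network's precision rather than its accuracy). The toy "encoders" below stand in for trained networks:

```python
import numpy as np

def ensemble_predict(encoders, spectrum):
    """Mean prediction and per-parameter standard deviation across
    an ensemble of independently trained encoders."""
    preds = np.stack([enc(spectrum) for enc in encoders])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy ensemble: three 'encoders' whose predictions disagree slightly,
# mimicking the run-to-run variation of independently trained networks.
encoders = [lambda s, b=b: s[:2] + b for b in (-0.1, 0.0, 0.1)]
mean, spread = ensemble_predict(encoders, np.zeros(5))
```

A large spread for a given parameter flags spectra where the ensemble members disagree, which is a useful (if incomplete) proxy for uncertainty.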
An intriguing avenue for exploration would be modular networks that could be combined to form the desired model; for example, we could train three networks: one to predict the MTD, another for the power law, and another for the Fe-K emission line. The user could then choose which modules they want parameters for, without the need to train every combination of models.
Furthermore, we could apply our network beyond BHBs, using similar NICER observations of neutron star X-ray binaries, considering other X-ray telescopes, or porting this approach to other regimes of black hole systems, such as active galactic nuclei. We are presently testing our architecture on James Webb Space Telescope (Gardner et al. 2006) transmission spectra of exoplanets to predict the molecular composition of their atmospheres, and on simulated data for the Line Emission Mapper X-ray probe concept (Kraft et al. 2022) to predict the redshift of galaxies.

APPENDIX A

Figure A2 shows the training loss for one of the runs of the convolutional network. While the network achieved a good final validation loss of 5.93 × 10⁻⁴, the training is unstable, with the validation loss changing rapidly between epochs. However, the convolutional network consistently maintains a validation loss below the training loss, which helps prevent overfitting and results in a more consistent final loss.
Figure A3 shows the training loss for one of the runs of the recurrent network. The training is very stable, and the validation loss is consistently below the training loss, showing that the network is not overfitting the training data.
Figure A4 shows the training loss for one of the runs of the recurrent convolutional network. The network shows similar instabilities to the convolutional network; however, it also has a smaller margin between the training and validation losses, increasing the chance of overfitting.
This paper has been typeset from a TeX/LaTeX file prepared by the author.

Figure 1 .
Figure 1. Energy-unfolded spectrum of a BHB (green plus signs) fitted with the model (red line) TBabs(simplcut⊗ezdiskbb), showing the thermal disk (yellow dashed) and coronal power-law components (blue dots).

Figure 2 .
Figure 2. Convolutional layer with a 3 × 3 kernel of weights (orange) convolved with the input (blue) to produce an output (pink) by calculating the weighted summation of the values within the receptive field of the kernel. The first step is shown by the orange box on the input, which produces the top-left value in the output (1.65). The stride of the kernel indicates how many values the kernel moves between calculations; the figure shows a stride of one. Therefore, the dashed orange box on the input shows the second step, which produces the top-middle value (1.06).

Figure 4 .
Figure 4. Distribution (diagonal histograms) of biased (blue) and unbiased (orange) synthetic parameters and correlations (scatter plots) for each free parameter. Each distribution would ideally be flat; however, constraints on the generated spectra cause the parameter distributions to be non-uniform. The biased data increases the probability of undersampled data to counter the restrictions on allowed parameters.

Figure 5 .
Figure 5. Our autoencoder architecture with the encoder (blue dashed box, left) and decoder (orange box, right). The text describes each layer, with the column representing the layer. The layers are linear (pink circles), convolutional (blue rectangles) and recurrent (orange diamonds). The recurrent layer is a two-layer bidirectional gated recurrent unit, with black arrows representing the forward direction and grey arrows the reverse. Shortcut connections are shown by black arrows with a plus sign.

Figure 6 .
Figure 6. The decoder architectures for the linear network (pink dotted boxes), convolutional network (pink dotted and blue dashed boxes), recurrent network (pink dotted and orange boxes) and recurrent convolutional network (all layers). The text describes each layer, with the column representing the layer. The layers are linear (pink circles), convolutional (blue rectangles) and recurrent (orange diamonds). The recurrent layer is a two-layer bidirectional gated recurrent unit, with black arrows representing the forward direction and grey arrows the reverse. Shortcut connections are shown by black arrows with a plus sign.

Figure 7 .
Figure 7. A plot of scaled logarithmic count rates against energy (top subplot) for the spectrum (blue), reconstruction (orange) and saliency (green line, arbitrary y-scale), and the residuals (bottom subplot), for four different examples. The top-right plot shows a harder spectrum where the peak is shifted to higher energies, while the other three are softer spectra where the MTD is the dominant component. The horizontal line artefacts in the spectra of the bottom two plots arise because zero count values are assigned the lowest positive value before the logarithmic transform.

Figure 8 .
Figure 8. Comparison between the encoder's parameter predictions (y-axis) and precalculated parameters (x-axis) in logarithmic space, with a black line showing perfect agreement. Spectra with at least one precalculated parameter pegged at a minimum or maximum value during traditional fitting are generally less well constrained and are coloured red; otherwise, they are blue. Pegged parameters are visible as artificial vertical lines, most commonly in Γ and f_sc.

Figure 9 .
Figure 9. Distribution (diagonal histograms) of encoder predictions (orange) and precalculated parameters (blue), and correlations (scatter plots), for each free parameter in the model. Boundary conditions are shown by the vertical peaks at the edges of the histogram distributions, particularly in Γ and f_sc. T_disk and N are the only parameters to show a correlation.

Figure A1 .
Figure A1. Training (blue) and validation (orange) losses for the linear network during training for 100 epochs.

Figure A2 .
Figure A2. Training (blue) and validation (orange) losses for the convolutional network during training for 100 epochs.

Figure A3 .
Figure A3. Training (blue) and validation (orange) losses for the recurrent network during training for 100 epochs.

Figure A4 .
Figure A4. Training (blue) and validation (orange) losses for the recurrent convolutional network during training for 100 epochs.

Table 1 .
Comparison between the time to train for a single epoch and the average MSE of five runs after training for 100 epochs for the different decoder architectures.The decoder architectures are a six-layer linear network, a nine-layer convolutional network, a seven-layer recurrent network and a ten-layer recurrent convolutional network.

Table 2 .
Comparison of the median reduced PGStat of five runs after training for 100 epochs for different training methods of the encoder.Two factors are compared: the loss function, which can be supervised, unsupervised or semi-supervised, and the training dataset for the encoder, which can be synthetic or real.

Table 3 .
Saliency calculated from the differentiation of the decoder's loss function with respect to each free spectral parameter, averaged over all validation spectra (μ), and the standard deviation of the saliencies (σ), normalised by the smallest μ, as the absolute value is arbitrary.