X-Ray Redshifts of Obscured Chandra Source Catalog AGN

We have computed obscured AGN redshifts using the XZ method, adopting a broad treatment in which we employed a wide-ranging data set and worked primarily at the XZ counts sensitivity threshold, culminating in a redshift catalog of 121 sources that lack documented redshifts. We considered 363 obscured AGN from the Chandra Source Catalog Release 2.0, 59 of which were selected using multiwavelength criteria while 304 were X-ray selected. One-third of the data set had cross-matched spectroscopic or photometric redshifts. These sources, dominated by low-$z$ and low-$N_H$ AGN, were supplemented by 1000 simulations to form a data set for testing the XZ method. We used a multi-layer perceptron neural network to examine and predict cases in which XZ fails to reproduce the known redshift, yielding a classifier that can identify and discard poor redshift estimates. This classifier demonstrated a statistically significant $\sim$3$\sigma$ improvement over the existing XZ redshift information gain filter. We applied the machine learning model to sources with no documented redshifts, producing the new 121-source redshift catalog, all of which were X-ray selected. Our neural network's performance suggests that nearly 90% of these redshift estimates are consistent with hypothetical spectroscopic or photometric measurements, strengthening the notion that redshifts can be reliably estimated using only X-rays, which is valuable to current and future missions such as Athena. We have also identified a possible Compton-thick candidate that warrants further investigation.


INTRODUCTION
Accretion of galactic material onto central supermassive black holes at an extraordinarily high rate results in the most powerful non-explosive class of objects in the Universe, known as Active Galactic Nuclei (AGN; Hickox & Alexander 2018). The population of several million documented AGN across redshifts up to ∼7.6 (Wang et al. 2021) provides a deep window into the history and evolution of galaxies.
According to the Unified AGN Model (Antonucci 1993; Urry & Padovani 1995; Netzer 2015), a toroidal structure of optically thick material looms at the edge of the AGN accretion disk. The diverse geometrical orientation of galaxies relative to the line of sight results in the obscuration of accretion disk emission by this torus for many such orientations, and hence a large portion of AGN are expected to be obscured. This hypothesis is strongly supported by Cosmic X-ray Background (CXB) measurements (see, e.g., Moretti et al. 2009, Cappelluti et al. 2017), in which the CXB spectral peak at E ∼ 30 keV has been attributed to an obscured AGN population with spectra dominated by hard X-ray emission due to absorption at lower energies (Gilli et al. 2007; Treister et al. 2009; Ballantyne et al. 2011; Akylas et al. 2012; Ueda et al. 2014; Aird et al. 2015a; Aird et al. 2015b; Hickox et al. 2019; Ananna et al. 2020a). A complete account of accretion and its role in galaxy evolution is therefore reliant on identifying and studying obscured AGN.
In particular, measuring redshift (z) is valuable when piecing together this cosmological history via various means such as studying AGN accretion history and further developing AGN population synthesis models via the luminosity function and space density measurements (see, e.g., Boyle & Terlevich 1998, Cowie et al. 2003, Ueda et al. 2003, Gilli et al. 2007, Hasinger 2008, Treister et al. 2009, Ueda et al. 2014, Aird et al. 2015a, Buchner et al. 2015, Ananna et al. 2019, Ananna et al. 2020a, Kirkpatrick et al. 2020, Ananna et al. 2020b). However, computing redshifts of obscured AGN is well known for being a highly difficult process (see Simmonds et al. 2018, Peca et al. 2021, and others referenced therein), due largely to the host galaxy contributing more to the overall emission than in the case of an unobscured AGN.
Spectroscopic redshifts using features in the UV to near-IR bands are ideal and were central to the discovery of obscured AGN (Khachikian & Weedman 1974;Weedman 1977;Hickox & Alexander 2018), but are generally not practical due to the considerable required observing time and sensitivity limitations of current instruments.
Photometric redshifts, which consider spectral energy distributions (SEDs), are often used instead. These are reliable for galaxies without AGN, particularly when the SEDs have distinct features such as Lyman and Balmer breaks, emission lines, etc. (Simmonds et al. 2018). An AGN, however, can significantly contribute to the total emission of its host galaxy, making it difficult to disentangle the contributions of the two and therefore attempts to estimate the photometric redshift will often yield multiple degenerate solutions (Salvato et al. 2009).
To avoid the issues encountered by traditional spectroscopic and photometric redshifts for obscured AGN, Simmonds et al. (2018) turned to X-ray spectra, devising a method known as XZ to compute the redshift using only X-ray spectral features, namely the photoelectric cutoff and 7.1 keV Fe Kα absorption edges and, when possible, the 6.4 keV Fe Kα emission line. The XZ method is similar in philosophy to the approaches taken in various other works (see, e.g., Maccacaro et al. 2004, Braito et al. 2005, Civano et al. 2005, Iwasawa et al. 2012, Vignali et al. 2015, Iwasawa et al. 2020, Peca et al. 2021), but XZ is distinguished by its powerful statistical foundation, streamlined modeling procedure (Buchner et al. 2014a, Simmonds et al. 2018), and extensive publicly-available software, which have brought the method success in estimating obscured AGN redshifts. In addition to its practicality, deriving redshift estimates from X-ray spectra introduces another advantage to analyzing X-ray selected obscured AGN, namely that it does not initially require matching a given AGN with multiwavelength counterparts. Such matching is known to be difficult due to the potentially large uncertainties in X-ray position (see, e.g., Hsu et al. 2014), and hence future X-ray telescopes with large PSFs could take advantage of this method. Moreover, future CSC releases and surveys performed by recently-launched or future X-ray missions such as eROSITA (Predehl et al. 2010, Merloni et al. 2012, Predehl et al. 2021), Athena (Nandra et al. 2013, Cassano et al. 2018, Barret et al. 2020, Piro et al. 2021), AXIS (Mushotzky et al. 2019), and Lynx (Gaskin et al. 2018, Ozel 2018, Gaskin et al. 2019) will initially lack counterparts. Therefore, it is vital to have well-tested and effective tools to characterize and analyze the physical nature of observed objects in vast, X-ray-only data sets.
In particular, for X-ray AGN with no multiwavelength counterparts, an exclusively X-ray approach can yield reliable redshift estimates without the need for other surveys.
In this work, we first test the XZ method on obscured AGN with documented redshifts, as well as on a simulated data set, to explore the limits and further constrain the capabilities of XZ, ultimately formulating a machine learning-based procedure to maximize its effectiveness on broad X-ray data sets. We then apply XZ to a large set of obscured AGN without documented redshifts in the Simbad or SDSS DR12 databases to produce a catalog of redshift estimates. Finally, we examine the best-fit models and parameters of interest to identify any peculiar or interesting sources, such as Compton-thick AGN, that warrant further study.
Our implementation of machine learning comes in the midst of a growing trend in astrophysics, data science, and other fields in which researchers are using such algorithms to detect complex patterns in data sets with many-dimensional parameter spaces (see VanderPlas et al. 2012, Ferguson 2018, Baron 2019, Schmidt et al. 2019). We detail the algorithm and the realization of its capabilities on analyzing XZ results, as well as on applying them to our final redshift catalog, in Section 5.2. With the power of this machine learning approach, we produced a refined set of redshift estimates for 121 obscured AGN, none of which have cross-matched documented redshifts and all of which were selected using only X-rays. For this analysis, we considered archival data from Chandra due to its superb angular resolution relative to other observatories. In particular, its 0.5 arcsec resolution allows it to resolve as much as ∼90% of the AGN believed to compose the CXB (Hickox & Markevitch 2006, Hickox & Markevitch 2007, Cappelluti et al. 2017), providing the deepest look into AGN among all current X-ray observatories. We also employed the Chandra Source Catalog Release 2.0 (CSC; Evans et al. 2010; Evans et al. 2019), which has been shown in several recent works to be an excellent resource for large-scale data analysis (e.g., Kovlakas et al. 2020, Sicilian et al. 2020) and is increasingly well-equipped for data science-oriented applications. We adopt a ΛCDM cosmology throughout, with H0 = 67.7 km s⁻¹ Mpc⁻¹, ΩM = 0.307, and ΩΛ = 0.693, which are approximately the results of Planck 2015 (Planck Collaboration et al. 2016).
to modeling X-ray spectra¹. In particular, BXA utilizes the Nested Sampling algorithm (Skilling 2004), an MCMC-like process that yields posterior parameter probability distributions from priors while exploring the parameter space more efficiently than MCMC (see Buchner 2014b for a demonstration of Nested Sampling's capabilities). Additionally, XZ incorporates an advanced background model that operates on principal component analysis (PCA), first published in Simmonds et al. (2018) and now available in BXA. PCA is a machine learning data analysis technique, and BXA's application of the procedure to background modeling has been established as a highly effective approach in various recent works (see, e.g., Brusa et al. 2021, Liu et al. 2021, Arcodia et al. 2021, Malyali et al. 2021, Wolf et al. 2021, Zhu et al. 2021). Simmonds et al. (2018) considered 321 sources from the Chandra Deep Field South (CDF-S), all of which were accompanied by robust optical redshift measurements, unlike our data set, which is dominated by X-ray selected AGN with no cross-matched counterparts. The XZ model is BXA's torus + scattering model detailed in Buchner et al. (2014a), with torus referring to the Brightman & Nandra (2011) TORUS model (which includes an absorbed power law and accounts for torus geometry, expressed here as the opening angle θop and torus viewing angle θview; see Brightman & Nandra 2011 and Buchner et al. 2014a for additional details) and a scattering power law component corresponding to an apparent leakage of AGN emission (Ueda et al. 2007) in a process separate from both Compton scattering and reflection (see Krolik & Kallman 1987, Turner et al. 1997, Guainazzi & Bianchi 2007, Buchner et al. 2014a for further explanation; throughout this work, as in Simmonds et al. 2018, we express the scattering power law normalization as a fraction fscat of the AGN power law normalization). XZ spectral fitting in Simmonds et al. (2018), as well as in this work, was done using Sherpa (Freeman et al. 2001).

The Original XZ Method
Galactic absorption was applied to the model, solar abundances were assumed, and uniform priors were used for both redshift and source NH (measured in units of cm⁻² hereafter, even when not explicitly specified). The posterior redshift distribution for each spectrum was screened for high Information Gain (IG), which quantifies its divergence from the uniform prior, leaving 74 sources for final evaluation. The IG was computed using the Kullback-Leibler definition (shown below, measured in bits; Kullback & Leibler 1951), and the threshold adopted for high IG was 2 bits.

IG = ∫ Posterior(z) log₂ [Posterior(z) / Prior(z)] dz bits

Among these sources, both spectroscopic and photometric redshift values were reliably reproduced and constrained by XZ. The difference between the XZ estimate and known redshift value was generally very small, with ∼60% of known redshift values being within 1σ of XZ and ∼75% of XZ best-fit values falling within 0.15(1 + z) of the known value. Another important evaluation of XZ was the uncertainty of the redshift estimate, σ(z), which was ∼0.2 on average. As noted by Simmonds et al. (2018), while this is not ideal, it is a sufficient result for XZ's purpose, which is to offer a good redshift estimate when other methods are unavailable or impractical. Thus, the work was able to conclude that fits with high redshift IG can produce good redshift estimates that are satisfactorily well-constrained.
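The information gain above can be evaluated numerically from a sampled posterior. Below is a minimal sketch assuming a hypothetical redshift grid, a uniform prior, and a Gaussian-like posterior; the function name and example values are illustrative, not part of the XZ software.

```python
import numpy as np

def information_gain(posterior, prior, z_grid):
    """Kullback-Leibler divergence of the posterior from the prior, in bits."""
    mask = posterior > 0  # zero-posterior bins contribute nothing to the integral
    integrand = posterior[mask] * np.log2(posterior[mask] / prior[mask])
    return np.trapz(integrand, z_grid[mask])

# Hypothetical example: uniform prior on z in [0, 7], narrow posterior near z = 1.2
z = np.linspace(0.0, 7.0, 2001)
prior = np.full_like(z, 1.0 / 7.0)
posterior = np.exp(-0.5 * ((z - 1.2) / 0.2) ** 2)
posterior /= np.trapz(posterior, z)  # normalize to a probability density

ig = information_gain(posterior, prior, z)
# A posterior this narrow comfortably exceeds the 2-bit screening threshold
```

A broad, prior-like posterior would instead give an IG near zero and be discarded by the 2-bit filter.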

XZ Method Updates
Our implementation of XZ remains largely faithful to the original work, with our model differing only in the use of the Wilms et al. (2000) measured ISM abundances and Verner et al. (1996) cross sections in an APEC component imported from XSPEC (Arnaud 1996). However, our approach is broader and more aggressive due to our goals of testing its limits and providing redshifts for a large collection of AGN with no reported literature value. In particular, we explored the low-counts boundary of the simulated sensitivity threshold (∼150), generally considering lower-counts spectra than Simmonds et al. (2018), and we were also not limited to sources with multiwavelength counterparts. Hence, the majority of the data set consists of low-statistics, X-ray only spectra. We also searched widely for obscured AGN by querying the entire CSC, whereas XZ's ideal sources come from deep surveys such as those from CDF-S in Simmonds et al. (2018) due to an increased obscured fraction for higher redshift and lower luminosity (see Ueda et al. 2003, Hasinger et al. 2005), making our analysis a substantially broadened and more aggressive application of XZ. As a result, we must further explore the method's accuracy and limitations. Although BXA's nested sampling procedure employs the Cash statistic (CSTAT; Cash 1979), which is valid for low counts without rebinning, prior to performing any analysis we rebinned all low-counts spectra such that each bin contained a minimum of 5 counts. This was done primarily to estimate goodness of fit using the widely familiar and conventional reduced χ², since CSTAT is distributed approximately as ∆χ² for counts per bin ≫ 1 (Lanzuisi et al. 2013). As shown in that work, 5 counts per bin is sufficient for this approximation. Before producing the final redshift catalog, we calibrated our procedure on both a subset of our sources with known redshifts and a large set of simulations to further demonstrate XZ's reliability and to examine its failure rate.
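The minimum-counts grouping described above can be sketched as follows. This is a hypothetical illustration of the logic; in practice the rebinning would be done with standard X-ray tools (e.g., CIAO's grouping utilities), and the function name here is ours.

```python
import numpy as np

def group_min_counts(counts, min_counts=5):
    """Group adjacent spectral channels until each bin holds >= min_counts.

    Returns the grouped counts and the index of the first channel in each bin.
    Leftover trailing channels below the threshold are folded into the last bin.
    """
    grouped, edges, acc, start = [], [], 0, 0
    for i, c in enumerate(counts):
        if acc == 0:
            start = i
        acc += c
        if acc >= min_counts:
            grouped.append(acc)
            edges.append(start)
            acc = 0
    if acc > 0:  # fold any remainder into the previous bin
        if grouped:
            grouped[-1] += acc
        else:
            grouped.append(acc)
            edges.append(start)
    return np.array(grouped), np.array(edges)

# Hypothetical low-counts channel sequence
counts = np.array([1, 0, 2, 3, 6, 1, 1, 1, 4, 0, 2])
g, e = group_min_counts(counts)
# Every resulting bin now holds at least 5 counts
```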
Here, we adopt largely the same evaluation metrics used in the original work, but apply new filtering criteria in addition to the IG screening. The final methodology, described in the ensuing sections, should serve as a guide for obtaining reliable redshift estimates for broad X-ray data sets that lack multiwavelength counterparts, which will be found particularly useful for the aforementioned new and future X-ray observatories.

DATA SELECTION
To assemble our data set, we queried the CSC for virtually all obscured AGN detected by the Advanced CCD Imaging Spectrometer (ACIS; Garmire et al. 2003) with sufficient counts, employing three simple selection criteria: Chandra hard band (2.0-7.0 keV) counts > 150 to approximately match the XZ sensitivity threshold, logNH > 22 (as estimated by the CSC's simple absorbed power law model, which assumes z = 0) to filter out unobscured AGN, and galactic latitude |b| ≥ 10 deg to remove all sources in the Galactic Disc (using the same X-ray Galactic Disc boundary as Sicilian et al. 2020). These criteria may exclude some obscured AGN, since AGN fitted with Γ < 1.4 and logNH ∼ 21 may in fact be obscured sources (Lanzuisi et al. 2018). However, as the investigation of these potential obscured candidates is outside the scope of this work, we restrict our search only to the CSC-identified obscured sources with logNH > 22. This search yielded >700 sources with a combined >1600 individual spectra. Note that there are more spectra than sources because some sources were detected in more than one observation, each with its own associated spectrum.
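The three query cuts can be expressed as a single predicate. The sketch below applies them to hypothetical catalog rows; the actual selection was run as a query against the CSC, and the rows and function name here are invented for illustration.

```python
def passes_selection(hard_counts, log_nh, b_deg):
    """The three cuts used to build the initial sample from the CSC."""
    return hard_counts > 150 and log_nh > 22 and abs(b_deg) >= 10

# Hypothetical catalog rows: (hard-band counts, CSC log N_H, galactic latitude b)
sources = [
    (420.0, 22.8, 35.0),   # passes all three cuts
    (90.0, 23.1, -40.0),   # too few hard-band counts
    (600.0, 21.5, 60.0),   # unobscured by the CSC estimate
    (300.0, 22.3, 4.0),    # inside the Galactic Disc band
]

selected = [s for s in sources if passes_selection(*s)]
# Only the first hypothetical source survives
```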
Spectra were obtained directly from the CSC, which provides fully-reduced data products, including source spectra as well as the corresponding response and background files. For sources detected in multiple observations, all available spectral data were stacked via the Chandra Interactive Analysis of Observations (CIAO; Fruscione et al. 2006) software tool combine_spectra to maximize statistics in both the source and background data. The issue of AGN variability among these sources, particularly the effects that stacking variable AGN spectra may have on the XZ results, is addressed in Section 6.3.
This initial data set was cross-matched with several databases in search of multiwavelength counterparts, namely Gaia Data Release 2 (DR2; Gaia Collaboration et al. 2018), the Simbad astronomical database (Wenger et al. 2000), SDSS Data Release 12 (DR12; York et al. 2000; Eisenstein et al. 2011; Alam et al. 2015; Blanton et al. 2017), and WISE (Wright et al. 2010; Cutri et al. 2012) via the CDS X-Match Service² (Boch et al. 2012; Pineau et al. 2020). Matched counterparts offered documented redshift values for many of the sources, as well as multiwavelength AGN selection data. We define two data subsets: all sources with documented redshifts in either Simbad or SDSS DR12 compose the "Known-z" set, and all others represent the "No-z" set. Below we detail our AGN selection process.

AGN Selection
When possible, the AGN selection pipeline employed multiwavelength criteria, but in most cases such data were not available or were inconclusive, and hence for those sources we relied on X-ray only selection. Before applying our selection methods, we first made use of parallax data available in Gaia DR2. For sources with counterparts in Gaia DR2, we removed those with parallax detections of ≥ 5σ significance, since all parallax measurements for the 556,849 Gaia quasars examined by Luri et al. (2018) fall below 5σ.
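The parallax-significance cut can be sketched as below. The function name and example values are hypothetical; the actual filtering used the Gaia DR2 parallax and parallax-error columns.

```python
import numpy as np

def significant_parallax(parallax_mas, parallax_err_mas, threshold=5.0):
    """Flag sources whose parallax is detected at >= threshold sigma.

    Gaia quasars show no parallax detection at the 5-sigma level
    (Luri et al. 2018), so significant-parallax matches are treated
    as Galactic contaminants and removed from the AGN sample.
    """
    return np.abs(parallax_mas) / parallax_err_mas >= threshold

# Hypothetical Gaia DR2 matches (parallax and its uncertainty, in mas)
plx = np.array([12.0, 0.3, -0.1])
err = np.array([0.5, 0.2, 0.15])
keep = ~significant_parallax(plx, err)
# The first source (24-sigma parallax) is removed; the other two are kept
```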

Multiwavelength
For each source with an SDSS-matched r-magnitude and WISE-matched spectral data, we first considered the X-ray to optical flux ratio. As thoroughly discussed in various works (see Civano et al. 2012, Marchesi et al. 2016a, Hickox & Alexander 2018), the ratio of X-ray flux (fx) to optical flux (fo) can be used to distinguish AGN from stars and galaxies, as the powerful emission from AGN accretion results in much higher relative X-ray flux than typical X-ray sources. Here, in accordance with various works including Civano et al. (2012), we adopt fx/fo = 0.1 as the minimum AGN ratio. The SDSS r-magnitude is plotted against Chandra broad X-ray band (0.5-7 keV) flux in Figure 1, which shows the results of the flux ratio test.
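A Maccacaro-style conversion turns an optical magnitude into log10(fx/fo). The sketch below uses the classic V-band zero point C = 5.37 purely as an illustrative assumption; the exact constant appropriate for the SDSS r band differs slightly, and the function names and example fluxes are ours.

```python
import numpy as np

# Illustrative zero point: log10(fx/fo) = log10(fx) + m/2.5 + C.
# C = 5.37 is the classic V-band value (Maccacaro et al. 1988); the r-band
# constant differs slightly, so treat this as an assumption for the sketch.
C = 5.37

def xray_to_optical_ratio(fx_cgs, r_mag):
    """log10(fx/fo) from broad-band X-ray flux (erg cm^-2 s^-1) and r magnitude."""
    return np.log10(fx_cgs) + r_mag / 2.5 + C

def is_agn_candidate(fx_cgs, r_mag, min_ratio=0.1):
    """Apply the fx/fo >= 0.1 cut, i.e. log10(fx/fo) >= -1."""
    return xray_to_optical_ratio(fx_cgs, r_mag) >= np.log10(min_ratio)

# Hypothetical source: fx = 5e-14 erg cm^-2 s^-1, r = 21.0
ratio = xray_to_optical_ratio(5e-14, 21.0)
# log10(fx/fo) ~ 0.47, well above the -1 boundary
```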
We then considered WISE spectral data, in particular flux in the W1, W2, and W3 bands (see Wright et al. 2010), which we used to perform WISE color analysis. As detailed in various works (see, e.g., Mateos 2014, Nikutta et al. 2014), AGN can be selected by comparing the colors W1−W2 and W2−W3. Here, we adopt the color thresholds W1−W2 > 0.5 and W2−W3 > 2.0 to form the boundaries of the AGN selection subspace. The results of WISE selection are also shown in Figure 1.

Figure 1. Left: SDSS r-band magnitude compared to the Chandra broad X-ray flux, indicating the optical-to-X-ray flux ratio selection results. Right: WISE color analysis. In both panels, the selection boundary is marked with dotted lines and passing sources are plotted in blue.

Figure 2. Positions in galactic coordinates of the 363 obscured AGN selected in our data set. Upward-oriented blue triangles represent the 192 AGN in the final, fully-filtered data set. Downward-oriented gray triangles represent AGN removed from the initial set of 363 during the XZ process.
We required sources to meet both of these selection criteria in order to be designated as multiwavelength-selected AGN, and since those criteria each adopted conservative thresholds, we expect little to no contamination in the resulting sample. This yielded 59 multiwavelength-selected sources that entered the XZ analysis pipeline, 28 of which remained in the final data set.
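The WISE color cuts can be sketched as a simple vectorized selection. The function name and example magnitudes below are hypothetical illustrations of the adopted thresholds.

```python
import numpy as np

def wise_agn_select(w1, w2, w3):
    """Apply the adopted WISE color cuts: W1-W2 > 0.5 and W2-W3 > 2.0."""
    return (w1 - w2 > 0.5) & (w2 - w3 > 2.0)

# Hypothetical WISE magnitudes (W1, W2, W3) for three matched counterparts
w1 = np.array([14.2, 13.0, 15.1])
w2 = np.array([13.5, 12.8, 14.4])
w3 = np.array([10.9, 11.5, 12.9])

in_agn_box = wise_agn_select(w1, w2, w3)
# Only the first counterpart satisfies both color thresholds
```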

X-ray Hardness Ratio
A known X-ray signature of obscured AGN is a relatively high ratio of hard- to soft-X-ray flux, due to the higher rate of absorption for lower-energy X-rays (Mainieri et al. 2002). For Chandra, these soft and hard bands are defined as 0.5-2.0 keV and 2.0-7.0 keV, respectively. The corresponding Hardness Ratio (HR) comparing these bands is defined as

HR = (H − S) / (H + S),

where H and S represent net source counts in the hard and soft X-ray bands, respectively. Here, for each source, we used the HR provided by the CSC, computed via a Bayesian statistics approach on available Chandra data for a given source to produce a best-fit HR value and confidence interval. The methodology is briefly described on the CSC website³ and a full explanation can be found in a publicly available CSC memo⁴. As shown in various works (see, e.g., Della Ceca et al. 1999, Wang et al. 2004, Tajer et al. 2007, Brightman & Nandra 2012, Marchesi et al. 2016b, Peca et al. 2021), HR is a useful and reliable method for identifying obscured AGN. One caveat is the time-dependent nature of the Chandra ACIS HR, which stems from the degradation of the soft-band spectral response, a decline that has accelerated in recent years (for detailed discussions, see Plucinsky et al. 2002 and Peca et al. 2021). This work considers only data present in the CSC, and hence no observations taken later than 2014 are included, thus mitigating the soft band's efficiency decline.
We conservatively adopt the threshold HR > −0.1 from Peca et al. (2021), whose data set consists of observations taken in 2017 (Nanni et al. 2018) and whose HR selection threshold is derived via thorough simulations of obscured AGN. Our adoption of this robust criterion, defined using 2017 Chandra data, ensures the integrity of our selection technique, as the 2017 data of Peca et al. (2021) suffer more from the HR degradation than our pre-2015 data set.
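The HR definition and the adopted threshold combine into a one-line selection. The sketch below uses hypothetical net counts; in practice we used the CSC's Bayesian best-fit HR rather than computing it directly from counts.

```python
HR_THRESHOLD = -0.1  # obscured-AGN cut adopted from Peca et al. (2021)

def hardness_ratio(hard, soft):
    """HR = (H - S) / (H + S), net counts in the 2.0-7.0 and 0.5-2.0 keV bands."""
    return (hard - soft) / (hard + soft)

def hr_selected(hard, soft):
    """True when the source's hardness ratio exceeds the obscuration threshold."""
    return hardness_ratio(hard, soft) > HR_THRESHOLD

# Hypothetical net counts for two sources
hr_obscured = hardness_ratio(300, 60)   # soft band heavily absorbed -> HR ~ 0.67
hr_soft = hardness_ratio(100, 400)      # soft-dominated spectrum    -> HR = -0.6
```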
We applied our HR criterion to the CSC's best-fit HR value for all sources that did not have sufficient data for the multiwavelength selection pipeline, as well as several sources with inconclusive multiwavelength selection results. The resulting X-ray selected AGN, particularly those with low counts, are the most noteworthy sources in the data set, as XZ must extract redshift estimates from their few counts in the X-ray band alone. This process yielded 304 X-ray selected AGN, giving a total of 363 obscured AGN in our initial data set. The positions of all selected AGN are shown in galactic coordinates in Figure 2. The final, fully-filtered data set contains 164 X-ray selected AGN. In the following section, we test the limits of the XZ method on our initial 363 AGN.

TESTING XZ
Below, we evaluate our adaptation of XZ with its low counts and more general data set. We first consider the Known-z data subset, fitting the sources and filtering the results for high IG (≥2 bits), allowing a practical test on observed obscured AGN candidates. We repeat this process on a large set of simulations, then evaluate error and determine a procedure for filtering poor results out of the No-z data subset.

Known-z Data
The IG-filtered data set contains 302 AGN. Its Known-z subset consists of 105 sources, 85 of which are documented in the Simbad database, while 20 others have spectroscopic or photometric redshifts in SDSS DR12 ("specz" and "photoz" hereafter, respectively) but are not yet included in Simbad. Figure 3 follows the presentation of Simmonds et al. (2018), plotting the XZ best-fit redshift against the documented value. This figure best summarizes our measures of success rates, described in the three different success tiers defined in Table 1. ∼79% of sources achieved basic success, with an overlap between XZ's 1σ uncertainty and the known redshift's ±0.15(1 + z) uncertainty region. ∼68% saw mild success, where the XZ best-fit value lies within the ±0.15(1 + z) uncertainty range of the known redshift. Finally, ∼49% achieved top-tier success, where XZ's 1σ uncertainty range includes the known redshift value. These values are lower than the results seen in the original work (with mild and top-tier success rates of ∼75% and ∼60%, respectively), which was expected due to our broader data set and less conservative counts threshold, accordingly suggesting that further screening may be needed.
We have also defined a criterion for "catastrophic failure," in which the 2σ confidence interval for XZ has no overlap with the 2σ confidence interval for the known redshift, computed as ±0.3(1 + z). For these IG-filtered results, the catastrophic failure rate is ∼11%.
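The catastrophic-failure criterion reduces to an interval-overlap test. The sketch below is a hypothetical implementation of that check; argument names and example intervals are ours.

```python
def catastrophic_failure(z_xz_lo2, z_xz_hi2, z_known):
    """True when the XZ 2-sigma interval misses the known redshift's
    2-sigma band, taken as +/- 0.3 * (1 + z_known)."""
    known_lo = z_known - 0.3 * (1.0 + z_known)
    known_hi = z_known + 0.3 * (1.0 + z_known)
    return z_xz_hi2 < known_lo or z_xz_lo2 > known_hi

# Hypothetical fits against a known z = 1.0, whose 2-sigma band is [0.4, 1.6]
ok = catastrophic_failure(0.8, 1.5, 1.0)    # overlaps the band -> not a failure
bad = catastrophic_failure(2.2, 2.9, 1.0)   # entirely above 1.6 -> failure
```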
Another measure of success, a commonly adopted metric that is also seen in Simmonds et al. (2018), is the redshift dispersion (∆z_adj), given by

∆z_adj = ∆z / (1 + z_known),

where ∆z is computed as

∆z = z_XZ − z_known.

This metric directly compares XZ to the known redshift, offering the most direct, straightforward comparison of the best-fit values. The upper panel of Figure 4 shows the distribution of ∆z_adj for the high-IG sources. We see ∆z_adj peak fairly strongly around zero, with a median of ∼0.0, but due to the various failures we see considerable spread in the distribution and a fairly large mean of ∼0.3, which again seems to warrant an additional screening process. The final evaluation metric assesses how well the model constrains the redshift by considering the posterior distribution's uncertainty, σ(z). Here, we define σ(z) as an average symmetric 1σ uncertainty, i.e. one-half the ±1σ error range. As this metric refers only to the model's redshift parameter, it can be computed for all sources in both the Known-z and No-z subsets. The lower panel of Figure 4 shows the distribution of σ(z), which has a median and mean of ∼0.13 and ∼0.19, respectively, and a fairly strong tendency towards zero. These results are consistent with the ∼0.2 found in Simmonds et al. (2018), which is a good result but is likely a trivial consequence of adopting the same IG threshold, as a high IG is correlated with low uncertainty, with IG > 2 bits roughly corresponding to σ(z) < 0.4.
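The two evaluation metrics can be computed directly from a fit's best-fit value and asymmetric error bars. A minimal sketch, with hypothetical function names and example values (the sign convention ∆z = z_XZ − z_known is assumed):

```python
def delta_z_adj(z_xz, z_known):
    """Normalized redshift dispersion: (z_XZ - z_known) / (1 + z_known)."""
    return (z_xz - z_known) / (1.0 + z_known)

def sigma_z(err_lo, err_hi):
    """Average symmetric 1-sigma uncertainty: half the full +/-1-sigma range.

    err_lo and err_hi are the magnitudes of the lower and upper error bars.
    """
    return 0.5 * (err_lo + err_hi)

# Hypothetical fit: z_XZ = 1.10 (+0.12 / -0.08) against a known z = 1.00
dz = delta_z_adj(1.10, 1.00)   # -> 0.05
sz = sigma_z(0.08, 0.12)       # -> 0.10
```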
In spite of the apparent insight offered by these results, it is clear from Figure 3 that the Known-z subset is heavily dominated by low-redshift, minimally obscured sources. This likely explains why our success rates are lower than those found in Simmonds et al. (2018), as that work considered higher redshifts and obscuration levels by employing CDF-S. CDF-S and other deep fields are abundant with faint AGN, which are generally high-z or highly obscured. Here, however, we considered a counts-selected data set from the general archive, which is fundamentally dominated by lower redshifts. Therefore, while we continued to take Known-z into consideration, particularly during the adoption of new filtering techniques, we turned to simulations for a more robust analysis of XZ's capabilities.

Simulations
For a more comprehensive test of XZ, we simulated 1000 obscured AGN. We adopt the convention from both Simmonds et al. (2018) and Peca et al. (2021) in which Γ is fixed at 1.9. The torus geometry parameters largely resemble those used in Simmonds et al. (2018), with the opening angle θop fixed at 45° and the torus viewing angle θview fixed at 80° (approximately edge-on). Note that, as in Buchner et al. (2014a) and Simmonds et al. (2018), those geometry parameters are fixed in the BXA fitting procedure as well, as neither this work nor those works sought to further constrain the geometry of each obscured AGN. As in Simmonds et al. (2018), we also froze the scattering fraction (fscat), but instead of using fscat = 10⁻² from Simmonds et al. (2018), we chose fscat = 10⁻⁴ to best reflect our data set, namely the mean (10⁻³·⁷) and median (10⁻⁴·²) of best-fit values for all spectra. Each simulation was assigned a random set of obscuration, redshift, and counts values, constructed to cover similar ranges of obscuration and redshift values as our data set, and to resemble its counts distribution. Specifically, we used uniformly random values for logNH between 22 and 26 and uniformly random values for redshift between 0.01 and 3, both in steps of 0.01. The allowed values for simulation model parameters of interest are organized in Table 2, which also shows the priors used for the BXA fitting method. Counts were assigned randomly, but with a non-uniform probability designed to closely emulate the counts distribution in the data set (Figure 5). Background spectra were simulated by sampling a background file randomly selected from sources in the data set with multiple observations, as all such background files were stacked from the constituent observations. This ensured sufficient statistics to produce a viable and realistic background spectrum.

Parameter   Simulated value           BXA prior
logNH       Uniform from 22 to 26     Uniform from 20 to 26
z           Uniform from 0.01 to 3    Uniform
Γ           1.9                       Gaussian (µ = 1.95, σ = 0.15)
fscat       10⁻⁴                      Uniform from 10⁻⁷ to 10⁻¹
θop         45°                       -
θview       80°                       -

Table 2. The allowed parameter values in the simulated spectra, as described in the text. Free parameters are listed with their sampling intervals in the top rows, while fixed parameters are given in the bottom rows. Torus geometry priors are absent since those parameters are also fixed in our BXA fit procedure.
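The random parameter draws for the simulated AGN can be sketched as below. This is a minimal illustration of the 0.01-spaced uniform grids; the matched counts distribution and background sampling described above are omitted, and the seed and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sim = 1000

# 0.01-spaced grids for the two free simulation parameters
log_nh_grid = np.linspace(22.0, 26.0, 401)   # logN_H in [22, 26], step 0.01
z_grid = np.linspace(0.01, 3.0, 300)         # z in [0.01, 3], step 0.01

log_nh = rng.choice(log_nh_grid, size=n_sim)
z = rng.choice(z_grid, size=n_sim)

# Fixed spectral parameters, as in the text
GAMMA, F_SCAT = 1.9, 1e-4                    # photon index, scattering fraction
THETA_OP, THETA_VIEW = 45.0, 80.0            # torus geometry, degrees
```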
For each simulation, we used the response files from the source whose background file was randomly selected for that simulation's background spectrum. Using the same metrics described in Section 4.1, we can statistically quantify the capabilities of XZ. In this case, the only substantial limitation of our evaluation is systematic bias inherent and virtually unavoidable in these simulations. However, due to the large number of simulated sources, their wide parameter ranges, and the physically descriptive model on which the simulations were based, we consider the spectra to be quality test subjects for XZ. More importantly, we consider them to be better estimators of XZ's competence than the sparse results offered by smaller, less diverse data sets such as Known-z.
We ran the XZ procedure on the simulated data set, applying the IG ≥ 2 filter to the results, leaving us with 942 simulated spectra. The panels in Figure 6 show the XZ redshift (left) and NH (right) plotted against the simulated input values. Using the same definitions of success given in Section 4.1, we see basic success in ∼85% of the sources, mild success in ∼82%, and top-tier success in ∼54%. The ∆z_adj and σ(z) distributions are shown in Figure 7. XZ's performance on the simulated data, then, is noticeably better than its performance on the Known-z sources, so it could be argued that this is evidence of a major systematic issue. However, it is critical to emphasize that XZ's accuracy on the simulated data set is fairly consistent with Simmonds et al. (2018) (the simulations' mild success rate exceeds that of Simmonds et al. 2018 by ∼7%, while the top-tier success rate falls ∼6% short), and also that the simulated data set spans a broad range of redshifts and obscurations like the deep surveys considered by Simmonds et al. (2018). As discussed, Known-z is dominated by low-z and low-NH sources, so XZ's performance on that data set was expected to be less favorable than on the simulations.
In spite of the promising success of XZ on the simulations, we take a greater interest in the failures, as analyzing the nature of these failures and understanding how to predict them can be vital to interpreting results when XZ is run on sources with no known redshifts. Detailed below is our approach to this analysis, the goal of which was to establish a method for evaluating errors on XZ fits a posteriori when a priori information, such as known redshift, is not available.

XZ FAILURE ANALYSIS
To examine the reliability of XZ and the factors contributing to its success or failure, we established a single pass/fail metric to classify each fit. Namely, if the XZ result on a given source meets either the mild or top-tier success criterion (using the conventional inclusive "or") defined in the previous sections, we classify the fit as positive ("pass", +). The basic success criterion is not considered sufficient due to the large and irregular disagreement it allows between the best-fit values of XZ and the known z. To maximize the size of the training and testing sets, and to mitigate systematic bias from using purely simulated data, we combined the IG-filtered simulated data set with the IG-filtered Known-z set, producing a new master set of 1047 sources. We found the overall pass rate to be ∼81%, hence making the failure rate ∼19%.
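The pass/fail label can be expressed as a small function of the fit outputs and the known redshift. A minimal sketch, with hypothetical argument names and example values:

```python
def classify_fit(z_xz, z_xz_err1, z_known):
    """Pass/fail label used for the failure analysis.

    Pass (+) when the fit meets the mild criterion (best-fit value within
    0.15 * (1 + z_known) of the known z) or the top-tier criterion (known z
    inside the XZ 1-sigma range). z_xz_err1 = (lower, upper) 1-sigma bounds.
    """
    mild = abs(z_xz - z_known) <= 0.15 * (1.0 + z_known)
    lo, hi = z_xz_err1
    top_tier = lo <= z_known <= hi
    return mild or top_tier

# Hypothetical fits against a known z = 1.00
p1 = classify_fit(1.05, (0.95, 1.20), 1.00)   # mild and top-tier -> pass
p2 = classify_fit(2.40, (2.30, 2.55), 1.00)   # neither criterion -> fail
```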

Parameter-Dependent Failure Rates
We first performed a basic analysis of the error rates in the combined simulated and Known-z data set as functions of various spectral characteristics, namely the known z and N_H values, IG, counts, reduced fit statistic, the XZ best-fit redshift and N_H values, as well as σ(z) and σ(N_H). This set of source and fit characteristics was determined by assembling all attributes that we qualitatively expect to substantially impact the results, either physically or statistically, excluding only the power-law photon index Γ (and its associated uncertainty) due to its controlled value in the simulations. Otherwise, all characteristics potentially meaningful for the redshift results were included. Failure rates for all assessed characteristics are shown in Figure 8.

Known Redshift and Obscuration
The failure rates associated with the known z and N_H values are largely unsurprising. Redshifts from ∼0.75 to ∼2.5 all registered below-average failure rates, which is expected from the sensitivity limits determined in Simmonds et al. (2018). The model is not sensitive to near-zero redshifts, and high-redshift sources are typically faint, with model-critical features redshifted out of Chandra's effective energy band.
Similarly, log N_H yielded mostly below-average failure rates between 22 and 25, again consistent with the Simmonds et al. (2018) findings. Spectra with both low and extremely high N_H are difficult to distinguish from an unabsorbed power law in the Chandra energy band, depriving the model of the prominent photoelectric absorption cutoff vital to its redshift measurement. These results, while consistent with predictions, are not able to help make new predictions about the XZ method, particularly in cases without known values from other observations, in which our goal is to deduce a posteriori conclusions about XZ model results. Therefore, we aim to explore only features knowable or obtained via the fitting process alone.

Fit Characteristics
The most prominent trend in XZ fit characteristics is the decrease of failure rate with an increase of redshift information gain, but otherwise only sparse relationships are evident. While we could simply apply a much stricter IG filter, increasing the threshold to ∼5 bits, this would leave us with just ∼17% of the initial data set (and only ∼20% of the IG ≥ 2 bits data set), potentially removing hundreds of successful fits in the process.
Another possible filtering method given these results would be to simply remove any source whose value for a given attribute falls into one of that attribute's bins with an above-average failure rate. While this may be effective for our data set (though it would still likely suffer from removing too large a portion of the total fits), the goal in this work is to produce a more general methodology that can be applied to future large-scale X-ray redshift measurements. Therefore, our filtering procedure must be transferable to the analysis of any X-ray data set, and hence cannot be tailored to the specific errors found here.
The clearest and most plausible explanation for the absence of obvious effective filters based on these error rates is the dependence of the fit's success on the entire aggregate of these characteristics. Therefore, rather than attempt to sort out the co-dependencies of the fit attributes directly, we turned instead to machine learning to identify successful fits.

Machine Learning Analysis
We employed a multi-layer perceptron (MLP) neural network, part of the Scikit-learn Python software package (Pedregosa et al. 2012), due to the MLP's ability to solve difficult non-linear problems, as well as its effectiveness at classifying categorical variables (Ciaburro et al. 2013) such as the pass/fail criteria we apply to XZ. The power of a neural network arises from its forward-propagation of input through a non-linear activation function contained in each constituent neuron. Neurons are connected such that a neuron in a given layer will receive input from every neuron in the previous layer, and send its output to every neuron in the ensuing layer. Together, the neurons produce a final model that can approximate any arbitrarily complex function (Hornik et al. 1989). Therefore, in principle, a neural network is capable of solving any classification problem, a result known as the Universal Approximation Theorem.
We constructed our neural network with five layers of neurons (thus meeting the general criterion for "deep learning"), including the input, three hidden layers (containing 1200, 800, and 400 neurons, respectively), and the output, which yields the classifier. For our activation function, we used the highly successful rectified linear unit (ReLU), the most widely used activation function in deep learning (Hara et al. 2015; Nwankpa et al. 2018). For more detailed discussions of the Scikit-learn software and MLP neural networks, see Murtagh (1991), Pedregosa et al. (2012), and Ramchoun et al. (2016).
To obtain a training set and a testing set, we performed a random split of the combined Known-z and simulated sources. We chose to allot ∼50% of the data to the training set, which should provide a balanced sample as indicated by the results of Dobbin & Simon (2011) and Xu & Goodacre (2018). We assigned a binary class to each source, the simple pass/fail as defined earlier in this section. Therefore, the MLP should produce a model that will classify a given source's XZ redshift estimate as either pass (+) or fail (−). The algorithm was trained using the seven a posteriori source characteristics given in Section 5.1.2 as the features, namely XZ redshift, XZ log N_H, σ(z), σ(N_H), CSTAT_ν, IG, and counts. All were rescaled such that each feature had a mean of zero and unit variance. The resulting model was applied to the testing set to evaluate its accuracy. The results of these tests are shown in Table 3.
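The setup described above can be sketched with Scikit-learn as follows; the feature values are synthetic stand-ins and the optimizer settings (e.g., `max_iter`) are illustrative, not our actual training configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the seven a posteriori features:
# XZ redshift, XZ log N_H, sigma(z), sigma(N_H), CSTAT_nu, IG, counts
X = rng.normal(size=(1047, 7))
y = (X[:, 5] + 0.5 * rng.normal(size=1047) > 0).astype(int)  # toy pass/fail labels

# ~50/50 split of the master set into training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Rescale so each feature has zero mean and unit variance (fit on training data)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Five-layer network: input, three hidden ReLU layers (1200, 800, 400), output
clf = MLPClassifier(hidden_layer_sizes=(1200, 800, 400), activation="relu",
                    random_state=0, max_iter=50)
clf.fit(X_tr_s, y_tr)
pred = clf.predict(X_te_s)  # binary pass (+) / fail (-) classifications
```

The split and scaling are fit on the training data only, so no information from the testing set leaks into the model.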

Performance on Total Testing Set
The most important performance metric, for our purposes of identifying good XZ redshift values, is the probability that a fit classified as successful (test+) is indeed a good estimate of the redshift (is+). This is a common metric (e.g., Waller & Levi 2021; Bentley 2021) known in machine learning as "precision," and can be computed using Bayes' Theorem (Bayes 1763) in the following manner:

P(is+ | test+) = P(test+ | is+) P(is+) / P(test+).   (4)

Since precision gives the probability that a positively-classified source is truly a good redshift estimate, it is the most crucial accuracy measure of our machine learning filter, as it represents the percentage of the testing set that would correctly estimate the redshift if we filtered out all spectra classified as failures (−). That is, it would be the filtered testing set's pass rate, and therefore offers an estimate for the performance we could expect on a general data set when filtered accordingly using the fitted classifier. As shown in Table 3, we find the precision to be ∼89% in the testing set.
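As a numerical sanity check on the Bayes'-Theorem form of precision (Equation 4), the direct definition and the Bayes computation agree exactly; the confusion-matrix counts below are hypothetical:

```python
# Hypothetical confusion-matrix counts for a binary pass/fail classifier
tp, fp, fn, tn = 89, 11, 15, 85
n = tp + fp + fn + tn

p_is_pos = (tp + fn) / n              # P(is+): true pass rate
p_test_pos = (tp + fp) / n            # P(test+): fraction classified as pass
recall = tp / (tp + fn)               # P(test+ | is+)

# Bayes' Theorem form of precision: P(is+ | test+)
precision_bayes = recall * p_is_pos / p_test_pos
precision_direct = tp / (tp + fp)     # the usual definition
```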
To evaluate this finding, we can compare the precision to the testing set's original pass rate, which was ∼83%, by performing a simple two-proportion z-test. Our classifier yields a statistically significant ∼3σ improvement over the initial pass rate, therefore offering strong evidence and motivation for applying the MLP classifier as a new filter. Note that the pass rate of ∼81% (and corresponding fail rate of ∼19%) described in Section 5 and reflected by Figure 8 refers to the initial pass rate of the entire combined set of simulations and Known-z sources, and therefore differs slightly from the initial ∼83% pass rate of the testing set due to a slight statistical fluctuation in the train/test splitting process.
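A minimal implementation of the two-proportion z-test is sketched below; the sample sizes here are hypothetical placeholders, not the actual testing-set counts:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z-statistic using the pooled proportion."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Illustrative: ~89% precision in the MLP-filtered set vs. an ~83% initial
# pass rate; the sample sizes are hypothetical placeholders
z = two_proportion_z(0.89, 350, 0.83, 520)
p_one_sided = 1 - NormalDist().cdf(z)  # one-sided p-value
```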
More rigorously, this approach is equivalent to considering the existing IG ≥ 2 filter as the null hypothesis classifier, which would result in every source in the testing set being accepted as a good redshift estimate. In this case, we would compute the corresponding precision and compare it statistically to that of our MLP classifier. Furthermore, it can be shown using Equation 4 that our choice of the null hypothesis is mathematically arbitrary because any random classifier (e.g., giving each source a statistically independent, random 50% chance of passing) would yield the same precision as allowing all sources to pass (which, in this context, is equivalent to giving each source a statistically independent, "random" 100% chance of passing). This is why, regardless of the approach, the null hypothesis precision is exactly equal to the pass rate of the testing set, and therefore allows for the intuitive comparison between the initial pass rate and the model precision.
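The claim that any statistically independent random acceptance probability yields the same expected precision as accepting everything can be checked with a quick Monte Carlo; the ∼83% pass rate mirrors our testing set, and everything else here is simulated:

```python
import numpy as np

rng = np.random.default_rng(5)
is_good = rng.random(100_000) < 0.83   # simulated "is+" labels, ~83% pass rate

# Null classifier 1: accept every source
prec_accept_all = is_good.mean()

# Null classifier 2: accept each source with an independent 50% coin flip
coin = rng.random(is_good.size) < 0.5
prec_coin = is_good[coin].mean()
```

Because the acceptance is independent of the labels, both "filters" recover a precision equal to the underlying pass rate, as stated in the text.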
As shown in Table 3, we also considered other performance measurements to further evaluate the classifier, particularly its ability to correctly identify good redshift estimates (+ cases). These performance findings on the total testing set were also favorable. The model had an accuracy score of ∼79%. The accuracy score of a classifier is computed by dividing the number of correct classifications by the total number of sources in the testing set, which indicates the classifier's overall ability to classify both + and − cases, and thus ∼79% is a fairly positive result. In addition, we found the recall to be ∼85%. Recall (also known as the "true positive rate") is the probability that a good redshift estimate (is+) will be classified as such (test+). Note that this differs from precision; in particular, recall is the P(test+ | is+) term in Equation 4. Therefore, although recall is less crucial to interpreting the filtered data set than precision (which, as discussed, gives the percentage of good redshift estimates in the filtered data set), it still offers similar insight into the classifier's performance, giving the percentage of good estimates in the data set that are classified as such by the model. An equivalent interpretation is that 1 − recall is the proportion of good redshifts that will be erroneously filtered out. Both interpretations make it clear that the recall should be close to 1, so our ∼85% for the total testing set shows good model performance. Moreover, the model also exhibited a favorable receiver operating characteristic curve (ROC curve; Figure 9), which examines the relationship between a classifier's rates of true positives and false positives. The area under the ROC curve (AUC ROC) is ∼0.78, indicating the classifier's tendency towards true positives and away from false positives.
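For reference, each of these metrics corresponds directly to a standard Scikit-learn function; the labels and scores below are synthetic placeholders:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)                       # mock pass/fail truth
scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.25, size=500), 0, 1)
y_pred = (scores > 0.5).astype(int)                         # thresholded classifier

acc = accuracy_score(y_true, y_pred)    # fraction of correct classifications
prec = precision_score(y_true, y_pred)  # P(is+ | test+)
rec = recall_score(y_true, y_pred)      # P(test+ | is+), the true positive rate
auc = roc_auc_score(y_true, scores)     # area under the ROC curve
```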

Performance on Subsets of Testing Set
Due to the small number of Known-z sources (60), as well as their mostly low redshifts, the classifier's Known-z performance was generally less favorable than that on the total testing set. Most notably, the crucial precision statistic is ∼83%. While this demonstrates a noticeable improvement over the null value of ∼75%, the sparse size of the Known-z testing subset renders this a statistically marginal ∼1σ improvement. While this is not as statistically significant as the total testing set's precision, for further insight the Known-z results can be compared to simulations with comparably low redshifts, also found in Table 3.
We defined low-z simulations as those with XZ redshift ≤ 0.593, the 75th percentile of Known-z redshifts, to offer a comparison between Known-z and a set of similar simulated spectra. Like Known-z, the set of low-z simulations is small, with just 80 sources. The significance of the precision on these low-z simulations is ∼1σ, similar to the performance on Known-z. The high-z simulations, on the other hand (all simulations with XZ redshift > 0.593), display ∼3σ significance, resembling the classifier's performance on the overall testing set. These similarities are further illustrated by Figure 9's ROC curves, where the total testing set's curve strongly resembles that of the high-z simulations, while Known-z's curve closely resembles that of the low-z simulations.
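The low-z/high-z division is a simple percentile cut; a sketch with mock redshift arrays (the 0.593 value in the paper arises from the actual Known-z sample, not from this synthetic draw):

```python
import numpy as np

rng = np.random.default_rng(6)
z_known = rng.lognormal(mean=-1.2, sigma=0.8, size=60)  # mock Known-z redshifts
z_sim = rng.uniform(0.01, 3.0, size=942)                # mock simulated XZ redshifts

cut = float(np.quantile(z_known, 0.75))  # 75th percentile of Known-z redshifts
low_z = z_sim[z_sim <= cut]              # "low-z" simulations
high_z = z_sim[z_sim > cut]              # "high-z" simulations
```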
Together, this evidence suggests that the classifier's reduced performance on Known-z is largely to be expected from XZ's generally lower compatibility with lowredshift sources, as well as the small number of Knownz sources. In addition, a machine learning testing set (or meaningful testing subset) would ideally employ far more than the 60 sources Known-z contains. However, despite Known-z's small size and low redshifts, the classifier was still able to offer statistically marginal but non-negligible improvement which, in conjunction with the ∼3σ improvement on the total testing set, shows that our model can serve as an effective filter on XZ results.

Feature Importance Evaluation
A fundamental aspect of any machine learning algorithm is the selection of features, to optimize the model by using only those that are informative and excluding features irrelevant to classification (see, e.g., Langley 1994). We conducted two separate analyses of our features, first by computing the amount of classification-relevant information contained in each, then by performing an ablation analysis.
For the information content approach, we used the method presented in Kubat (2017) and described below. In that work, a feature's information content (I_f) is defined as:

I_f = H_T − H_f,   (5)

where H_T is the total statistical entropy of the training set and H_f is the entropy of the feature. Statistical entropy (H) is defined in Kubat (2017) as:

H = −P_+ log₂(P_+) − P_− log₂(P_−),

where P_+ and P_− are the pass and fail rates of the considered data set. After computing the total entropy of the training set, H_T, we then considered each feature. We binned that feature's values in the training set such that the number of sources in each bin (N_i) was at least 5. The entropy was computed for each bin (H_i) and a sum over all bins yielded the total statistical entropy of the feature, H_f, given by:

H_f = Σ_i (N_i / N_T) H_i,

where N_T is the number of sources in the training set.

Table 3. The results of the machine learning analysis. The performance is assessed on the entire testing set, as well as on three of its subsets. N_src is the number of sources in each data set.
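A minimal sketch of this information-content computation on synthetic data; the IG values and labels below are mock stand-ins, and the binning choice is illustrative:

```python
import numpy as np

def entropy(y):
    """Binary statistical entropy: H = -P+ log2 P+ - P- log2 P-."""
    p = float(np.mean(y))
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_content(feature, y, n_bins=20):
    """I_f = H_T - H_f, where H_f sums per-bin entropies weighted by N_i / N_T."""
    h_total = entropy(y)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    h_f = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() >= 5:  # require at least 5 sources per bin
            h_f += mask.sum() / len(y) * entropy(y[mask])
    return h_total - h_f

rng = np.random.default_rng(2)
ig = rng.exponential(2.0, size=1000) + 2.0                    # mock IG values (bits)
labels = (ig + rng.normal(0, 2, size=1000) > 4).astype(int)   # correlated pass/fail
i_f = information_content(ig, labels)
```

Because the mock feature correlates with the labels, conditioning on its bins reduces the entropy, giving a positive information content.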
We then obtained the relevant information contained in each feature using Equation 5. As shown in Table 4, the information contents of our initial features are approximately equal, with all features containing ∼0.2-0.3 bits. This suggests that all features in our initial set are sufficiently information-dense to warrant inclusion in the model. For the ablation analysis, we removed each of the 7 features individually, fitted the resulting 6-feature model, and compared its performance to that of the full model. As shown in Figure 10, none of these iterations yielded considerably poorer results, suggesting that all features are of comparable importance. This is consistent with the findings of our information content analysis, and therefore we conclude that our 7 features each demonstrate sufficient importance for inclusion in the model.
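A leave-one-feature-out ablation loop can be sketched as below; for speed this uses a logistic-regression stand-in rather than our MLP, and the feature matrix is mock data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

features = ["XZ", "logNH", "sigma_z", "sigma_NH", "CSTAT_nu", "IG", "counts"]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 7))                       # mock feature matrix
y = (X.sum(axis=1) + rng.normal(size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_precision(cols):
    """Fit on a subset of feature columns and return testing-set precision."""
    clf = LogisticRegression().fit(X_tr[:, cols], y_tr)
    return precision_score(y_te, clf.predict(X_te[:, cols]))

full_precision = fit_precision(list(range(7)))
ablated = {name: fit_precision([j for j in range(7) if j != i])
           for i, name in enumerate(features)}       # drop one feature at a time
```

Comparing each ablated score against `full_precision` reveals whether any single feature carries disproportionate weight.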

Cross-Validation
Our ∼500-source training set is small in the context of neural networks, particularly deep learning, in which data sets ideally contain many thousands of examples. This makes our relatively small training set sensitive to the two randomized processes involved in fitting the model, namely the random split of the initial data into the training and testing sets and the initialization of the neural network's fit parameters ("weights"). The Known-z subset is especially sensitive, as it composes just ∼10% of the data. To ensure that our favorable model performance did not arise from unusually convenient splitting or initialization, we employed two cross-validation procedures to explore the impact of varying these randomization processes. Cross-validation is the practice of training and testing the model multiple times (e.g., using a different split of the data into training and testing sets each time) and comparing the results to investigate any associated effects on the performance.
Each random process is "seeded" by a random state integer. Our first method of cross-validation involved holding one of the random state integers constant while varying the other on the interval [0,50). The second method involved 1000 iterations of selecting unique pairs of random state integers, each independently randomized on the interval [0,50). In both methods, performance was evaluated using the statistical significance of the resulting precision on both the total testing set and the testing set's Known-z subset. As shown in Figure 11, there is substantial variance in model performance, as anticipated due to our relatively small data set.
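The seed-varying loop can be sketched as follows; for brevity this uses a logistic-regression stand-in for the MLP and only 20 split seeds rather than the [0,50) interval, with mock data throughout:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 7))                      # mock features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

precisions = []
for seed in range(20):  # vary the train/test split seed, fixed model seed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=seed)
    clf = LogisticRegression().fit(X_tr, y_tr)
    precisions.append(precision_score(y_te, clf.predict(X_te)))

median_precision = float(np.median(precisions))
```

The spread of `precisions` across seeds indicates how sensitive the performance is to the splitting process, and the median serves as a seed-robust summary.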
However, the median significance is nearly 3σ for the total testing set and ∼1σ for Known-z, which is highly consistent with the results of our final MLP model. Indeed, its performance falls well within the 1σ ranges of the corresponding cross-validation distributions, further instilling confidence in our results. Furthermore, the distinct distributions representing the total data set and the Known-z sources confirm our previous suspicion that the classifier's lower significance on Known-z compared to the total testing set is systematic, which we hypothesized as arising from Known-z's low redshifts. In any case, due to our reported model's consistency with the mean performance of models found in these cross-validation analyses, we can move forward with that model.

Figure 10. Results of the ablation analysis showing model performance when each feature is removed, in terms of the four accuracy metrics used in Table 3. Here, all metrics are given as decimals.

Comparison to Other Models
The final method we used to evaluate our model was to compare it with other models. We considered two additional formal frameworks (both implemented using Scikit-learn), namely logistic regression and random forest models. For each framework, we considered three sets of features (Table 5): the set of 7 features present in our final model ("final"), a set with three additional features ("all features"), and that same set of 10 features except with XZ and Counts replaced by log(1 + XZ) and log(Counts), respectively ("all features and substitutions"). We also considered two variations of the MLP neural network by using the "all features" and "all features and substitutions" feature sets.
A full breakdown of all models' performances can be found in a table in the Appendix, along with direct comparisons to our final model's performance metrics. The ROC curves for all models are shown in an accompanying figure. Also found in that figure is a manually-constructed ROC curve, achieved by varying the IG threshold from 2 to 6.7 bits (reflecting the discussion in Section 5.1.2), which offers a preliminary baseline for comparison with our model before considering the other machine learning frameworks. This curve shows that filtering with higher IG delivers a less favorable AUC ROC than our final model, which, in addition to our previous finding that a higher IG filter would sacrifice a major percentage of the data, further motivates our choice to abandon additional IG filters in favor of machine learning.
Logistic regression in machine learning generally involves fitting a logistic function to two-class training data to yield a classifier model (see, e.g., Dreiseitl & Ohno-Machado 2002). Its simplicity makes it a good baseline for evaluating more complex algorithms such as our deep neural network. For example, if a neural network's performance is only marginally better than logistic regression, it would suggest that the more complex model was not necessary for modeling the classifier. As shown in the Appendix, none of the logistic regression models rivaled the significance of our model's precision on the total testing set or on the Known-z sources. Other accuracy metrics were comparable, and in some cases better, but the substantially worse precision performance justifies the use of a complex model.
A random forest classifier repeatedly samples portions of the feature space and subsets of the training set to build an ensemble of decision trees, which represents the "forest," and averaging these trees yields the classifier (see, e.g., Abdulkareem & Abdulazeez 2021). Like logistic regression, random forest is generally simpler and less computationally expensive than a deep neural network, allowing it to serve as another baseline for evaluating our neural network. The random forest models performed better than logistic regression in terms of precision, but fell short of our MLP's success on both the total testing set and the Known-z sources, further justifying our choice of a more complex model.
The two alternate MLP models yielded results more similar to that of the final MLP, but still did not match our model's performance. While the "all features and substitutions" MLP very narrowly outperformed our model's precision on the total testing set (achieving a value of ∼89.5%, an improvement of ∼0.2% over our MLP's ∼89.3%), the "all features" MLP fell slightly short and neither of the alternate MLPs was able to match our model's precision on Known-z (though it should be noted that the "all features" MLP's Known-z precision, despite being lower than our MLP's by ∼0.2%, had a negligibly higher significance stemming from its higher recall). Therefore, we consider our final MLP model's overall performance to be the most favorable among all models considered.
It is important to note that the slightly poorer performances by these MLPs may simply be a result of the data set's previously discussed sensitivity to model initialization and train/test splitting, so it does not imply that they will always perform definitively worse than our final MLP. However, due to the slight perceived advantage of our model, and to minimize the dimension of the feature space, we conclude that the 7-feature final MLP is the best among all models considered, and the trained model has been made publicly available for download.⁵

Applying the Model
Upon applying this finalized MLP classifier as an additional filter on the data set, keeping all positively-classified sources and removing all negatively-classified sources, we retained nearly two-thirds (∼64%) of the IG-filtered data set. The combined IG and MLP filters ultimately left us with ∼53% of the initial data set, preserving the majority of all AGN originally considered. While cutting the data set down to just over half its original size may, at first, appear detrimental, it is important to note that the two layers of filtering (using IG and the MLP classifier) are specifically designed to, ideally, remove only the poor redshift estimates.

⁵ https://github.com/DominicSicilian/XZ_MLP
So, to put these retention values more rigorously into context, and in particular to evaluate how well the MLP filter retains good redshift estimates, we can use the recall of the machine learning algorithm on the testing set, as it directly gives the percentage of good redshift estimates that were correctly classified as such in that data. While this can only be computed unambiguously for the testing set, it serves as a good estimate for the No-z data set. As seen in Table 3, the model's recall on the testing set was ∼85%, indicating good overall retention of + cases. Specifically, it suggests we may have lost ∼15% of good redshift estimates in the original data set. Taking both the recall and precision on the testing set into account, we see that, while ∼15% of good estimates may have been lost, the trade-off is that the success rate of the resulting data set increased by a statistically significant amount (∼3σ over the initial pass rate, from ∼83% to ∼89%, as described in Section 5.2.1). Therefore, while we may have sacrificed a small amount of good XZ redshift estimates, we eliminated enough poor estimates to obtain a substantially cleaner overall set of redshifts. A breakdown of success rates and source retention percentages among Known-z sources across the different data sets and subsets is discussed below and can be found in Table 6.

Fully-Filtered Known-z
Here, we briefly revisit the Known-z results upon applying the new MLP classifier. Since a large portion of Known-z was included in the training set, examining the filtered Known-z results is not particularly profound, but can at least serve as another realization of the algorithm's success. The primary illustrations of this success are the improved mean of Δz_adj to ∼0.09 (from ∼0.29; still with a median of ∼0.01; see the left panel of Figure 12 compared to the top panel of Figure 4) and the improvements of all accuracy tiers (visualized by the right panel of Figure 12, compared to Figure 3).
In particular, the basic success rate has risen to ∼92% (from ∼79%), mild success increased to ∼82% (from ∼68%), and top-tier success reached ∼56% (from ∼49%). The catastrophic failure rate was reduced to less than half its initial value, as it decreased to ∼4% (from ∼11%). Most notably, the ∼82% mild success rate now exceeds that of Simmonds et al. (2018), while the top-tier success rate approximately matches that of Simmonds et al. (2018). As discussed, this data set was far less favorable for XZ than that of Simmonds et al. (2018), so seeing comparable success rates between the two data sets after applying our new filter is an excellent demonstration of the machine learning classifier's capabilities. Furthermore, when considering Section 5's pass/fail criteria (as used in the XZ failure analysis and machine learning procedure) for Known-z, we find that after applying the MLP filter, the pass rate increased considerably, from ∼70% to ∼85% (all performance metrics discussed here can be found organized in Table 6).

Table 5. The three feature sets used for model comparison. "Final" refers to the 7 features used in our final model. "All Features" refers to the final model's 7 features plus three additional features: source-to-background counts ratio (S/B), the log of exposure time (log Te), and HR. "All Feat. & Sub." refers to the same 10 features, except XZ is substituted with log(1 + XZ) and Counts is substituted with log(Counts).
Therefore, taking all of this into consideration, we conclude that at the cost of the MLP classifier removing a small minority (∼15%, as estimated in Section 5.3) of good redshift estimates, it provided the great benefit of elevating XZ's performance to match Simmonds et al. (2018) in the AGN that were retained, despite the abundance of sources with redshifts and obscurations that fall outside XZ's ideal ranges.

Caveats
While we draw positive conclusions from the simulations, XZ failure analysis, and the machine learning procedure, we acknowledge several caveats that have not yet been fully addressed. Below, we present and discuss these caveats, which are grouped into two categories, namely Model Systematics and Training Set Size.

Model Systematics
As mentioned in Section 4.2, since the simulations (1) are based on the physically descriptive Brightman & Nandra (2011) model, (2) are more numerous than the relatively small Known-z data set, and (3) contain a more uniform range of parameter values than Known-z, we consider the simulated data to be a far more robust resource for all of the resulting analysis. However, it could be argued that the implementation of the model in the simulations is too simple, namely that there are few free parameters (e.g., with a frozen Γ = 1.9), which potentially diminishes the physical accuracy of the simulated spectra. Additionally, those concerns could be extended to the use of the simulations in the analysis since, as we have acknowledged, it may introduce some level of systematic bias, particularly when training the neural network on a simulation-dominated data set. It is possible, then, that the neural network would learn more about how to identify XZ successes in the simulated data than in actual observed AGN.
We believe all such concerns have merit, but do not undermine the results presented here. Although there are, undoubtedly, more problem classes that could be investigated and resolved, either with more realistic simulations or real test data, this work already resolves an interesting and important class of failures that occur both in our simulations and, as expected, in real data, such as the Known-z spectra we have analyzed here. This is evidenced by the improvement of accuracy on the Known-z sources after applying the MLP classifier, where (as shown in Table 6) the pass rate (in terms of Section 5's pass/fail criteria) greatly increased from ∼69.5% to ∼84.5% and all tiers of success rates (as defined in Table 1) improved to levels consistent with the findings of Simmonds et al. (2018). Even when isolating only the Known-z sources in the testing set (to avoid bias from sources that helped train the MLP model), we showed in Table 3 that there was a statistically marginal (∼1σ) yet noticeable increase in the pass rate (from ∼75.0% to ∼82.9%).
Furthermore, specifically regarding the simulated models, we closely followed precedent set by, e.g., Simmonds et al. (2018) and Peca et al. (2021) in our choices of frozen parameters. In the case of Γ, we froze the parameter at 1.9 just as both Simmonds et al. (2018) and Peca et al. (2021) have done, largely due to its degeneracy with z and N_H, which are already degenerate with each other. Such degeneracy is minimized in simulated data by eliminating Γ as a free parameter to focus primarily on z and N_H. In addition, as suggested by Peca et al. (2021) (and references therein), given the low-counts nature of the majority of our data set, it is possible that using a more complex model would not have a positive impact on the work. Specifically, Peca et al. (2021) discusses the negative impact that more complicated models can have on estimating X-ray redshifts. Therefore, taking all concerns and counterpoints into consideration, we conclude that the lingering systematic effects of the simulations and models do not cause a substantial detriment to the analysis within the scope of this work and the problem classes it studies.

Training Set Size
One issue we have discussed is the size of our training data set, which we accounted for through our cross-validation analysis but have not yet fully addressed at its root. As mentioned, it is a relatively small training set for a deep neural network, which is a consequence of only producing 1000 simulations. Hence, it could be argued that this issue could be rectified by simply producing a larger set of simulations.
However, we do not believe a larger set of simulations would substantially improve the results. This is due to the limited number of possible meaningfully different simulations. As described in Section 4.2, there are three free simulation inputs, namely counts (effectively spanning 150 to well over 5000, in increments of 1), z (spanning 0.01-3, in increments of 0.01), and log N_H (spanning 22-26, in increments of 0.01). While there are, therefore, thousands of unique combinations of those inputs, our analysis in Section 5 suggests that there are relatively few meaningfully distinct simulations (i.e., distinguishable by the XZ procedure in its successes and failures).
It is clear by visual inspection of the input counts, z, and log N_H XZ failure rate plots in Figure 8 that there is a limited number of distinguishable failure cases. Specifically, there appear to be roughly 5-11 unique intervals of counts showing changing failure rates compared to surrounding intervals. Likewise, there appear to be approximately 3-12 such intervals for z and around 5-15 for log N_H. For example, with z, there are above-average failure rates between 0 and 0.75, below-average rates between 0.75 and ∼2.25, and above-average rates above ∼2.25, with some additional possible sub-classes of intervals viewable within each of those intervals.
From these estimates, we can approximate that there are between ∼75 and ∼1980 combinations of input parameters that are distinguishable by XZ in terms of its failure rate. Our choice of 1000 simulations, then, falls almost exactly in the center of this range, meaning that it approximately represents the maximum number of meaningfully different spectra. Therefore, if we were to produce substantially more simulations, particularly if we increased the training set size by orders of magnitude, the vast majority of them would effectively be duplicate entries in the training and testing set. This would be highly detrimental to the machine learning algorithm, as it would cause the model to become overfit to the same set of ∼1000 combinations we have already evaluated, thus depleting its predictive power on general data sets. Incidentally, this issue, too, could be further investigated using more realistic, more complex models, as the larger number of free parameters may offer a more robust set of meaningfully different spectra.
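The bounds above follow from simple multiplication of the estimated interval counts; a minimal arithmetic sketch (the per-parameter interval counts are the rough values read off Figure 8):

```python
# Bounds on the number of simulations distinguishable by XZ failure rate,
# from the approximate interval counts estimated from Figure 8.
counts_intervals = (5, 11)   # distinct failure-rate intervals in counts
z_intervals = (3, 12)        # distinct intervals in redshift
nh_intervals = (5, 15)       # distinct intervals in logN_H

lower = counts_intervals[0] * z_intervals[0] * nh_intervals[0]
upper = counts_intervals[1] * z_intervals[1] * nh_intervals[1]
print(lower, upper)  # 75 1980
```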

Takeaways
While these counter-arguments do not wholly eliminate our aforementioned concerns about systematic bias towards the simulations, the observed effectiveness of the machine learning model on the Known-z data set shows that the procedure in its current form provides useful results within the scope of the work. Due to the clear success, despite lingering concerns, we consider this analysis to be a strong foundational work for future studies of the XZ method and X-ray redshifts that can build upon the procedures employed here.

RESULTS
Below, we describe our results. In Section 6.1, we provide the X-ray redshift catalog for our No-z sources, followed in Section 6.2 by a breakdown of the total combined Known-z and No-z data set to further validate and characterize the methodology.

Redshift Catalog for No-z
After applying the MLP classifier to the IG-filtered No-z, we were left with a final No-z data set of 121 sources, all of which were X-ray selected. Given the performance of our MLP classifier on the total machine learning testing set, Known-z sources in the testing set, and the total Known-z data set, we can estimate that between ∼82% and ∼89% of the redshifts documented in this catalog are consistent with the mild- to top-tier success standards. The distribution of the catalog's redshift values in the final data set is given in Figure 13, which shows that we find redshifts mostly between zero and ∼1.2, with various higher redshift results, up to z ∼ 5. We find 74 new redshifts between 0.5 and 3.0, a notable result since obscured phases in this range are where most supermassive black hole growth occurs (see, e.g., Ueda et al. 2003; Treister et al. 2012; Buchner et al. 2015), making our catalog abundant with ideal sources for studying AGN accretion and evolution. We report this catalog of redshifts in the Appendix, where we offer a basic breakdown of the sources, including key features such as the CSC source names, coordinates, counts, and various model characteristics.
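As a rough illustration of how such a catalog filter can be applied, the sketch below trains a scikit-learn multi-layer perceptron on synthetic stand-ins for the 7 features and keeps only sources whose success probability P+ clears a cutoff; the feature values, labels, network architecture, and 0.5 threshold are all illustrative assumptions, not the configuration used in this work.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 7 XZ fit/spectral features and the
# success (1) / failure (0) labels; the real feature set is described
# in Section 5 of this work.
X = rng.normal(size=(1000, 7))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
clf.fit(X, y)

# P+: predicted probability that an XZ redshift meets the success
# criteria; sources below the (assumed) threshold would be discarded.
p_plus = clf.predict_proba(X)[:, 1]
keep = p_plus >= 0.5
```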

Characterizing the Overall Data Set
In Table 6, the number of sources in the data set at each stage of filtering is given. The Known-z success rates and catastrophic failure rates described above are also given in Table 6. In addition, the aforementioned Figure 13 contains the total MLP-filtered data set's N_H distribution, showing that most sources have a logN_H between 22 and ∼24. Some sources exhibit highly-obscured logN_H ∼ 24, and the distribution has a median value of logN_H ∼ 23. We found one possible Compton-thick candidate, which we analyze and discuss further in Section 7.2.
The total data set's σ(z) after the machine learning filter is shown in Figure 14, which shows slight improvement over the initial IG-filtered distribution (Figure 4), with the mean improving from ∼0.19 to ∼0.15 and the median improving from ∼0.13 to ∼0.11. As previously discussed, the strong tendency towards zero, median of ∼0.1, and mean of ∼0.2 are consistent with the expectations set by Simmonds et al. (2018) and are largely controlled by filtering for high IG. Another parameter of interest shown in Figure 13 is Γ, where we find a strong tendency towards the median value of ∼1.9, which is consistent with expectations for obscured AGN. This is a somewhat trivial result, as XZ uses a Gaussian distribution centered at 1.95 as the prior for Γ, and low-count spectra may not substantially inform Γ. The N_H and Γ distributions have been revisualized in Figure 15, where they are plotted together.
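The information gain (IG) filter referenced above can be sketched as the KL divergence, in bits, of the redshift posterior from the prior over the z grid, in the general spirit of Simmonds et al. (2018); the flat prior and Gaussian posterior below are hypothetical stand-ins, not outputs of the actual fits.

```python
import numpy as np
from scipy.stats import entropy

# Redshift grid matching the simulation range used in this work.
z = np.arange(0.01, 3.01, 0.01)

# Hypothetical flat prior and a well-peaked posterior on that grid.
prior = np.full_like(z, 1.0 / len(z))
post = np.exp(-0.5 * ((z - 1.2) / 0.05) ** 2)
post /= post.sum()

# Information gain in bits: KL(posterior || prior) with base-2 logs.
# A sharply peaked posterior like this one easily clears IG >= 2 bits.
ig_bits = entropy(post, prior, base=2)
```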
For the best-fit parameters, reflected in Figure 13's histograms, it is important to note that these values are, themselves, the medians of distributions. In particular, BXA's Nested Sampling fitting algorithm yields a posterior probability distribution for each parameter, and the best-fit value is simply the median of that parameter's posterior distribution. To fully and properly visualize the parameters, the Appendix contains a figure showing the posterior probability distributions for the XZ redshift, logN_H, and Γ for all sources in the final data set.
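A minimal sketch of reading a best-fit value and 1σ interval off such a posterior, assuming equally-weighted posterior samples like those returned by nested sampling (the sample values here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical equally-weighted posterior samples for a redshift.
z_samples = rng.normal(1.8, 0.1, size=5000)

# The quoted best-fit value is the posterior median; the 1-sigma
# bounds are the 15.87th and 84.13th percentiles.
z_best = np.median(z_samples)
z_lo, z_hi = np.percentile(z_samples, [15.87, 84.13])
```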
The goodness of fit is also highly favorable in general, as shown in Figure 14. The reduced fit statistic (χ²_ν) converges well around a median of ∼1.16, with few poor fits. Most fits exhibit χ²_ν < 1.4, with the vast majority of all sources being fitted within the ∼90% confidence interval, which corresponds to χ²_ν < 1.7 and offers further confidence in our results.

Figure 13. The distribution of XZ-modeled best-fit parameter values for the fully-filtered data set. For a given spectrum, the best-fit value of each parameter is the median of that parameter's posterior probability distribution. Parameters shown are Upper: logN_H, Middle: redshift, and Lower: power-law photon index (Γ). The posterior distributions from which these best-fit values were obtained can be found in the Appendix.

Figure 14. Upper: The XZ redshift uncertainty distribution for the total fully-filtered data set, using the same format as seen in the lower panel of Figure 4. Lower: The distribution of the reduced test statistic χ²_ν. The median and mean values are represented by solid and dotted lines, respectively.
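As a hedged illustration of where a threshold like χ²_ν < 1.7 comes from: the 90% point of the reduced statistic depends on the number of degrees of freedom ν, and for an assumed, illustrative ν = 8 (a plausible value for a low-count binned spectrum) it evaluates to ∼1.67:

```python
from scipy.stats import chi2

# 90th percentile of the chi-squared distribution for nu degrees of
# freedom, expressed as a reduced statistic (divided by nu).
nu = 8  # assumed, illustrative degrees of freedom
threshold = chi2.ppf(0.90, nu) / nu
print(round(threshold, 2))  # ~1.67
```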

AGN Flux Variability
A possible objection to our handling of data, alluded to in Section 3, is the stacking of spectra for sources with multiple observations ("multi-spectra sources" hereafter) due to the potential for some such sources to be variable AGN. The goal of the spectral stacking process is to optimize the XZ modeling results by maximizing the counts for a given multi-spectra source, but it is possible that the results may instead be negatively impacted if the source is a flux-variable AGN, which could potentially lead to issues such as a poorly-constrained redshift or an inaccurate redshift estimate.
In our data set, however, the overall trend suggests that any XZ modeling issues related to AGN variability are non-existent or minor, and that the increase of statistics from the stacking process outweighs any such issues. This is demonstrated in Figure 16, which shows the proportion of multi-spectra sources and variable multi-spectra sources (as identified by the CSC master source variability flag) in the data set, both for the initial data set (after AGN selection, but before any XZ modeling or filtering) and the final data set (the fully-filtered data).

Figure 15. The distribution of best-fit XZ logN_H vs. Γ for the total fully-filtered data set.
As seen in the figure, the data set was initially ∼30% multi-spectra sources, a proportion that increased to ∼38% in the final data set. Since the final data set has been filtered using IG, which generally removes poorly constrained redshifts, and using the MLP classifier, which removes inaccurate redshift estimates, the increased proportion of multi-spectra sources is consistent with expectations that the increased counts of multi-spectra sources would lead to more favorable XZ results. This is evident in Table 7, which shows that multi-spectra sources have substantially higher mean and median counts than single-spectrum sources in both the initial and final data sets.
The proportion of variable AGN in the total data set also increased, from ∼23% to ∼28%. Moreover, the proportion of variable AGN among the multi-spectra sources saw only a slight decrease (from ∼77% to ∼74%), consistent with a statistical fluctuation (∼0.3σ), showing that the variable multi-spectra AGN demonstrated approximately the same level of XZ success as their non-variable multi-spectra counterparts. These results suggest that if stacking variable AGN spectra has a negative impact on the XZ modeling process, any such impact is minor and overcome by the increased counts in multi-spectra sources. Therefore, we conclude that AGN variability introduces no serious issues to our data reduction or analysis procedures. This finding is strongly reinforced by the fact that Simmonds et al. (2018) made extensive use of stacked spectra in the original formulation of XZ.

Table 7. Mean and median counts in the 0.5-7 keV Chandra broad band for sources with one observation (Single Spectrum) and with multiple stacked spectra (Multi-Spectra), given for both the initial and fully-filtered final data sets.

           Single Spectrum        Multi-Spectra
           Mean      Median       Mean      Median
Initial    992       413          9777      2392
Final      1436      542          8115      2519

Figure 16. Proportion of multi-spectra sources and multi-spectra variable sources in both the initial and final data sets.
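A back-of-the-envelope check of how small the quoted ∼77% → ∼74% change is: a standard two-proportion z-test with hypothetical group sizes of roughly the right order shows the difference is well under 1σ (the sample sizes below are assumptions, not the actual source counts):

```python
import math

# Hypothetical group sizes for the initial and final multi-spectra
# samples; the fractions are the values quoted in the text.
p1, n1 = 0.77, 110  # initial multi-spectra sources (assumed n)
p2, n2 = 0.74, 73   # final multi-spectra sources (assumed n)

# Pooled two-proportion z-test.
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_score = (p1 - p2) / se  # well below 1 sigma
```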

BEYOND REDSHIFTS
Thus far, we have been almost exclusively concerned with measuring and estimating redshifts. Here, we offer a discussion of results focused on other parameters. We first examine our general obscuration findings in Section 7.1, then focus on the possible Compton-thick candidate in Section 7.2.

CSC versus XZ: Obscuration
In Figure 17, we have revisualized the basic N_H distribution from Figure 13 by comparing the XZ-fitted values to those documented in the CSC. In particular, we used the CSC per-observation N_H for each spectrum, taking the mean value for sources with multiple spectra, though typically the per-observation N_H value is virtually identical for all such spectra, so the averaging process is generally trivial. As reflected by Figure 17, there is generally good agreement between the two N_H measurements, with a moderate Pearson correlation coefficient of ∼0.6 and a mean difference of just ∼0.4 between them. This is favorable for the CSC's reported obscuration values, as the XZ AGN model is substantially different from the CSC's simpler model. The CSC employs a simple absorbed power law for all X-ray sources, which assumes z = 0 and does not account for the AGN structure that is considered by XZ. That the CSC recovers, on average, approximately the same obscuration values as XZ strongly suggests that the CSC's values are fairly reliable estimates of the true source N_H.
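A sketch of the comparison behind Figure 17, computing the Pearson coefficient and mean absolute offset between the two logN_H estimates; the arrays below are synthetic stand-ins, not the actual per-source values:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Hypothetical per-source logN_H values: XZ fits, and CSC values
# scattered around them with ~0.5 dex of noise.
lognh_xz = rng.uniform(22, 24.5, size=120)
lognh_csc = lognh_xz + rng.normal(0.0, 0.5, size=120)

r, _ = pearsonr(lognh_xz, lognh_csc)
mean_diff = np.mean(np.abs(lognh_xz - lognh_csc))
```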
In some individual cases, however, we have found considerable differences between the XZ best-fit N_H and the CSC N_H, suggesting XZ may be more effective at detecting high obscuration. In a particularly notable case, we have identified a possible Compton-thick candidate (2CXO J215141.1-055049). A Compton-thick AGN is defined by N_H ≥ 1.5 × 10^24 cm^−2 = σ_T^−1, where σ_T is the Thomson cross-section. This definition corresponds to logN_H ≥ 24.2, and for the source in question, the best-fit logN_H ∼ 24.0 nearly reaches that threshold, with the 1σ error range extending beyond logN_H = 24.2, thus demanding consideration as a Compton-thick candidate.
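The Compton-thick threshold quoted above can be verified directly from the Thomson cross-section:

```python
import math

# Compton-thick threshold: the column density at which the Thomson
# optical depth reaches unity, N_H = 1/sigma_T.
sigma_T = 6.6524587e-25      # Thomson cross-section, cm^2
nh_ct = 1.0 / sigma_T        # ~1.5e24 cm^-2
log_nh_ct = math.log10(nh_ct)
print(round(log_nh_ct, 1))   # 24.2
```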

Compton-Thick Candidate
The spectrum of the Compton-thick candidate fitted with the XZ AGN model is shown in Figure 18, and a breakdown of the source's characteristics is given in Table 8 showing, among other quantities, its high 2-10 keV X-ray luminosity (logL_X ∼ 46) and considerable CSC-documented hardness ratio (HR ∼ 0.999). As mentioned, the 1σ error range on its best-fit logN_H ∼ 24.0 includes values beyond the Compton-thick threshold, and XZ found a best-fit z ∼ 2. This redshift estimate is assigned a high probability of accuracy by the machine learning algorithm (P+ ∼ 0.99), which suggests it has a ∼99% chance of meeting our mild- to top-tier success thresholds. Our cross-match attempts did not yield a counterpart for this source in any of the databases considered (Gaia, SDSS, WISE, and Simbad), so it does not have a redshift or any other multiwavelength data associated with it, and it has only one available spectrum in the CSC. The XZ model produced a fairly good fit, with a test statistic χ²_ν ∼ 1.39. Also, as shown in Figure 18's accompanying corner plots, the key parameters Γ, logN_H, and z are fairly well constrained, with the posterior distributions showing strong peaks around the best-fit values.
The spectrum, containing 605 source counts in the Chandra broad band, has a source-to-background counts ratio of just ∼0.1; it is dominated by background and hence has a peculiar flat shape. As a result, the total XZ model for this source is mostly background, although the torus component is also prominent, while the scattering and APEC components offer virtually no contribution. This is illustrated in Figure 18, where we have plotted the total model, as well as the background and torus components, showing the negligible nature of the other components. The torus model appears promising due to its high hardness ratio, which is consistent with the CSC-reported value and suggests the source is a highly-absorbed, possibly Compton-thick AGN. However, to arrive at a more definitive conclusion, additional source counts are required to overcome the current background-dominated spectrum's low counts.
There also appears to be an excess at ∼6.5 keV, just beyond the highest possible observed energy for an AGN's Fe Kα line (assuming z ≥ 0), whose rest-frame energy is E = 6.4 keV. There is no known 6.5 keV line in either the Chandra instrumental background (Bartalucci et al. 2014) or the CXB (e.g., Hickox & Markevitch 2006), and we do not expect any lines at that energy for a z ∼ 2 AGN. Accordingly, fitting the excess with an emission component does not significantly improve the model, and the excess is found to be consistent only with a statistical fluctuation, as its significance is between ∼0 and ∼1σ across four different methods of significance testing. Namely, we employed Δχ², the Bayesian Information Criterion (BIC; Schwarz 1978), the Akaike Information Criterion (AIC; Akaike 1974), and the Bayes Factor (Kass & Raftery 1995; Friel & Wyse 2012). Note that the BIC approach merely approximates the Bayes Factor and was computed using CSTAT, while the Bayes Factor was computed directly using the statistical evidence (Z) supplied by the BXA modeling procedure.
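A schematic of the ΔBIC/ΔAIC style of comparison used here, with hypothetical numbers: the added line component must improve the fit statistic by more than its complexity penalty, and a small improvement fails both criteria.

```python
import math

# Hypothetical model-comparison inputs; none of these are the actual
# fit values for the candidate source.
n_bins = 100      # number of spectral bins (assumed)
k_extra = 2       # extra free parameters for the line (e.g., norm, energy)
delta_stat = 2.0  # hypothetical improvement in the -2 ln L (CSTAT) statistic

# Improvement minus penalty: negative values mean the penalty
# outweighs the improvement, so the line is not justified.
delta_bic = delta_stat - k_extra * math.log(n_bins)
delta_aic = delta_stat - 2 * k_extra
```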
Taking all of this into account, it is clear that, due to the potentially Compton-thick obscuration level, background-dominated spectrum, and lack of counterparts, this source is of interest and warrants future study. It is important to further investigate this source and potentially verify its status as a Compton-thick AGN, due to the important role such highly-obscured cases may have in our overall understanding of AGN. Despite the expected abundance of Compton-thick AGN, thought to compose ∼10-50% of the total AGN population (see Akylas & Georgantopoulos 2009; Vignali et al. 2010; Alexander et al. 2011; Buchner et al. 2015; Lanzuisi et al. 2015; Azadi et al. 2017), relatively few have been observed due to their high obscuration (e.g., Comastri et al. 2010; Vignali et al. 2014). X-ray surveys are known to reveal highly-obscured AGN via the hard band (e.g., Rees 1984; Ueda et al. 2003), so confirming the Compton-thick nature of this AGN would be a useful continuation of this impactful X-ray trend.

SUMMARY & CLOSING REMARKS
In this work we have pushed and extended the limits of XZ with the help of the CSC, simulations, and machine learning techniques. Using a combined data set of simulated spectra and CSC-searched, largely X-ray selected obscured AGN with documented multiwavelength redshifts, we analyzed XZ's failure rate, specifically its dependence on various fit and spectral characteristics. This culminated in the formulation of a machine learning algorithm in which we used 7 features to classify a given XZ fit as either a success or failure using a multi-layer perceptron neural network. We showed that our classifier exhibits a statistically significant ∼3σ improvement over applying only the existing redshift IG ≥ 2 bits filter. After screening our entire data set with the machine learning model, we were left with ∼53% of the original data set (∼64% of the IG-filtered data). Despite our aggressive treatment and added filtering, this still leaves us with a higher percentage of the original sources than what was seen in Simmonds et al. (2018)'s CDF-S analysis, where the IG filter alone left only ∼23% of the original sources. We are left with a lower percentage than Peca et al. (2021)'s ∼70%, though in that case the 54-source data set was from a deep Chandra field with much more favorable redshifts, with z ≥ 0.5 for ∼85% of sources versus ∼47% for our data set. With the fully-filtered data set, we produced redshift estimates for 121 CSC-searched obscured AGN with no cross-matched documented redshift values, all of which were selected using only X-ray data. The majority of these redshifts were found between 0.5 and 3, the ideal range for studying supermassive black hole growth due to its prevalence in that era of cosmological history. Examining the XZ best-fit N_H values, we have identified a possible Compton-thick source that will be the subject of further investigation.

Table 8. A breakdown of the Compton-thick candidate's characteristics. Ns is the number of available spectra; S/B is the source-to-background counts ratio; logL_X is the rest-frame 2-10 keV luminosity; and P+ is the MLP-computed probability that the redshift estimate is successful.
The findings of this work further demonstrate the combined capabilities of X-ray observatories and data science techniques, such as machine learning, in astrophysics. Through our use of X-ray spectra and a neural network, we reinforced the work of Buchner et al. (2014a) and Simmonds et al. (2018), among others, in showing that redshifts can be reliably estimated for obscured AGN using only X-rays, particularly for X-ray AGN selected without successfully-matched multiwavelength counterparts. As indicated by our final machine learning model's precision, we expect nearly 90% of our 121 new redshift values to be consistent with hypothetical spectroscopic or photometric measurements, illustrating the reliability of our XZ and neural network combination.
We have also further demonstrated the wide applicability of the Chandra Source Catalog, which supplied us with all sources and their corresponding fully-reduced X-ray data products. This work, then, is an example of the large-scale science and data analysis that can be performed with the CSC, and this is all while CSC 2.0 only contains observations up to 2014. Hence, these capabilities will continue to grow with future CSC releases as they catalog more recent data. Moreover, we expect the impact of this work to go beyond the machine learning model we have produced, serving more generally as a foundational building block for future computations of redshifts using only X-ray data. In particular, given the amount of progress our methodology was able to make using a relatively small training set by deep learning standards, as well as the moderate success of other machine learning frameworks we tested, we can reasonably hypothesize that larger future data sets will produce more advanced and more effective models.
In the midst of recently-launched and future missions such as Athena, this fortification of XZ with machine learning algorithms will allow for wide-reaching analysis of the large data sets these observatories can offer in their early days. For example, it will be difficult to match counterparts with obscured AGN detected by Athena due to its large PSF. With a machine learning-infused XZ, AGN can be selected using the observed hardness ratios, and redshifts can be reliably estimated for the obscured sources among them, bypassing the wait required when relying on multiwavelength counterparts. Furthermore, the machine learning aspect of this approach was formulated for Chandra data, hence it can and should be thoroughly examined in the context of these new instruments, as well as in the context of other observatories such as XMM-Newton, to maximize its reach of applicability.
At minimum, this work serves as a proof-of-concept, showing the success of using machine learning to enhance the XZ method, opening the door for further similar improvements. Thus, with the enormous amount of X-ray data and machine learning techniques on the proverbial horizon, the future of X-ray redshifts and large-scale X-ray data analysis is remarkably luminous.

ACKNOWLEDGMENTS
DS acknowledges the University of Miami for supplying funding throughout the completion of this work. DS also acknowledges Martin Elvis, Giuseppina Fabbiano, and J. Rafael Martínez-Galarza for useful, enlightening discussions. FC acknowledges the support of NASA Contract NAS8-03060. The authors thank the anonymous referee for helpful commentary.
Software: Astropy (Astropy Collaboration et al. 2013, 2018), Bayesian X-ray Analysis (Buchner et al. 2014a), CIAO (Fruscione et al. 2006), corner (Foreman-Mackey 2016), Matplotlib (Hunter 2007), NumPy (van der Walt et al. 2011), pandas (McKinney 2010; The pandas development team 2020), Scikit-learn (Pedregosa et al. 2012), Sherpa (Freeman et al. 2001), XSPEC 12.10.1f (Arnaud 1996).

APPENDIX

The machine learning model comparison mentioned in Section 5.2 can be found below in Table 9, while the corresponding ROC curves can be found in Figure 19. The redshift catalog mentioned in Section 6.1 can be found in Table 10. Posterior distributions for parameters can be found in Figure 20.

Table 10. The X-ray redshift catalog for the 121 sources in our fully-filtered No-z data set. Ns is the number of available spectra; Cts. is the number of Chandra broad band spectral counts; logL_X is the rest-frame 2-10 keV luminosity; and P+ is the MLP-computed probability that the redshift estimate is successful.