Optimized Photometric Redshifts for the Cosmic Assembly Near-Infrared Deep Extragalactic Legacy Survey (CANDELS)

We present the first comprehensive release of photometric redshifts (photo-z's) from the Cosmic Assembly Near-Infrared Deep Extragalactic Legacy Survey (CANDELS) team. We use statistics based upon the Quantile-Quantile (Q--Q) plot to identify biases and signatures of underestimated or overestimated errors in photo-z probability density functions (PDFs) produced by six groups in the collaboration; correcting for these effects makes the resulting PDFs better match the statistical definition of a PDF. After correcting each group's PDF, we explore three methods of combining the different groups' PDFs for a given object into a consensus curve. Two of these methods are based on identifying the minimum f-divergence curve, i.e., the PDF that is closest in aggregate to the other PDFs in a set (analogous to the median of an array of numbers). We demonstrate that these techniques yield improved results using sets of spectroscopic redshifts independent of those used to optimize PDF modifications. The best photo-z PDFs and point estimates are achieved with the minimum f-divergence using the best 4 PDFs for each object (mFDa4) and the Hierarchical Bayesian (HB4) methods, respectively. The HB4 photo-z point estimates produced $\sigma_{\rm NMAD} = 0.0227/0.0189$ and $|\Delta z/(1+z)|>0.15$ outlier fraction = 0.067/0.019 for spectroscopic and 3D-HST redshifts, respectively. Finally, we describe the structure and provide guidance for the use of the CANDELS photo-z catalogs, which are available at https://archive.stsci.edu/hlsp/candels.


INTRODUCTION
Cosmic Noon (z ∼ 2) saw the most intense star formation in the history of the universe: the rapid buildup of galactic disks and bulges, and the end of star formation in the most massive galaxies. Deep galaxy surveys investigate this critical epoch of galaxy formation by tracking the evolving galaxy demographics via the galaxy luminosity function (e.g., Cole et al. 2001; Reddy & Steidel 2009; Finkelstein et al. 2015), stellar mass function (e.g., Marchesini et al. 2009; Muzzin et al. 2013; Duncan et al. 2014; Tomczak et al. 2014; Davidzon et al. 2017), and the stellar mass-star formation rate relation (e.g., Noeske et al. 2007; Whitaker et al. 2012; Speagle et al. 2014; Salmon et al. 2015; Sandles et al. 2022). The Cosmic Assembly Near-Infrared Deep Extragalactic Legacy Survey (CANDELS; Grogin et al. 2011; Koekemoer et al. 2011) combines high spatial resolution Hubble Space Telescope (HST) Wide Field Camera 3 (WFC3)/IR and Advanced Camera for Surveys (ACS) imaging with existing intermediate-resolution optical and IR imaging down to faint limits (5σ point-source limit of H ∼ 27 for wide and H ∼ 27.7 for deep) to extend our knowledge of these important galaxy evolution probes. However, the measurements required for these probes rely on accurate redshifts.
Consequently, there have been multiple concerted efforts on 8-10 m class telescopes to obtain spectroscopic redshifts (spec-z's) for CANDELS galaxies (e.g., VVDS, Le Fèvre et al. 2005; zCOSMOS, Lilly et al. 2007; VUDS, Le Fèvre et al. 2015; and VANDELS, McLure et al. 2018). Spec-z's are the gold standard for accurate redshifts, but the galaxies with successfully measured spec-z's tend to be brighter, at lower redshifts, and preferentially biased toward star-forming galaxies with strong emission lines. Given the resource-intensive nature of measuring spec-z's, spectroscopic follow-up of a large and representative sample, especially of the faintest CANDELS objects, is impossible with current and near-future facilities (Newman et al. 2015). Grism-based redshifts (grism-z's) from HST (e.g., 3D-HST, Momcheva et al. 2016) can help alleviate some of the issues with ground-based spec-z's because they can cover many more galaxies and probe a more representative population of galaxies (Bezanson et al. 2016). Nevertheless, grism spectroscopy is technically challenging, does not cover the full depth of the CANDELS imaging, and is not available for the whole CANDELS footprint. Thus, we must rely on photometric redshifts (photo-z's) for distance information (Baum 1957) across the full range of galaxies, albeit at lower accuracy and precision than spec-z's or grism-z's.
Photo-z estimation approaches mostly fall into two main groups, machine learning (ML)-based methods and template-based methods, with the optimal approach depending on the amount and quality of training data and the science goals (see Salvato et al. 2019 and references therein). ML methods estimate redshift from galaxy photometry by training an ML model on galaxies with spec-z's. ML methods outperform template-based methods in the regime of complete and representative coverage of color space, such as Sloan Digital Sky Survey photo-z efforts (e.g., Beck et al. 2016; Pasquet et al. 2019; Hayat et al. 2021; Dey et al. 2022a), where data-driven interpolation is very accurate. Template-based methods involve redshifting spectral energy distribution (SED) templates (or combinations thereof) to find the best fit to a galaxy's SED. Template-based methods are superior to ML methods when spectroscopic coverage is incomplete and/or biased (e.g., in deep fields) because they leverage our knowledge of galaxy physics. Hence, only template-based methods have been employed for CANDELS, owing to the biased and incomplete spec-z training sets. However, shortcomings in our knowledge of galaxy physics can limit the success of template-based methods; issues include template mismatch (i.e., when the SED templates differ from the observed SEDs), incorrect treatment of dust attenuation, and inappropriate priors (Salvato et al. 2019; Newman & Gruen 2022).
Template libraries fall into two broad classes, empirical and theoretical, each with its own relative advantages and disadvantages. Empirical templates (e.g., Coleman et al. 1980 and Kinney et al. 1996) are principally derived from spectroscopic observations of nearby galaxies (supplemented with model spectra to increase wavelength coverage), so they can accurately capture spectral features in real galaxies. However, they are based on a relatively small set of observations at very low redshift and do not span the full range of potential ages, metallicities, or galaxy types in CANDELS. Empirical templates can be optimized for high-redshift galaxies through calibration (e.g., Ilbert et al. 2006), but this process requires a large spectroscopic redshift sample that is usually not available and not representative of the photometric galaxy sample.
Theoretical template sets are constructed from stellar population synthesis models (e.g., Fioc & Rocca-Volmerange 1997, Bruzual & Charlot 2003, and Maraston 2005), so they cover more of color-redshift space and perform better at high redshift (Brammer et al. 2008). All of the groups in this work used theoretical template libraries. However, theoretical template libraries cannot fully cover color-redshift space, and the mismatch between template and observed SEDs remains one of the biggest sources of systematic uncertainty in photo-z estimation using template-based methods (Newman & Gruen 2022). Furthermore, theoretical template libraries must assume star formation histories (SFHs) for their template SEDs, include (or not) emission lines, and adopt a dust attenuation law. These choices represent a simplified version of the properties of real galaxies, especially given the wide range in redshift, luminosity, color, and galaxy type of CANDELS galaxies. Priors can help break degeneracies between multiple redshift solutions by incorporating knowledge about the galaxy population as a whole (e.g., the abundance of galaxies as a function of luminosity and redshift), but they can introduce biases. For instance, we show below that the stacked redshift probability density functions (PDFs) from different codes produce different redshift peaks because of differences in template sets and assumed priors, even though they all perform well according to the point estimate metrics used.
Photo-z methods can be evaluated using spectroscopic training sets (e.g., Hogg et al. 1998; Hildebrandt et al. 2008, 2010), clustering measurements (Newman 2008), and mock catalogs generated from simulations (Schmidt et al. 2020), but for CANDELS, only the first is a viable option. Most relevant to this work is the comparison by Dahlen et al. (2013), which studied 11 photo-z analyses of the same CANDELS photometry and spec-z data set. These analyses utilized a diversity of codes, template sets, and priors, all of which impact the resulting photo-z PDFs. Dahlen et al. (2013) found that none of the photo-z codes significantly outperformed the others. However, the analyses that used spec-z's to calibrate the photo-z PDFs (e.g., by implementing zero-point offsets to the photometry or modifying the templates) achieved better results. They found that photo-z PDFs from individual analyses were generally too narrow to be consistent with the statistical definition of a PDF (as did Bezanson et al. 2016), for which they implemented a correction. Finally, they were able to improve the photo-z PDFs by combining the PDFs from individual analyses via several methods: taking the median, summing PDFs, and a hierarchical Bayesian approach.
This paper builds on Dahlen et al. (2013) by comparing a more recent effort by six groups running five different photo-z codes with a larger spec-z data set and grism-z's to better study performance at fainter magnitudes, where the low signal-to-noise ratio of the photometry can drive discrepant results between analyses. We calibrate the PDFs from individual codes and introduce a better method for combining PDFs, the minimum f-divergence (Rényi 1961). We also test the performance of each group's PDFs using point estimates (specifically, the weighted-mean redshift z_weight). Our results are presented as the final CANDELS photo-z catalog. This paper is organized as follows. Section 2 describes the data sets employed in this paper. Section 3 explains the methods used to improve PDFs from individual photo-z codes and assesses the results with a variety of summary statistics. In Section 4 we test several methods of combining PDF results from multiple photo-z codes. Section 5 presents the final photo-z catalog for the CANDELS fields. Finally, Section 6 summarizes the major findings of this study and provides guidance on the use of these catalogs.
Throughout the paper we assume AB magnitudes unless stated otherwise. In order to allow direct comparison with existing works from the literature, we adopt a flat Λ cold dark matter cosmology with h = H_0/[100 km s^-1 Mpc^-1] = 0.7, Ω_m = 0.3, and Ω_Λ = 0.7.

Photometry
Throughout this study we use data from the CANDELS collaboration. CANDELS obtained observations in HST/WFC3 F105W, F125W, and F160W as the primary exposures and HST/ACS F606W and F814W as parallel exposures. These observations covered five fields spanning a total area of ∼ 800 arcmin² on the sky, with ∼ 125 arcmin² in the deep portion and ∼ 675 arcmin² in the wide portion. The deep portion has a depth of ∼ 13 orbits per pointing (spread across three filters) and includes data in two of the CANDELS fields, GOODS-North and GOODS-South. The wide portion consists of ∼ 2 orbits per pointing and includes buffer regions around the deep fields as well as three additional fields: COSMOS, the Extended Groth Strip (EGS), and the UKIDSS Ultra-Deep Survey field (UDS). Detailed descriptions of the sky coverage and observing strategy of CANDELS can be found in Grogin et al. (2011) and Koekemoer et al. (2011).
One of the most powerful aspects of the CANDELS data set is the wealth of complementary photometry for these heavily studied extragalactic fields. We utilized matched-model TFIT (Laidler et al. 2007) photometry (from the v1 photometric catalogs) from ground- and space-based UV to IR imaging as the input for the photometric redshift codes. The sources and bands vary somewhat between fields but typically run from the U band through Spitzer/IRAC 8 μm, with F160W serving as the selection band for the photometric catalogs. Full descriptions of the catalogs for each CANDELS field are given in Nayyeri et al. (2017, COSMOS), Stefanon et al. (2017, EGS), Guo et al. (2013, GOODS-S), Galametz et al. (2013, UDS), and Barro et al. (2019, GOODS-N). In the interest of uniformity across fields, we only used broadband photometry. For instance, we did not utilize the medium-band photometry from the SHARDS survey (Pérez-González et al. 2013) for the GOODS-N field, so for certain use cases (e.g., environment) the higher-precision photo-z's from Barro et al. (2019) that use those bands will be preferable.

Individual Photometric Redshifts
Six groups within the CANDELS collaboration have used the TFIT photometric catalogs to estimate photo-z PDFs for each galaxy observed by CANDELS. For each galaxy, the codes fit a set of SED templates in redshift space and minimized χ² between the observed and template SEDs. To calculate a redshift posterior PDF, most codes combine a magnitude-dependent redshift prior for each object with a likelihood proportional to exp(−χ²/2). To measure photo-z's, the groups used different codes for the calculation, different template SEDs, and/or different criteria when using a code (e.g., different priors). In the following, we briefly explain each of the codes used to estimate photo-z PDFs in this paper.
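The posterior construction that most of these codes share (a χ²-based likelihood modulated by a magnitude-dependent prior) can be sketched as follows. This is an illustrative reimplementation, not code from any of the participating groups; the grids and the prior are placeholders supplied by the caller.

```python
import numpy as np

def posterior_pdf(chi2_grid, prior_grid, z_grid):
    """Convert a chi^2(z) curve into a normalized redshift posterior.

    chi2_grid  : chi^2 of the best-fitting template at each trial redshift
    prior_grid : magnitude-dependent prior evaluated on the same grid
    z_grid     : trial redshifts (assumed increasing)
    """
    # Likelihood proportional to exp(-chi^2 / 2); subtract the minimum
    # chi^2 before exponentiating for numerical stability.
    likelihood = np.exp(-0.5 * (chi2_grid - chi2_grid.min()))
    posterior = likelihood * prior_grid
    # Renormalize with the trapezoidal rule so the PDF integrates to one.
    norm = np.sum(0.5 * (posterior[1:] + posterior[:-1]) * np.diff(z_grid))
    return posterior / norm
```

With a flat prior the posterior peaks at the minimum-χ² redshift; an informative prior reweights low-volume or implausibly luminous solutions, as described for the individual codes below.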
EAZY (Brammer et al. 2008; http://www.astro.yale.edu/eazy/) fits a linear combination of stellar population templates to the observed U-to-8 μm SEDs (in flux space) by minimizing the χ² statistic. Corrections for absorption by intervening H I clouds are applied following Madau (1995). EAZY allows down-weighting of data at rest-frame wavelengths ≥ 2 μm via a template error function. It incorporates a "luminosity function × volume" prior that reduces the probability of low redshifts (due to the small volume) and of high redshifts for objects with bright apparent magnitudes. In addition, an iterative application of photometric zero-point offsets improves the match to the available spectroscopic redshifts.
EAZY was run by two separate groups, led by S. Finkelstein and S. Wuyts, respectively. The two groups differed in their choices of parameters and priors. The Wuyts group used the "EAZY_v1.0_lines" templates, which are identical to the original "EAZY_v1.0" templates (based on the PÉGASE stellar population synthesis models; Fioc & Rocca-Volmerange 1997) but with additional emission lines added following the prescription by Ilbert et al. (2008). The six templates contain a subset of five "principal component" templates constructed as described by Brammer et al. (2008), complemented by an additional template to account for dusty galaxies that do not appear in the semianalytical models that the principal component set was based on. The Finkelstein group used the "EAZY_v1.1_lines" templates, whose first six templates are the same as the templates from the "EAZY_v1.0_lines" set except that all of the PÉGASE emission lines were replaced using the Ilbert et al. (2008) prescription. It also has an additional template to account for massive old red galaxies.
zphot (Giallongo et al. 1998; Fontana et al. 2000) is a code that minimizes the χ² statistic when fitting template SEDs. It returns the best-fitting values and estimated photo-z uncertainties as well as other parameters of the galaxy template (e.g., age, metallicity, and dust extinction). zphot accepts both fluxes and magnitudes as input, with a rigorous treatment of nondetections in the latter case. The code allows for a minimum photometric error to be set in each photometric band and optionally applies a cut to avoid unrealistically large negative fluxes. For the templates, we used the PÉGASE 2.0 models (Fioc & Rocca-Volmerange 1997), which assumed the stellar initial mass function from Rana & Basu (1992), because they yielded the most accurate photo-z's with the set of parameters explored. Emission lines were not included in the templates. Our adopted template library was constructed with a combination of SFHs. Specifically, we used constant and truncated SFHs as well as an SFH with the star formation rate proportional to the amount of gas. Metallicity evolution is followed self-consistently during the galaxy evolution. We implemented dust extinction with a Calzetti et al. (2000) extinction law with E(B−V) ranging from 0 to 1. Finally, to avoid contamination from nonstellar emission, the fits exclude IRAC bands that probe rest frame λ > 5.5 μm (ch3 at z ≤ 0.15 and ch4 at z ≤ 0.6). This code was run by the group led by A. Fontana.
HyperZ (Bolzonella et al. 2000; http://webast.ast.obs-mip.fr/hyperz/) performs fits by minimizing the χ² statistic in flux space while excluding negative fluxes. The user can specify shifts to magnitudes. To avoid unrealistic solutions, a prior on the F160W-band absolute magnitude in the range −30 < M < −9 (Vega mag) was used. Reddening was included in the form of the Calzetti et al. (2000) reddening law, with A_V varying from 0 to 3 in steps of 0.2. SED templates are based on the Maraston (2005) models with exponentially declining SFHs (with e-folding times τ = 0.1, 0.3, 1, and 2 Gyr) at four metallicity values (1/5, 1/2, 1, and 2 × solar metallicity). The allowed age range of the templates was restricted to between 0.1 Gyr and the age of the universe at the given redshift of the galaxy in question. Finally, the option for a minimum photometric error was set to 0.05 magnitudes. This recipe was introduced in Pforr et al. (2013). This code was run by the group led by J. Pforr.
LePhare (Arnouts & Ilbert 2011; http://www.cfht.hawaii.edu/~arnouts/LEPHARE/lephare.html) minimizes the χ² statistic when fitting template SEDs, in both magnitude and flux space. It has the capability to include luminosity priors, to add extra photometric errors, to use a training sample to optimize the template SEDs and to derive zero-point offsets, and to account for contributions from emission lines (Ilbert et al. 2006, 2009). The median values from the photo-z PDFs are also provided. Here, a prior on the optical absolute magnitude in the range −24 < M < −8 is used, while IRAC ch3 and ch4 are excluded. The SED template set was a combination of the Bruzual & Charlot (2003, BC03) and Polletta et al. (2007) templates. This code was run by the group led by M. Salvato.
WikZ (Wiklind et al. 2008) fits SED templates to observed photometry in flux space by minimizing the χ² statistic. The code can be run with two different parameterized SFHs, an exponentially declining star formation rate or a delayed SFH. In the current application, the code was run using a delayed SFH with the BC03 SED templates. Negative fluxes are not completely excluded: they add to the χ² when the template flux is brighter than the 1σ upper limit, but are excluded when the template flux is fainter than the 1σ upper limit. Additionally, the code excludes IRAC ch3 and ch4 for z < 0.5 and z < 0.7, respectively. Finally, it has the capability to add extra smoothing errors to the photometry. This code was run by the group led by T. Wiklind.

Spectroscopic and 3D-HST Grism Redshifts
Spectroscopic redshifts are used to train photometric redshift codes (e.g., by identifying zero-point offsets in observed-frame photometry that, if removed, will improve fits) and to test the accuracy of the photo-z estimates. The six photo-z participant groups used a training set of 5807 high-quality spectroscopic redshifts spanning all five CANDELS fields. The same set of redshifts is used to recalibrate the PDFs in Section 3.
Our primary testing set consists of 4089 high-quality spectroscopic redshifts drawn from a variety of sources completely independent of those included in the training set, with any overlapping objects removed. We use this set to assess the performance of point statistics (e.g., the redshift of maximum probability) and to test the quality of photo-z PDFs from each code, both before and after the optimization procedure described below.
We also use 3D-HST grism redshifts (grism-z's; Momcheva et al. 2016) to test the photo-z point estimates and PDF quality. This set consists of 3367 of the highest-quality redshifts spanning all of the CANDELS fields. The grism-z's were determined spectrophotometrically with EAZY, but run by other investigators with different priors and using different photometry. To mitigate the impact of incorporating photo-z information in the grism redshift fits, we restricted the sample to objects where the spectrum, not the photo-z, drove the fit. We also include only objects with redshifts z_grism > 0.6. Figure 1 and Figure 2 show the H-band magnitude as a function of redshift for the spectroscopic sample, as well as the marginal histograms for each of the five CANDELS fields separately. Table 6 gives details regarding the construction of these three sets, including references to the original catalogs and the specific cuts applied to select only the highest-quality, most secure redshifts.
Due to the order in which spec-z's were obtained, the objects in the training sets used to tune the photometric redshift methods differ in both galaxy properties (such as brightness) and redshift coverage from the testing and 3D-HST sets, as illustrated in Figure 1 and Figure 2. This difference is greatest in the COSMOS and EGS fields. As a result, the performance metrics derived using these independent data sets are expected to be worse than would be measured using a random training/testing split drawn from a common population. In fact, the objects for which we estimate photo-z's extend to fainter magnitudes and higher redshifts than the training sets, so this more pessimistic assessment is likely more realistic for the broader population, though we cannot exclude the possibility that photo-z errors are worse in regimes for which we have no spectroscopy.

OPTIMIZATION METHODS
The photo-z PDFs produced by the different groups are not analogous to true probability distributions that meet the statistical definition of a PDF (see, e.g., Dey et al. 2021, 2022b). Given the high likelihood that the spectroscopic redshifts used here are correct (>95%), we can treat them to first order as representing the true redshifts of the objects observed. If the statistical definition of a PDF is being obeyed, then we would expect 68% and 95% of them to fall in the 68% and 95% credible intervals of the photo-z PDFs, respectively.
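This coverage test can be sketched via the probability integral transform (PIT): evaluate each object's photo-z CDF at its spec-z, and check what fraction of those values land inside the central credible interval. The function below is an illustrative sketch, not part of any group's pipeline; `pit_values` is assumed to have been computed beforehand.

```python
import numpy as np

def interval_coverage(pit_values, level):
    """Fraction of objects whose spec-z lies inside the central
    `level` credible interval of their photo-z PDF.

    pit_values : array of CDF values, one per object, where each is the
                 object's photo-z CDF evaluated at its spectroscopic z.
    """
    lo = (1.0 - level) / 2.0          # e.g. 0.16 for a 68% interval
    hi = 1.0 - lo                     # e.g. 0.84
    inside = (pit_values >= lo) & (pit_values <= hi)
    return inside.mean()
```

For well-calibrated PDFs, `interval_coverage(pit, 0.68)` should return approximately 0.68; values well below that indicate overconfident (too-narrow) PDFs.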
In an earlier test with CANDELS data, Dahlen et al. (2013) found that while some codes had better results than others, none of them performed well when the coverage of credible intervals was assessed in this way. This indicates that the photo-z PDFs must have substantial, qualitative problems. Simple examples of such issues would be the presence of bias due to template mismatch, such that the PDFs are shifted to higher or lower redshifts than expected, or inaccuracies in photometric error models that cause the PDFs to be too wide or too narrow.
In this section, we describe how we optimize the photo-z PDFs produced by each group for each CANDELS field. We correct for (1) systematic shifts in the PDFs and (2) systematically inaccurate PDF widths.

Q–Q Plot and Metrics
In this work, we make use of a set of statistics related to the Quantile-Quantile (Q–Q) plot to optimize the recalibration of the photo-z PDFs produced by each group. While these statistics prioritize adjustments that improve agreement with the statistical definition of a PDF, they should also deliver more accurate redshift point values and error estimates by reducing biases. Figure 3 depicts the Q–Q plot for a set of objects from the spec-z training set with Finkelstein photo-z's as an illustrative example. This plot shows Q_Data, the fraction of the time that the spectroscopic redshift is below the redshift corresponding to a given quantile in the photo-z cumulative distribution function (CDF), as a function of the chosen value for the quantile, Q_Theory. By construction, the value of Q_Theory must range from 0 to 1, just as the CDF does (e.g., Q_Theory = 0.345 corresponds directly to the CDF value of 0.345). In our example, Q_Data(0.345) will be the fraction of the time that the spectroscopic redshift of an object is below the redshift where the CDF value for that object is Q_Theory = 0.345. If the PDFs satisfy the statistical definition, Q_Theory will be the same as Q_Data (apart from small variations due to sampling error), so the Q–Q plot will hew closely to the unit line between (0, 0) and (1, 1). To help in interpreting the Q–Q plots shown in this paper, Figure 4 demonstrates how systematic shifts or incorrect widths of PDFs will manifest in them.

[Figure 1/2 caption: H-band magnitude as a function of redshift for objects with spectroscopic redshifts (divided into training and testing sets) and 3D-HST grism redshifts for the COSMOS, EGS, and GOODS-N fields. The redshift and magnitude ranges of the testing and 3D-HST data sets strongly differ from the training set; as a result, they provide highly independent assessments of the quality of photo-z PDFs, which are optimized based upon the training set.]
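In terms of per-object PIT values (each object's photo-z CDF evaluated at its spec-z), Q_Data(q) is simply the fraction of PIT values below q. A minimal sketch of the empirical Q–Q curve, assuming the PIT values have already been computed:

```python
import numpy as np

def qq_curve(pit_values, q_theory):
    """Empirical Q-Q curve: for each theoretical quantile q, return the
    fraction of objects whose spec-z falls below the redshift at
    quantile q of their own photo-z CDF (i.e., PIT < q)."""
    pit_sorted = np.sort(pit_values)
    # Count PIT values at or below each q via binary search.
    counts = np.searchsorted(pit_sorted, q_theory, side='right')
    return counts / pit_sorted.size
```

Calibrated PDFs give Q_Data ≈ Q_Theory; PDFs that are systematically biased low pile the PIT values near 1, pushing the curve below the unit line.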
We use the normalized ℓ2-norm to quantify the deviation of the Q–Q plot from the identity line, assessing how close a set of PDFs is to the ideal case. The quantity ΔQ = Q_Data − Q_Theory, evaluated at a particular value of Q_Data, corresponds to the gap between the Q–Q curve for a given sample and the unity line (working as a function of Q_Data has computational advantages). The ℓ2-norm then corresponds to the square root of the sum of this distance squared, (∫ (ΔQ)² dQ_Data)^{1/2}. This distance will be zero in the ideal case.
For this analysis, we only include galaxies whose spectroscopic redshift lies between the 0.0001 and 0.9999 quantiles of their CDF in the construction of the Q–Q plot and in the calculation of the ℓ2-norm. Objects outside that range correspond to catastrophic outliers (as commonly occur in photometric redshift analyses) whose true redshift lies beyond the range that would be expected based on their estimated PDF. We do not want their distribution to affect the recalibration of PDFs for the much more common nonoutlier objects. Excluding them also makes our analysis considerably more robust to the possibility of erroneous spectroscopic redshifts. Hence, in the end, we calculate a normalized version of the ℓ2-norm by summing up the value of (ΔQ)² at the Q_Data values corresponding to each object in the sample (excluding those outside these CDF bounds), dividing by the total number of (ΔQ)² values that were summed, and then taking the square root.
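The normalized ℓ2-norm just described can be sketched as follows. This is an illustrative reimplementation: `pit_values` are each object's photo-z CDF evaluated at its spec-z, and the outlier bounds are the 0.0001/0.9999 quantiles from the text.

```python
import numpy as np

def normalized_l2(pit_values, lo=1e-4, hi=1.0 - 1e-4):
    """Normalized l2-norm of the Q-Q deviation, excluding catastrophic
    outliers whose spec-z falls outside the (lo, hi) CDF quantiles."""
    pit = np.sort(pit_values[(pit_values > lo) & (pit_values < hi)])
    n = pit.size
    # Empirical Q_Data at each retained object, and the corresponding
    # deviation from the unit line, Delta_Q = Q_Data - Q_Theory.
    q_data = np.arange(1, n + 1) / n
    delta_q = q_data - pit
    return np.sqrt(np.mean(delta_q ** 2))
```

Well-calibrated PDFs yield values near zero; systematically shifted or mis-sized PDFs inflate the statistic.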
The second basic statistic we use to characterize the quality of PDFs from a given algorithm is the fraction of objects for which the spectroscopic redshift has a CDF value less than 0.0001 or greater than 0.9999. We label this quantity f_OP (for "fraction outside of PDF"). This statistic roughly corresponds in its nature to the catastrophic outlier rate used to characterize the performance of point estimates from photometric redshift algorithms, but is based on PDF outputs rather than scalar quantities.
To understand possible redshift- and magnitude-dependent biases among the PDFs generated by different participant codes, we analyze the normalized N(z) in a fixed magnitude bin. In Figure 5, we show the set of summed PDF curves for EGS objects in the magnitude bin centered at H = 25 predicted by each group, both before and after the optimization procedures (described below) have been applied. Even after optimization, the predicted number of objects at low redshifts varies greatly from group to group. Results in this regime may have limited reliability. In Figure 6, we show a set of images constructed from curves such as those shown in Figure 5. These images show redshift as a function of magnitude (using the center of the magnitude bin used to construct the summed PDF curves). The intensity of the color scale indicates the value of the summed PDF at a given magnitude and redshift (corresponding to the y-values in Figure 5). Although all of the codes used to produce the photo-z PDFs yield good results when evaluated with standard point statistics for objects with spectroscopic redshifts (as described below), and all use the same photometry as inputs, the redshift distributions as a function of magnitude from each group differ in many details, even after optimizing the individual photo-z PDFs such that they better obey the statistical definition. We find that photo-z PDFs from different groups show divergent behavior at z < 0.3. Therefore, we exclude these objects from our further analysis.

[Figure 3 caption: Q_Data is the fraction of objects for which the spectroscopic redshift is below the redshift corresponding to that photo-z quantile in the CDF for that particular object. For instance, if the spectroscopic redshift is below the 50th percentile point in the photo-z CDF for 64.7% of objects, then (0.5, 0.647) is a point (red circle) along the Q–Q curve (blue line). If photo-z PDFs obey the statistical definition of a PDF, Q–Q curves will lie along the solid gray reference line from (0, 0) to (1, 1). Their deviation is quantified by the square root of the sum of the vertical distance (pink line) at each point squared (ℓ2-norm). The ℓ2-norm is normalized by dividing by the square root of the number of objects. Additionally, f_OP gives the fraction of objects with spec-z's that fall outside of the main region of the corresponding PDFs.]

Correcting for Global Shifts
Using the Q–Q statistics defined above, we can identify any aggregate bias in the photo-z PDFs of a given group by applying negative or positive shifts in the redshift direction (Δz) to the PDFs. Our sign convention is such that negative values correspond to shifts to the left on a plot of PDF(z), such that the PDFs will peak at lower values of redshift, and vice versa. We apply shifts corresponding to an array of 151 equally spaced values over the interval [−0.5, 1.0], and construct the Q–Q curve in each case after removing objects with spec-z's outside their PDFs. We then identify the shift value that minimizes the normalized ℓ2-norm; that is, the value that yields a Q–Q curve as close as possible to the ideal unit line. To do this, we interpolate the normalized ℓ2-norm values using a quadratic univariate spline (using the scipy.interpolate.UnivariateSpline routine with parameters k = 2 and s = 1). The goal of this interpolation is greater accuracy and smoother results.

[Figure 5 caption: Sum of the photo-z PDFs for all objects in a magnitude bin of width 0.5 centered at H = 25 mag in the EGS field. The curves correspond to the expectation value for the redshift distribution in this bin. We exclude objects with missing PDFs for any of the groups from the sums to ensure that all curves are equivalent. The disagreement between groups is greatest at low redshifts (below z = 0.3, indicated by the gray dashed line). Because of this divergence, we exclude objects with z ≤ 0.3 from the calculation of Q–Q statistics. Although all codes deliver good performance when PDF peak redshifts are compared to spec-z's, the aggregate predictions for broad galaxy samples differ significantly from code to code.]
In Figure 7, we present the dependence of the normalized ℓ2-norm on Δz for each group and CANDELS field. Photo-z's from some groups exhibit larger biases than others, with the bias varying from field to field; this may be due to differences in the imaging bands included in each field and the target selection function of the spectroscopic surveys carried out in those fields. While the required shift for the COSMOS field is the smallest for four of the groups, we did not include all of the available medium- and narrow-band photometry for this field (which would lead to improved photo-z estimates) in the interest of uniformity across fields, so we expect that this feature may be due to the spectroscopy available in the COSMOS field. With only one exception (the Pforr results for the GOODS-South field), the shifts are all positive. This indicates that for almost all fields and groups, the photo-z PDFs peak at lower values than they should, biasing point estimates (such as the redshift of the PDF peak) low. Table 1 gives the optimal shift values that should be applied to the photo-z PDFs from each group and each field to yield better performance in the Q–Q plot. We will assess the level of improvement achieved in both Q–Q statistics and point estimates in the sections to follow, using the independent testing set of spectroscopic redshifts as well as 3D-HST.
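The shift search can be sketched as follows, with scipy's UnivariateSpline smoothing the scanned ℓ2 values before the minimum is read off a fine grid. This is an illustrative sketch: the scan grid mirrors the 151 values over [−0.5, 1.0] described above, and `l2_values` would come from recomputing the normalized ℓ2-norm at each trial shift.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def best_shift(shifts, l2_values):
    """Return the shift Delta z minimizing the normalized l2-norm,
    after smoothing the scanned values with a quadratic spline."""
    spline = UnivariateSpline(shifts, l2_values, k=2, s=1.0)
    # Evaluate the spline on a fine grid and take its minimum.
    fine = np.linspace(shifts.min(), shifts.max(), 5001)
    return fine[np.argmin(spline(fine))]
```

The spline interpolation lets the recovered minimum fall between the coarsely scanned shift values, which is the "greater accuracy and smoother results" goal stated above.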

Correcting PDF Width
To correct for either overestimation or underestimation of photo-z errors, we stretch or sharpen the PDF widths. We do this by raising a PDF to a power of 1/α and then renormalizing the PDF to integrate to one. For a Gaussian PDF, this is equivalent to multiplying the σ parameter by α^{1/2}, and so provides a way to correct for either overestimates or underestimates of errors; however, the procedure can be applied to any PDF, whether it is Gaussian or not. Optimal values of the α parameter that are close to unity imply that little change is needed in the width of the PDFs. Conversely, when the optimal values differ significantly from unity, large differences in the widths of the PDFs are implied (corresponding to significantly overconfident or underconfident error models).
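In code, this width transformation amounts to an element-wise power followed by renormalization; a minimal sketch (the helper name `stretch_pdf` is ours, and a uniform grid spacing of Δz = 0.01 is assumed):

```python
import numpy as np

def stretch_pdf(pdf, alpha, dz=0.01):
    """Raise a tabulated PDF to the power 1/alpha and renormalize so it
    integrates to one.  alpha > 1 broadens the PDF (correcting underestimated
    errors); 0 < alpha < 1 sharpens it.  For a Gaussian input this multiplies
    sigma by alpha**0.5."""
    p = pdf ** (1.0 / alpha)
    return p / (p.sum() * dz)
```

For example, applying alpha = 4 to a Gaussian of width σ = 0.1 yields an effective width of 0.2.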
We determine optimal values of α after applying the optimized δz shifts determined by the preceding analysis. We consider α values in the range 0 < α < 1 as well as in the range α ≥ 1, corresponding to exponents 1/α > 1 and 0 < 1/α ≤ 1, respectively. The former range corresponds to making photo-z PDFs sharper, since high peaks of the PDF become higher while low PDF values in the tails or valleys between peaks become smaller. Conversely, α ≥ 1 stretches the photo-z PDFs into broader shapes, since the contrast between the highest peaks and low values of the PDF is reduced.
For each group's PDFs in a particular field, we search a coarse array of α values spanning [0.05, 7], construct the Q-Q plot in each case, and evaluate the normalized ℓ2-norm for each value of α. We then use a finer grid of α values focused around the coarse-grid value that minimizes the ℓ2-norm to obtain improved precision in the optimal value. Figure 8 shows the dependence of the ℓ2-norm on α for each group and CANDELS field. The Wuyts PDFs are the best calibrated, with all fields having best-fit α values near unity. The Salvato PDFs show moderate scatter around α = 1, with some fields having PDFs that are too narrow and some too wide. The Finkelstein and Pforr PDFs are consistently too wide and too narrow, respectively. The Fontana and Wiklind PDFs demonstrate larger field-to-field scatter, with some fields having PDFs that are too narrow and some too wide. In Table 2, we list the α values required to optimize the raw photo-z PDFs such that they most closely match the statistical definition.

Table 1. The optimal values of the δz shifts for the PDFs from each group for each field, determined by identifying the minima in the curves shown in Figure 7. Most of the shifts are positive, indicating that the PDF(z) curves need to be shifted to the right. Results from some groups exhibit larger biases than others, with Salvato and Wuyts needing the smallest shifts and Wiklind the largest, especially in the EGS field.

Figure 6. Summed PDFs as a function of magnitude and redshift, before and after the photo-z optimization procedures, for the EGS field. The color scale of the image indicates the summed and renormalized probability that an object in a particular magnitude bin is at the selected redshift, equivalent to the y-axis in Figure 5. Objects with missing PDFs for any of the groups are excluded from the sums. The shape and locations of the bright regions (which should correspond to redshifts where there is an excess of objects due to sample/cosmic variance), as well as other detailed features, differ significantly from group to group, even though all groups used identical photometry for the same set of objects to calculate the PDFs used here.

Evaluation Using Q-Q Statistics
After identifying the optimal values of the δz and α parameters using the training set of spectroscopic redshifts and their corresponding photo-z PDFs, we can apply the corresponding transformations to all of the PDFs from a particular group and field. We can then construct Q-Q plots and evaluate the overall normalized ℓ2-norm and out-of-PDF fraction f_op for each group, evaluated with the training data (for which the results should be biased low) as well as with the independent testing set of spectroscopic redshifts and the 3D-HST grism redshifts. Since some fields have relatively few testing redshifts available, we combine the objects from all five CANDELS fields to make the Q-Q plots. We present the Q-Q plots for each group in Figure 9. In each panel, dashed curves correspond to the Q-Q plots for each spectroscopic data set before applying the shift and power transformations, and solid curves show the Q-Q plots after the transformations. It is clear that all curves more closely follow the diagonal reference line after optimization.
In Figure 10, we show the normalized ℓ2-norm and f_op values before and after optimization of the PDFs. Although in a few cases the f_op value becomes slightly larger after optimization, gains in the normalized ℓ2-norm are substantial and ubiquitous. This is to be expected for the training set, where the optimized ℓ2-norm values are lowest: by construction, our optimization chooses the transformation parameter values that make that quantity as small as feasible. However, it is very encouraging that substantial improvements are also found for the testing and 3D-HST data sets, which have very different coverage in magnitude-redshift space from the training set. We note that the Wiklind analysis produced significantly more objects with missing PDFs, which were not treated as out-of-PDF here. Thus, the low values of f_op for that analysis should be considered lower limits on what the performance would be for the full set of objects.

Evaluation Using Point Statistics
Although our primary focus is on producing accurate photometric redshift probability distributions, we can also evaluate the performance of point (or summary) statistics for the photometric redshifts of CANDELS objects before and after optimization. We focus on two different point statistics constructed from the photo-z PDFs: the most probable redshift (which we label z_peak, as it corresponds to the redshift where the highest peak of the photo-z PDF occurs), and the probability-weighted expectation value of the redshift (z_weight). Following Dahlen et al. (2013), we compute the latter quantity using only the region surrounding the highest peak of the PDF (specifically, the redshift range within that peak where the PDF value remains above 0.05 × PDF(z_peak)).
To determine this range, we first linearly interpolate the PDF onto a grid of 64,001 redshifts in place of the original 1001, then use cubic spline interpolation on this finer grid to find the redshifts around z_peak where the PDF equals 0.05 × PDF(z_peak). If the PDF values are never lower than 0.05 × PDF(z_peak), then the bounds used to calculate z_weight are the same as the bounds of the grid on which the PDFs are defined. In general, the Dahlen et al. (2013) definition of z_weight yields better results than the overall expectation value of the redshift evaluated over the full PDF, as in cases where PDFs have multiple peaks, the overall expectation value will often lie between peaks in regions of negligible probability.
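The z_weight calculation can be sketched as follows. This is our own simplified version (the name `z_weight` is ours): it walks outward from the peak on the refined grid using linear interpolation only, in place of the cubic-spline root finding described above.

```python
import numpy as np

def z_weight(z_grid, pdf, frac=0.05, n_fine=64001):
    """Probability-weighted mean redshift over the region around the highest
    PDF peak where the PDF stays above frac * PDF(z_peak)."""
    fine_z = np.linspace(z_grid[0], z_grid[-1], n_fine)
    fine_p = np.interp(fine_z, z_grid, pdf)        # refine the 1001-point grid
    i_peak = int(np.argmax(fine_p))
    thresh = frac * fine_p[i_peak]
    lo = i_peak
    while lo > 0 and fine_p[lo - 1] >= thresh:     # walk down the left flank
        lo -= 1
    hi = i_peak
    while hi < n_fine - 1 and fine_p[hi + 1] >= thresh:
        hi += 1
    z_sel, p_sel = fine_z[lo:hi + 1], fine_p[lo:hi + 1]
    return float(np.sum(z_sel * p_sel) / np.sum(p_sel))
```

For a bimodal PDF with a dominant peak, this estimator stays on the dominant peak, whereas the full-PDF expectation value is dragged into the low-probability region between the peaks.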
We use two quantities to evaluate the accuracy of these point statistics before and after optimization. The first is the normalized median absolute deviation (σ_NMAD) of the differences between photo-z point estimates and spectroscopic redshifts for the corresponding objects, defined by

$$\sigma_{\rm NMAD} = 1.48 \times {\rm median}\left( \frac{ |\Delta z - {\rm median}(\Delta z)| }{ 1 + z_{\rm spec} } \right),$$

where Δz = z_phot − z_spec is the difference between a point estimate of the photometric redshift for an object and its spectroscopic redshift. The normalizing factor of 1.48 in the definition of σ_NMAD causes its expectation value to equal the standard deviation for data drawn from a normal distribution. In addition to σ_NMAD, we track the fraction of catastrophic outliers in a particular group's results, f_co, as another useful statistic for characterizing the photo-z quality from each group. We define catastrophic outliers as those objects that fulfill the condition

$$\frac{ |\Delta z| }{ 1 + z_{\rm spec} } > 0.15.$$

Here the limit 0.15 is somewhat arbitrarily chosen; we will show below that typical photometric redshift errors for CANDELS objects with spectroscopic redshifts are ∼ 0.03(1 + z), so the threshold of 0.15 approximately corresponds to 5σ outliers. We find that σ_NMAD is smaller when computed using z_weight rather than z_peak; that is, z_weight provides a superior estimate of the redshift of an object. Therefore, we only show results for statistics computed using z_weight in the remainder of this paper, although we still include z_peak in our final photo-z catalogs.

Figure 9. Quantile-Quantile (Q-Q) plots for photo-z PDFs both before ("original," dashed curves) and after ("optimized," solid curves) the optimized transformations have been applied. The Q-Q plot shows the CDF value evaluated at the spectroscopic redshift of an object, q_data, corresponding to a given quantile of the set of CDF values, q_theory. For instance, if the 30th percentile in the set of CDF(z_spec) values is 0.21, then (0.3, 0.21) would be a point along the Q-Q curve. If photo-z PDFs obey the statistical definition of a PDF, Q-Q curves will lie along the unit line from (0, 0) to (1, 1). We evaluate the results from each group separately using the independent training and testing sets of spectroscopic redshifts as well as the 3D-HST grism redshifts; results from all five CANDELS fields are combined here. It is clear that our optimization methods improve the results for each group and for every set of redshifts. The normalized ℓ2-norm used to evaluate PDF accuracy corresponds to the RMS deviation between a Q-Q curve and the diagonal in the q_data direction in these plots.

Figure 11 shows the scatter plots of z_weight vs. z_spec for all six groups, using only objects belonging to the training set of spec-z's. The z_weight values are calculated using the optimized version of the photo-z PDFs. In the same plots we present σ_NMAD, the percentages (and numbers of objects, in parentheses) of catastrophic outliers, and the percentage (and number of objects, in parentheses) of objects that have missing photo-z PDFs and therefore missing z_weight values. Below each scatter plot, we show the dependence of Δz/(1 + z_spec) on z_spec, where Δz = z_weight − z_spec. In the same plots we also present the redshift dependence of σ_NMAD (dashed gray line) and of the Hodges-Lehmann mean of Δz/(1 + z_spec) (solid gray line), where bins of 0.5 in z_spec have been used for their construction. These quantities generally do not vary significantly with redshift, apart from the spikes observed at z_spec ∼ 5.5, where the number of objects becomes very low. Figure 12 plots the σ_NMAD values and catastrophic outlier fractions for the photometric redshifts from each group, calculated using the PDFs both before and after optimization has been applied; results for each spectroscopic sample are shown in separate panels.
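The two point-estimate metrics described above can be computed as in the following sketch (the helper name `point_stats` is ours):

```python
import numpy as np

def point_stats(z_phot, z_spec, outlier_cut=0.15):
    """Return (sigma_NMAD, catastrophic-outlier fraction) for photo-z point
    estimates compared with spectroscopic redshifts."""
    z_phot, z_spec = np.asarray(z_phot), np.asarray(z_spec)
    dz = z_phot - z_spec
    # 1.48 scales the MAD to match the standard deviation for Gaussian errors
    sigma_nmad = 1.48 * np.median(np.abs(dz - np.median(dz)) / (1.0 + z_spec))
    f_co = float(np.mean(np.abs(dz) / (1.0 + z_spec) > outlier_cut))
    return sigma_nmad, f_co
```

For Gaussian photo-z errors of width 0.02(1 + z), this returns σ_NMAD ≈ 0.02 and a negligible outlier fraction, since the 0.15 cut sits at 7.5σ.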
In general, recalibrating the PDFs significantly improves the scatter between photo-z's and spectroscopic redshifts, while improving or only negligibly degrading the outlier fraction. The Wiklind analysis produced significantly more objects without any point estimates, which were treated as catastrophic outliers in this case. Consequently, the high values of f_co for that analysis should be considered upper limits on what the performance would be for the full set.
For objects belonging to the training or testing sets, the outlier fraction is similar for all of the different groups (roughly 8%), while it ranges from 2% to 5% for objects with 3D-HST grism redshifts. There are two possible explanations for this discrepancy. One is that a significant (∼ 5%) fraction of the objects in the training and testing samples have incorrect redshifts assigned to them (potentially due to incorrect matches between the spectroscopic target and the CANDELS galaxy). The other possibility is that the smaller outlier fraction for 3D-HST is artificially low because the grism redshift fits incorporate photo-z information, though we limited the comparison to objects whose grism spectrum dominated the redshift fit in order to mitigate this possibility.
Finally, we compute the fraction of objects that are both out-of-PDF and catastrophic outliers (f_op & f_co). These objects are the most problematic because they have both a poorly estimated photo-z and a poorly characterized photo-z uncertainty. Generally, around 3% of objects fall into this category across codes (excluding the Wiklind results, which should be viewed as lower limits because their high missing-PDF rate drives a low out-of-PDF fraction).

METHODS FOR COMBINING PROBABILITY DENSITY FUNCTIONS FROM MULTIPLE CODES
Previous works have found that combining results from multiple photometric redshift codes can yield superior performance over individual codes when testing with spec-z's; in such comparisons, combining the measurements contributed by multiple groups yielded a lower scatter than any single code.

Figure 11. Scatter plots of z_weight vs. z_spec (upper panels), using the optimized version of the photo-z PDFs for the calculation of z_weight from the six participating groups. σ_NMAD, the outlier percentage, and the percentage of objects with missing photo-z PDFs (and therefore missing z_weight) are listed, along with the actual numbers of objects (given in parentheses) for the latter two statistics. Additionally, Δz/(1 + z_spec) vs. z_spec (lower panels) is presented below each scatter plot. The gray dashed and solid lines correspond to σ_NMAD and the Hodges-Lehmann mean of Δz/(1 + z_spec), as a function of z_spec. Bins of 0.5 in z_spec have been used for the calculation of these two curves.

Figure 12. The normalized median absolute deviation between photometric and spectroscopic redshifts, σ_NMAD, as a function of the catastrophic outlier fraction, f_co, from each group both before (open circles at arrow bases) and after (filled circles at arrow heads) optimization. The three panels show these statistics evaluated using the training set of spectroscopic redshifts (top); the testing set (middle); and the 3D-HST grism redshifts (bottom). Results shown were calculated using the weighted-mean photo-z peak redshifts (z_weight) from each group. Results from Wuyts outperform the other groups in all cases, especially when evaluated using the 3D-HST data set. We caution that the 3D-HST redshift determination incorporates results from the EAZY code used by Finkelstein and Wuyts as part of the grism redshift determination process, so the level of agreement for them may be artificial.
Since we are not only interested in point values (such as z_weight or z_peak) but also desire accurate photo-z PDFs (or, equivalently, accurate error estimates), we require methods that can combine multiple PDFs. In this paper, we utilize two different methods for combining PDFs: the HB approach first presented in Dahlen et al. (2013), which produces an entirely new PDF constructed from the input PDFs (see also Duncan et al. 2018b, Duncan et al. 2018a, and Hatfield et al. 2022), and the minimum f-divergence method (mFD), which selects one of the input PDFs as being the most representative for a given object (much as the median of a set of numbers is the element of that set which best approximates its central value).

Hierarchical Bayesian Combination of Photometric Redshift PDFs
The HB approach introduced by Dahlen et al. (2013) combines PDF results from multiple codes based upon the assumptions that estimated PDFs for a given object are not entirely statistically independent (since they are based upon the same photometry) and that some may be inaccurate. This basic framework has a number of antecedents in the literature (Press 1997; Newman et al. 1999; Lang & Hogg 2012).
If we assume that the PDF from a given group is either informative about the true redshift of an object ("good") or uninformative ("bad"), the posterior probability for the redshift of an object from the ith code, P_i(z), can be described as

$$P_i(z) = P({\rm bad}) \, P(z \,|\, {\rm bad}) + \left[ 1 - P({\rm bad}) \right] P_i(z \,|\, {\rm good}),$$

where P(z|bad) is the PDF resulting from an uninformative measurement (which we take here to be a uniform probability distribution from z = 0 to z = 10); P(bad) is the probability that a randomly selected object has an uninformative measurement; and P_i(z|good) is the redshift PDF for a given object predicted by the ith photo-z code. We assume that P(bad) is the same for all objects, and hence it will be equal to the fraction of all PDF measurements that are uninformative, which we label f_bad. Therefore, for a given value of f_bad, the posterior PDF is given by

$$P_i(z, f_{\rm bad}) = f_{\rm bad} \, P(z \,|\, {\rm bad}) + (1 - f_{\rm bad}) \, P_i(z \,|\, {\rm good}).$$

For a given object, the Bayesian posterior combining the information of each participant PDF is given by

$$P(z, f_{\rm bad}) = \prod_{i=1}^{N} \left[ P_i(z, f_{\rm bad}) \right]^{1/n},$$

where we introduce a parameter n that provides a correction for the covariance between the different PDFs. If the PDFs were all statistically independent, the probabilities from each one would simply multiply, so n = 1. In contrast, if they were completely covariant, n would need to match the number of PDFs being combined, N, so that after multiplication and exponentiation P(z, f_bad) would match the result from a single PDF.
Based on tests with the CANDELS data, we find that the optimal value of n grows with the number of codes combined, yielding n = 1 for N = 1 and n = 2.1 for N = 5 (matching the results from Dahlen et al. 2013 when N = 5). This implies that the covariance between results from different participating groups is nonnegligible (since n > 1 is optimal in general) but also far from complete (since n < N).
We then marginalize over the fraction of bad measurements to obtain a PDF for redshift alone,

$$P(z) = \int_0^1 P(z, f_{\rm bad}) \, df_{\rm bad},$$

assuming a flat prior distribution for f_bad over [0, 1]. The result of this procedure is a PDF that matches the PDFs from each code when they agree, but includes extra probability at the redshifts predicted by each individual PDF when they disagree. Hence, HB photo-z constraints are appropriately degraded when the PDFs disagree, in a manner that is agnostic about which PDFs are most accurate.
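A compact sketch of the HB combination, marginalizing over f_bad on a discrete grid, is given below. The function name `hb_combine` and the 51-point f_bad grid are our own choices; the 0.01 redshift step matches the catalogs.

```python
import numpy as np

def hb_combine(pdfs, n_cov, dz_grid=0.01, n_fbad=51):
    """Hierarchical Bayesian combination of N photo-z PDFs: mix each input
    with a uniform 'uninformative' PDF with weight f_bad, multiply the
    mixtures with exponent 1/n_cov to allow for covariance between codes,
    and marginalize over a flat prior on f_bad."""
    pdfs = np.asarray(pdfs, dtype=float)        # shape (N, n_z)
    n_z = pdfs.shape[1]
    u = np.full(n_z, 1.0 / (n_z * dz_grid))     # uniform PDF over the z grid
    posterior = np.zeros(n_z)
    for f_bad in np.linspace(0.0, 1.0, n_fbad):
        mix = f_bad * u + (1.0 - f_bad) * pdfs  # broadcast over the N codes
        posterior += np.prod(mix ** (1.0 / n_cov), axis=0)
    return posterior / (posterior.sum() * dz_grid)  # normalize to unit integral
```

When one code disagrees with the rest, the output retains a bump of probability at the dissenting code's preferred redshift rather than discarding it entirely.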

Minimum f-divergence Approach
In addition to the hierarchical Bayesian method of combining PDFs, we also test a new approach: choosing the single PDF that minimizes a measure of the total distance to the other PDFs. For the distance calculations, we consider two measures that fall in the general class of f-divergence measures, calculated either by summing the absolute values of the differences between two PDFs (an ℓ1 distance) or by taking the square root of the sum of the squares of the differences (an ℓ2 distance), as measured at many equally spaced z values:

$${\rm FD}_{\rm abs} = \sum_i \left| P_1(z_i) - P_2(z_i) \right|,$$

$${\rm FD}_{\rm sqr} = \sqrt{ \sum_i \left[ P_1(z_i) - P_2(z_i) \right]^2 },$$

where z_i indicates the ith redshift point at which the distances are evaluated, and we use FD_abs and FD_sqr to label the ℓ1 and ℓ2 versions of the f-divergence, respectively. For two given curves P_1(z) and P_2(z) that are defined at the same set of discrete points, the f-divergence is found by combining the differences between the values of the curves at each ordinate point, as illustrated in Figure 13. In these calculations, we perform a discrete sum over redshift points, which should closely approximate the integral calculated using a continuous PDF. In our case, six groups have provided photo-z PDFs for each object in the CANDELS photometric catalogs.

Figure 13. An example illustration of the calculation of the f-divergence distance measures between two photo-z PDFs, P_1(z) and P_2(z), represented by the red and blue curves. The vertical purple hatching indicates the vertical difference between them at each point where they are defined, ΔP(z) = P_1(z) − P_2(z). We compute the f-divergence between the two curves in two ways: with an ℓ1 distance (i.e., the sum of the absolute values of the differences) and an ℓ2 distance (i.e., the square root of the sum of the squared differences).
We then calculate the sum of the f-divergences between each PDF and each of the other five results, yielding the total distance of that curve from the rest (in the case where the ℓ2 distance is used, we sum in quadrature). We repeat this procedure for each PDF for a given object and identify the one for which the total distance to the other PDFs is lowest; that is, the one with the minimum f-divergence (mFD) from the other PDFs. This curve will be the most similar to all the rest, and therefore provides a reasonable way to summarize the ensemble of PDFs. This construction is analogous to the use of the median value of an array as a summary statistic; the median minimizes the sum of the absolute values of the deviations from all points in an array.
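A sketch of the selection step (the helper name `mfd_select` is ours; it returns the index of the chosen input PDF):

```python
import numpy as np

def mfd_select(pdfs, metric="abs"):
    """Pick the PDF with the minimum total f-divergence from the others.
    metric='abs' sums |P_i - P_j| over the redshift grid (l1, 'mFDa');
    metric='sqr' uses sqrt(sum((P_i - P_j)**2)) per pair (l2, 'mFDs'),
    with the per-pair distances summed in quadrature."""
    pdfs = np.asarray(pdfs, dtype=float)
    n = len(pdfs)
    total = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = pdfs[i] - pdfs[j]
            if metric == "abs":
                total[i] += np.sum(np.abs(diff))
            else:
                total[i] += np.sum(diff ** 2)   # quadrature sum of l2 distances
    if metric != "abs":
        total = np.sqrt(total)
    return int(np.argmin(total))
```

With inputs peaking at z = 2.0, 2.1, 2.2, 2.3, and 4.0, the method picks one of the central curves and ignores the outlier at z = 4, much as a median would.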
The ℓ1 distance (FD_abs) is less affected by the largest excursions between two curves than the ℓ2 distance (FD_sqr), because the latter metric squares the differences before summation, giving more weight to larger deviations. As a result, the minimum f-divergence curve selected using an ℓ1 distance metric, which we label "mFDa" as it is based upon the sum of absolute values, is less sensitive to outlier curves than the one selected with an ℓ2 metric, which we label "mFDs" as it is based upon the sum of squared differences. The situation is analogous to the reason why the median statistic is more robust to outliers in a data array than the mean.

Figure 14. Optimized PDFs from all six participating photo-z groups, as well as the results from three PDF combination methods, for a CANDELS galaxy in the GOODS-North field (ID = 1184). The HB method (dark brown dashed curve) produces a new PDF by combining information from each input probability distribution and accounting for the possibility that some are inaccurate. The minimum f-divergence method identifies the PDF that has the smallest total f-divergence from the other curves. When the sum of absolute values of differences is used as a distance metric, the minimum f-divergence curve corresponds to the Salvato PDF for this object (mFDa, red dashed curve), whereas when the square root of the sum of the squares of separations is used to compute distances, the Fontana curve is selected (mFDs, blue dashed curve). The vertical gray dashed line indicates the galaxy's spectroscopic redshift, z_spec = 0.971, which is consistent with all PDFs shown.

Figure 14 illustrates the results of the hierarchical Bayesian and minimum f-divergence approaches for a galaxy in the GOODS-North field. In this figure, we present the PDFs from the six different groups, as well as the results of each combination method.
Whereas the hierarchical Bayesian method produces a completely new photo-z PDF that is distinct from all of the input curves, the minimum f-divergence method selects one of the original PDFs as the best representative for a given object. For this object, the mFDa and mFDs algorithms select different PDFs, but it is more common for them to agree than not.

Comparing Combination Methods
For each of the three methods of combining photometric redshift PDFs applied here, we also explore how the quality of results changes when only a subset of the groups' PDFs is used as input. We note that some groups' PDFs performed significantly better than others when evaluated using the ℓ2-norm, f_op, σ_NMAD, or f_co for both the testing set of spectroscopic redshifts and the 3D-HST grism redshifts. Given the possibility that some groups' PDFs are more useful for computing combined distributions than others, we have computed PDFs using the HB and mFD methods using only the subsets of groups that yielded the lowest σ_NMAD and f_co values (using the same set of groups' photo-z's for all objects). We use the label "6" when all six groups are used, "5" when the five best groups are used, "4" when the four best groups are used, and "3" when only the three best groups are used; hence, results labeled HB4 correspond to a hierarchical Bayesian combination of PDFs from the four best-performing codes.
In Figure 15, we plot the ℓ2-norm and f_op results for the HB, mFDa, and mFDs methods with the 3, 4, 5, and 6 best-performing codes, both before and after PDF optimization, averaging the results for these statistics over the testing set of spectroscopic redshifts and the 3D-HST set of grism redshifts. For the HB combination, HB3 produces the lowest ℓ2-norm. Note that f_op is zero for the HB combined PDFs by construction, as the uniform probability distribution used to model uninformative measurements yields a small amount of probability at all redshifts in the final HB combination. For the minimum f-divergence combinations, mFDa4 and mFDs4 have the lowest ℓ2-norm values and are our preferred combinations, even though mFDa6 and mFDs6 have lower f_op values. For all three combination methods, combinations that did not use all six codes produced the lowest ℓ2-norm values. Figure 16 takes the best-performing subset for each combination method (i.e., HB3, mFDa4, and mFDs4) and compares them for the training set, the spectroscopic testing set, and the 3D-HST testing set; it also shows the results from the individual groups for comparison. The mFDa4 and mFDs4 combinations perform similarly well and are clearly better than HB3 (and the individual groups) on the spectroscopic testing set. We prefer the mFDa4 combination because it has the lowest ℓ2-norm averaged across all three data sets, although it has slightly higher f_op values than the mFDs4 combination for all three data sets. We conclude that the minimum f-divergence PDF (with the absolute value as the distance metric) selected from the four highest-quality results comes closest to meeting the statistical definition of a PDF for the actual redshift of a galaxy, and is therefore the preferred combination technique for CANDELS when a consensus PDF is desired.
Although mFDa4 is our preferred combination method for PDFs, there are cases when the accuracy of point measures of the redshift is more desirable than the accuracy of the PDF. We therefore evaluate the accuracy of z_weight estimates from the same set of combination methods and subsets of input PDFs considered above, using the σ_NMAD and f_co statistics. We present the averaged results from the testing set of spectroscopic redshifts and the 3D-HST grism redshifts in Figure 17, combining objects from all CANDELS fields. Among the hierarchical Bayesian combinations considered, HB4 has the lowest σ_NMAD value; mFDa4 and mFDs3 yield the lowest scatters among the minimum f-divergence combinations. We compare the results for these three best cases against each other and the individual groups, using each set of redshifts separately, in Figure 18.

Figure 15. Average values of the normalized ℓ2-norm and out-of-PDF fraction f_op for each PDF combination method considered here, combining results from the testing and 3D-HST redshifts across all CANDELS fields. We show values both without (open symbols at arrow bases) and with (filled symbols at arrow heads) optimization of the PDFs from each group in advance of combination. The panels differ in the method of combination considered: "HB" indicates hierarchical Bayesian combination (top); "mFDa" indicates the minimum f-divergence curve selected using an ℓ1 (sum of absolute values) metric (middle); and "mFDs" indicates the minimum f-divergence curve selected using an ℓ2 (sum of squares) metric (bottom). The number following the combination type indicates the number of PDFs combined; e.g., in the HB3 case, we combine the PDFs from the best three groups (ranked according to their performance on point-value metrics). Within each type of combination, the lowest values of the ℓ2-norm are obtained using the optimized PDFs for HB3, mFDa4, and mFDs4, respectively.

Figure 16. Normalized ℓ2-norm and f_op for the lowest-norm PDF combinations of each type (HB3, mFDa4, and mFDs4), with results from individual groups shown for comparison (same as Figure 10). We show results for the training and testing sets of spectroscopic redshifts and the 3D-HST grism redshifts separately in each panel, but combine objects from all CANDELS fields. mFDa4 and mFDs4 produce similarly excellent results for all three data sets, give lower ℓ2-norm values than HB3 in every case, and outperform the individual groups on the testing set. We prefer the mFDa4 combination because it gives the lowest ℓ2-norm value averaged across all three data sets, indicating that the minimum f-divergence curve constructed from the four highest-quality PDFs comes closest to meeting the statistical definition of a probability density function for the actual redshift of a galaxy.

In every case, the hierarchical Bayesian combination of the best four PDFs, HB4, yields the lowest σ_NMAD; it is therefore the combination method that provides the most accurate point values of photo-z's, with scatter of ∼ 0.02(1 + z) or better for all three spectroscopic samples. In every case it outperforms the best individual group results, though the improvement over the best individual group (Wuyts) is small. Table 3 summarizes the performance of the optimized photo-z PDFs from each group and the PDFs produced by the best combination methods of each type (HB, minimum f-divergence computed with an ℓ1 metric [mFDa], and minimum f-divergence computed with an ℓ2 metric [mFDs]) for each of the three spectroscopic data sets (training, testing, and 3D-HST) across the five CANDELS fields. The method yielding the best performance for a given statistic is highlighted in blue, and the combination method with the best performance on average is indicated in red. The quantities used to evaluate the quality of the photo-z PDFs are the normalized ℓ2-norm between the Q-Q curve for a given set of PDFs and the unity line, and the fraction of spectroscopic redshifts that lie outside the photo-z PDF of their corresponding object (f_op, which is zero by construction for the hierarchical Bayesian combination method). To test the performance of each group and combination method for the weighted-mean peak redshift, z_weight, we use the normalized median absolute deviation (σ_NMAD) and the |Δz|/(1 + z) > 0.15 outlier fraction f_co. Finally, we list the fraction of objects for which a PDF file was not provided by a given group (fraction of missing files). Different numbers of PDFs (either the best three or the best four) yielded the best results for PDF statistics (where HB3, mFDa4, and mFDs4 proved superior) than for point statistics (where HB4, mFDa4, and mFDs3 were preferred).
The Finkelstein group's PDFs yielded the smallest average normalized ℓ2-norm, while the Wiklind PDFs had the smallest out-of-PDF fraction f_op (caused by objects with missing PDFs not being counted as out-of-PDF). If objects with missing PDFs were counted as out-of-PDF, then Salvato's PDFs would have had the smallest f_op among the individual codes. However, the mFDa4 combination method yielded only a marginally larger ℓ2-norm than Finkelstein while having a smaller f_op; we therefore recommend the use of this combination if the most accurate PDFs are desired.
Whereas individual groups' results yielded the best performance for some PDF quality statistics, the best point statistics were obtained using combinations of multiple photo-z PDFs. The HB4 combination (the hierarchical Bayesian combination of the four best PDFs) yielded both the smallest average σ_NMAD (0.022) and the lowest average outlier rate (5.1%). However, it is worth noting that the results from the Wuyts group taken on their own were almost as accurate.
The most dramatic failures of the photo-z codes are objects that are both out-of-PDF and catastrophic outlier point estimates. Generally, there is some overlap between the out-of-PDF objects and the catastrophic outlier objects, but there are also a considerable number of objects that are one or the other but not both. We note that the calculation of f_op, f_co, and f_op & f_co for the Wiklind analysis in particular is sensitive to the treatment of missing PDFs/photo-z's, due to the significantly higher missing fraction for Wiklind compared to the other codes. Figure 19 summarizes the procedure from the initial photo-z PDFs obtained from the six different groups ("Original") to the "Shifted" and "Optimized" versions of the individual photo-z PDFs, and then to the combined photo-z PDFs used to generate the final photometric redshift catalogs.

Table 3. Quantities used to assess the quality of the optimized photometric redshift PDFs and their point statistics. All values are multiplied by 100. The lowest average result among the combination methods is shown in red, and the lowest value in a given row is shown in blue (unless that entry is also the lowest average result for the combination methods). We note that the Wiklind analysis has significantly higher fractions of objects with missing photo-z PDFs and point estimates than the other codes, so its summary statistics are sensitive to the treatment of these objects. If objects with missing PDFs are counted as out-of-PDF, then Wiklind would no longer produce the lowest f_op and f_op & f_co for all three data sets; in fact, it would have the highest f_op for the training and testing sets (and an unremarkable f_op for the 3D-HST set). On the other hand, if objects with missing point estimates are not counted as catastrophic outliers, then Wiklind would produce nearly the best f_co for the training and testing sets (and an unremarkable f_co for the 3D-HST set).
We provide tabulated "Original," "Optimized," and combined photo-z PDFs and point estimates for all objects in the five CANDELS fields.

CANDELS PHOTOMETRIC REDSHIFT CATALOGS
First, we provide photo-z PDFs in a separate file for every CANDELS object (with the object identifier specified in the filename, e.g., ALL_OPTIMIZED_PDFS_GOODSN_ID00001.pzd). Each file has columns specifying the redshift; the PDF provided by each group after the optimization procedure from Section 3 has been applied; and the PDFs resulting from the two best combination methods, HB4 and mFDa4 (hence the term ALL in the filename). The PDFs cover the redshift interval [0, 10] with a step size of Δz = 0.01. The details of the columns included in these files are presented in Table 4. We also provide files with the original photo-z PDFs as provided by the six different groups before the optimization method was applied; the format of these files is exactly the same as that of the optimized PDFs, while the file name in this case is ALL_ORIGINAL_PDFS_GOODSN_ID00001.pzd. Additionally, we provide summary catalogs of objects in each CANDELS field. These catalogs contain photo-z point statistics constructed from the optimized PDFs and the best combinations, as well as spectroscopic and/or grism redshifts where available.

Figure 18. Comparison of the best photo-z point statistic results from each PDF combination method (HB4, mFDa4, and mFDa3) using σ_NMAD and f_co, both before (open symbols at arrow bases) and after (filled symbols at arrow heads) optimization. We also show results from Figure 12 for the individual groups. The three panels show these statistics evaluated using the training set of spectroscopic redshifts (top), the testing set (middle), and 3D-HST grism redshifts (bottom). In every case, the hierarchical Bayesian combination of the best four PDFs, HB4, gives the lowest σ_NMAD values for each set and outlier fractions comparable to or lower than those of other methods, making this the combination that provides the most accurate point estimates. That being said, the improvement over the Wuyts results is slight.
These catalogs include estimates of the 1σ and 2σ credible intervals for the photometric redshift constructed from the PDFs. The columns of these files are described in detail in Table 5. The catalogs are available at https://archive.stsci.edu/hlsp/candels.
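Credible intervals like those tabulated in the catalogs can be derived directly from a gridded PDF. A minimal sketch using central (equal-tail) intervals; the catalogs may use a different convention (e.g., highest-density intervals):

```python
import numpy as np

def credible_interval(z_grid, pdf, level=0.683):
    """Central credible interval from a PDF tabulated on a redshift grid.

    Builds the CDF by trapezoidal integration and inverts it at the
    (1 - level)/2 and (1 + level)/2 quantiles.
    """
    pdf = np.asarray(pdf, dtype=float)
    pieces = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(z_grid)
    cdf = np.concatenate([[0.0], np.cumsum(pieces)])
    cdf /= cdf[-1]
    lo = np.interp(0.5 * (1.0 - level), cdf, z_grid)
    hi = np.interp(0.5 * (1.0 + level), cdf, z_grid)
    return lo, hi
```

For a well-calibrated PDF, the spec-z should fall inside the 68.3% interval about 68.3% of the time, which is what the Q–Q diagnostics test.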

SUMMARY AND DISCUSSION
In this paper, we developed a technique to measure the photo-z PDFs and point estimates for galaxies in the CANDELS fields using the final photometric catalogs. Using this technique, we present photometric redshifts for over 150,000 galaxies. We began with probability density functions measured by six groups within the collaboration, each applying a variety of template-based methods to the same photometric catalogs. We determined the optimal shift and stretching/sharpening parameters for the PDFs from each group using statistics based upon the Q-Q plot, which we measured using the same training set of spectroscopic redshifts provided to each group to tune their photo-z algorithms.
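The shift-and-sharpen correction described above can be sketched as follows. This is a minimal illustration under the two-step scheme summarized in Figure 19 (a rigid shift in redshift, then raising to a power 1/γ); the parameter names dz and gamma are placeholders for the per-group, per-field values fit on the training set:

```python
import numpy as np

def optimize_pdf(z_grid, pdf, dz=0.0, gamma=1.0):
    """Two-step PDF correction: rigid shift by dz, then raise to the
    power 1/gamma (gamma > 1 broadens an overconfident PDF, gamma < 1
    sharpens an underconfident one), renormalizing to unit integral.
    """
    pdf = np.asarray(pdf, dtype=float)
    # Shift the PDF along the redshift axis by dz (zero-padded at edges)
    shifted = np.interp(z_grid - dz, z_grid, pdf, left=0.0, right=0.0)
    out = shifted ** (1.0 / gamma)
    # Renormalize by trapezoidal integration over the grid
    norm = float(np.sum(0.5 * (out[1:] + out[:-1]) * np.diff(z_grid)))
    return out / norm if norm > 0 else out
```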
In tests with a training set of spectroscopic redshifts and with independent sets of spectroscopic and grism redshifts, the optimized PDFs much more closely match the statistical definition of a probability density function than those originally provided by each group, with the normalized ℓ2-norm statistic (a measure of the accuracy of the photo-z CDF) improving by more than a factor of 2 in some cases. Point estimates of the redshift (e.g., z_weight) derived from the optimized photo-z PDFs also exhibit significantly smaller scatter (as measured by the normalized median absolute deviation) and smaller or negligibly worse catastrophic outlier rates, in the best cases yielding photo-z errors of ∼ 0.02(1 + z).
After optimizing the results from individual groups, we have explored the gains from three different methods of combining the six PDFs available for each object: the HB method described in Dahlen et al. (2013), and two methods based on identifying the minimum f-divergence curve, computed with either the ℓ1 distance metric (mFDa) or squared differences (mFDs). We construct new PDFs by applying each method to subsets of the six results for each object. Comparing them to each other with the same statistics used to assess the individual groups' results shows that combining the PDFs from the four best-performing groups produced the best results. The hierarchical Bayesian method yielded the lowest scatter in point statistics, while the minimum f-divergence curve computed with the ℓ1 metric (mFDa) had the lowest ℓ2-norm values, indicating that it provides the most accurate PDFs, and hence the most accurate credible intervals as well.

Figure 19. Diagram of the optimization procedure for obtaining the final photo-z PDF products, starting from the initially provided PDFs. First the PDFs are shifted; the shifted PDFs of each group are then separately raised to a power of 1/γ, using the optimal parameter γ for each field, resulting in optimized PDFs. The PDFs from the different groups are then combined using three different combination methods into final PDFs that can be used for science.

Table 4. Detailed description of the files (e.g., ALL_OPTIMIZED_PDFS_GOODSN_ID00001.pzd) containing the PDFs from each participant as well as the two best combination methods: the PDF from the hierarchical Bayesian combination (HB4) and, in column #9, the PDF from the minimum f-divergence combination computed with the ℓ1 distance metric (mFDa4), each constructed using the PDFs from the best-performing four groups. The number of models used in the combination methods is reported in the header, as well as the value of the parameter from the hierarchical Bayesian method. Note that while the four best participants are included in the evaluation of the combination methods, one or more participants might have missing PDFs for a given object, so the total number of PDFs used is not always 4. Both the original and optimized versions of the PDFs are provided in separate files.

Table 5. Detailed description of the columns of the CANDELS photometric redshift catalogs (e.g., zcat_EGS_v2.0.cat), which provide point statistics constructed from the optimized photometric PDFs, as well as spectroscopic and/or grism redshifts where available, for all objects in each CANDELS field. Each CANDELS object corresponds to one row in the catalog.

Statistics based upon the optimized PDFs from all six groups, as well as the two best combination methods, mFDa4 and HB4, are provided within the catalog. The full set of statistics tabulated for the minimum f-divergence (mFDa4) PDF is detailed; corresponding statistics for each group's results are included in the catalog, with the column identifier differing from that for mFDa4 only in its prefix (i.e., HB4, Finkelstein, Fontana, Pforr, Salvato, Wiklind, or Wuyts).
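The "median-like" minimum f-divergence selection can be sketched as follows for the ℓ1 metric. This is a simplified illustration on a common redshift grid; the actual catalog construction also handles missing PDFs and the per-object selection of the best four groups:

```python
import numpy as np

def min_l1_divergence_pdf(pdfs):
    """Return the member of `pdfs` that minimizes the aggregate L1
    distance to all other members -- the PDF closest in aggregate to
    the rest of the set, analogous to the median of an array.

    `pdfs` is an (n_groups, n_z) array of normalized PDFs on a
    common redshift grid.
    """
    pdfs = np.asarray(pdfs, dtype=float)
    # Pairwise L1 distances between all PDFs
    d = np.abs(pdfs[:, None, :] - pdfs[None, :, :]).sum(axis=2)
    # Pick the PDF whose total distance to the others is smallest
    return pdfs[np.argmin(d.sum(axis=1))]
```

Using squared differences in place of the absolute differences gives the mFDs variant described in the text.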
The optimized methods were then used to estimate photo-z PDFs for galaxies in CANDELS. We constructed publicly available catalogs of optimized PDFs and photometric redshift summary statistics for all objects in the CANDELS photometric catalogs used to calculate the photo-z's. Instructions on how best to use these catalogs are as follows:
• In general, results from different photometric redshift codes are sufficiently different from each other (as can be seen most clearly in Figure 6) that we recommend performing an analysis multiple times, using photo-z point estimates or PDFs from a different group each time, to ensure that conclusions are robust to these variations.
• Different columns of the summary catalog are better for different purposes. For instance, if one wants the best estimate for the redshift of an individual object (where uniformity does not matter), the z_best value from the catalog (which is determined from the combined data set of spectroscopic redshifts, 3D-HST grism redshifts, and mFDa4 photometric redshifts) would be most appropriate. If instead the smallest-scatter estimator of redshift for a uniform sample is needed, HB4_z_weight (the z_weight value computed from the hierarchical Bayesian combination of the four best PDFs for each object) would be most appropriate. This photo-z point estimate yielded σ_NMAD = 0.0227/0.0189 and a |Δz/(1 + z)| > 0.15 outlier fraction of 0.067/0.019 for the testing spec-z's and 3D-HST grism-z's, respectively. We note that redshift uncertainties will never be uniform across galaxy samples, even when restricting uniformly to using only photo-z's, because different objects have different signal-to-noise ratios and different SED shapes with more or less pronounced breaks.
• The mFDa4 method (the minimum f-divergence curve constructed from the best four PDFs from the individual groups) yielded PDFs that best meet the statistical definition of a PDF. As a consequence, this is the preferred set of PDFs to use when the accuracy of credible intervals on redshifts is desired. Correspondingly, the mFDa4_z_weight column of the summary table will have the best-characterized error estimates associated with it.
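The column-selection advice above can be illustrated with a short snippet. The three column names (z_best, HB4_z_weight, mFDa4_z_weight) are taken from the text, but the sample rows and the assumption of a whitespace-delimited format are ours; consult the archive documentation for the authoritative layout of files like zcat_EGS_v2.0.cat:

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row excerpt of a summary catalog, whitespace-delimited.
sample = StringIO("""\
ID z_best HB4_z_weight mFDa4_z_weight
1 1.203 1.198 1.201
2 0.542 0.551 0.548
3 2.870 2.865 2.872
""")
cat = pd.read_csv(sample, sep=r"\s+")

z_individual = cat["z_best"]          # best per-object redshift (spec > grism > mFDa4)
z_uniform = cat["HB4_z_weight"]       # smallest-scatter photo-z for uniform samples
z_calibrated = cat["mFDa4_z_weight"]  # pairs with the best-calibrated credible intervals
```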
The photometric redshift catalogs presented here represent the culmination of a considerable amount of effort by the CANDELS collaboration to obtain a broad range of imaging data, measure uniform photometry with TFIT, and calculate photometric redshifts. They represent a public legacy of the survey that should contribute to a wide variety of science in the future, such as the estimation of stellar masses of galaxies.
We would like to thank Janine Pforr for performing the HyperZ fits. We also wish to acknowledge helpful discussions with Larry Wasserman, Ann Lee, Peter Freeman, and the International Computational Astrostatistics Group at Carnegie Mellon University; Rongpu Zhou; and members of the CANDELS Multiwavelength Catalog Working Group and the LSST Dark Energy Science Collaboration Photometric Redshift Working Group. We appreciate the careful reading and thoughtful suggestions by the referee and the AAS Journals statistician. This work is based on observations taken by the CANDELS Multi-Cycle Treasury Program with the NASA/ESA HST and was supported by HST program No. GO-12060. Support for Program No. GO-12060 was provided by NASA through a grant from the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Incorporated, under NASA contract NAS5-26555. D.C.K. acknowledges support from NSF grant AST-1615730. This research has made use of NASA's Astrophysics Data System.

Facilities: HST (WFC3 and ACS)

APPENDIX A. SPECTROSCOPIC AND GRISM REDSHIFTS

For training and testing the photo-z codes, we compiled spectroscopic and grism redshifts for numerous sources. Table 6 lists the original sources, the relevant CANDELS field, and any quality cuts applied.

Table 6. Details of the spectroscopic and grism redshifts used in this study. For each data set used we provide the name of the survey or instrument used, where applicable; a reference for the source catalog; the number of redshifts provided in each CANDELS field; and any cuts applied in order to restrict to the most robust redshifts.
Table 6 columns: Survey / Instrument (Reference); Number of Redshifts in each field; cuts applied (e.g., a flag to avoid grism-z's and only keep spec-z's).