A Conditional Autoencoder for Galaxy Photometric Parameter Estimation

Astronomical photometric surveys routinely image billions of galaxies, and traditionally infer the parameters of a parametric model for each galaxy. This approach has served us well, but the computational expense of deriving a full posterior probability distribution function is a challenge for increasingly ambitious surveys. In this paper, we use deep learning methods to characterize galaxy images, training a conditional autoencoder on mock data. The autoencoder can reconstruct and denoise galaxy images via a latent space engineered to include semantically meaningful parameters, such as brightness, location, size, and shape. Our model recovers galaxy fluxes and shapes on mock data with a lower variance than the Hyper Suprime-Cam photometry pipeline, and returns reasonable answers even for inputs outside the range of its training data. When applied to data in the training range, the regression errors on all extracted parameters are nearly unbiased with a variance near the Cramér-Rao bound.


Introduction
For decades, astronomers have scanned the sky with increasingly powerful telescopes and cameras, collecting millions of digital images of billions of stars, galaxies, and other objects (York 2000; Skrutskie et al. 2006; Abell et al. 2009; Chambers 2016; Aihara et al. 2018). These "sky surveys" collect so much data that no human could ever look at it all directly, so they rely on automated software (Barbary 2016; Melchior et al. 2018) to detect objects, isolate them from their neighbors, and determine their properties. Assuming the point-spread function is known, an unbiased minimum-variance estimate of a star's position and flux is achievable. The case of extremely crowded stellar fields is still challenging, and a variety of techniques (Stetson 1987; Lang et al. 2016; Schlafly et al. 2018), including trans-dimensional inference (Portillo et al. 2017; Feder et al. 2020), have been applied. In this study, we focus on galaxy images, which are more varied (we do not have an adequate parametric generative model for all galaxy types), more time-consuming to analyze, and more challenging to deblend. High-accuracy galaxy flux and shape estimates are essential for achieving many scientific goals, such as weak lensing (Mandelbaum 2018). Some state-of-the-art processing pipelines such as IMFIT (Erwin 2015) and CModel (Ciambur 2015; Bosch et al. 2018) represent the galaxy with a five-parameter Sérsic model (Graham & Driver 2005) and constrain its parameters via maximum likelihood estimation. However, these methods can be computationally expensive and do not always perform well in crowded images.
Machine learning (ML) presents a promising alternative to model fitting. Rather than viewing the problem via a parametric model with priors on its parameters, we can train a neural network on a training set of images paired with their labels (the parameters of a Sérsic model) and a loss function that penalizes discrepancies between the true labels and predicted labels. ML has been used for classification of galaxy morphology (De La Calleja & Fuentes 2004) and fitting of surface brightness profiles (Tuccillo et al. 2018), and deep neural networks have been applied to the more difficult task of separating the contributions of two or more overlapping galaxies in an image (deblending; Boucaud et al. 2020).
In this paper, we present a method based on a Conditional Autoencoder (CAE; see details in Section 3.2). It takes in a galaxy image, creates a semantically meaningful latent space containing the galaxy parameters plus any other information required to reconstruct the input image, and outputs a denoised galaxy image. Our model successfully predicts the parameters of a Sérsic profile: the Sérsic index, the Sérsic radius, ellipticity, orientation, position, and flux. The model achieves high accuracy in predicting those parameters, with low bias and errors near the Cramér-Rao bound. Parameter regression can also be done without a CAE (i.e., without the image reconstruction part), but the ability to output a denoised image and represent the image in a lower-dimensional space is an advantage that can be used for downstream tasks such as deblending.
There is another advantage of combining parameter estimation and image reconstruction. During the training for image reconstruction (suppressing the galaxy parameter loss term), the training can get stuck at local minima that are entirely black images with some random noise. Having the galaxy parameter regression loss helps the network escape these local minima.
In addition, we see this approach as building a foundation for future work on deblending, which would require that the model be insensitive to other nearby galaxies. In this case, the reconstructed image is of the central galaxy only, and the comparison of this output with the input image is an essential part of the loss function. Although we think this is an important aspect of our model, we only apply it to isolated galaxies in this paper.
This paper is organized as follows. The background is discussed in Section 2. The main method is described in Section 3. The training results are presented in Section 4. The results of the evaluation and the conclusions are discussed in Section 5.

Simulated Galaxy Data
Our long-term goal is to apply a CAE to real data, but careful validation and performance assessment require tests on mock data. Galaxies have a wide range of intrinsic properties such as luminosity, morphological type, and size. Historically, the Hubble classification system sorted galaxies into elliptical, spiral, and irregular classes based on the distribution of stellar light. Elliptical galaxies are well described by a de Vaucouleurs profile and spiral galaxy disks by an exponential profile. The Sérsic profile unifies these,

I(R) = I_e \exp\left\{ -b_n \left[ \left( R/R_e \right)^{1/n} - 1 \right] \right\},

where I is the intensity at radius R, I_e is the intensity at the half-light radius R_e, n is the Sérsic index, and b_n is a scaling factor completely determined by n. A Sérsic profile with index n = 1 (b_1 = 1.678) is an exponential profile, and n = 4 (b_4 = 7.669) corresponds to a de Vaucouleurs profile. This profile is then distorted by an affine transformation to represent the orientation of the galaxy with respect to the line of sight. The affine transformation includes a small random shift in position to ensure the neural network does not depend sensitively on the centering of the input galaxy.
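For concreteness, the Sérsic profile can be evaluated with a few lines of NumPy. This is our own sketch (function and variable names are not from the paper), using the b_n values quoted in the text:

```python
import numpy as np

def sersic_profile(R, I_e, R_e, n, b_n):
    """Sersic surface-brightness profile I(R) = I_e * exp(-b_n * ((R/R_e)^(1/n) - 1))."""
    return I_e * np.exp(-b_n * ((R / R_e) ** (1.0 / n) - 1.0))

# b_n values quoted in the text: n = 1 (exponential), n = 4 (de Vaucouleurs)
B_N = {1: 1.678, 4: 7.669}
```

By construction, I(R_e) = I_e for any index n, which is a convenient sanity check on any implementation.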
The SDSS photo pipeline (Abazajian et al. 2004) and the HSC pipeline (Bosch et al. 2018) both use CModel photometry (Bosch et al. 2018) to describe the galaxy photometry. CModel approximates the galaxy as a linear combination of an exponential profile (n = 1) and a de Vaucouleurs profile (n = 4). In this paper we use the Sérsic model with Sérsic index between 0.5 and 6, as shown in Table 1.
We trained our CAE on GalSim (Rowe et al. 2015) realizations of the Sérsic profile to represent galaxies, convolved with the Moffat profile for the point-spread function (PSF). After convolving the galaxy with its PSF, we add a constant level of Gaussian noise to the image.
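The paper generates images with GalSim; as a self-contained illustration of the same pipeline (a circular Sérsic profile, a truncated Moffat kernel, and Gaussian noise, with hypothetical parameter values and SciPy for the convolution), one might write:

```python
import numpy as np
from scipy.signal import fftconvolve

def moffat_kernel(fwhm_pix, beta=2.0, size=25):
    """Normalized Moffat PSF kernel: I(r) ~ (1 + (r/alpha)^2)^(-beta),
    with FWHM = 2 * alpha * sqrt(2^(1/beta) - 1)."""
    alpha = fwhm_pix / (2.0 * np.sqrt(2.0 ** (1.0 / beta) - 1.0))
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    k = (1.0 + (x ** 2 + y ** 2) / alpha ** 2) ** (-beta)
    return k / k.sum()

def mock_galaxy(nxy=64, flux=1e4, r_e_pix=4.0, n=1.0, b_n=1.678,
                fwhm_pix=3.0, noise_sigma=5.0, seed=0):
    """Noisy mock image: circular Sersic profile, Moffat PSF, Gaussian noise."""
    y, x = np.mgrid[:nxy, :nxy]
    r = np.hypot(x - nxy / 2, y - nxy / 2)
    gal = np.exp(-b_n * ((r / r_e_pix) ** (1.0 / n) - 1.0))
    gal *= flux / gal.sum()                      # normalize to the total flux
    img = fftconvolve(gal, moffat_kernel(fwhm_pix), mode="same")
    rng = np.random.default_rng(seed)
    return img + rng.normal(0.0, noise_sigma, img.shape)
```

Because the kernel is normalized and the galaxy is well inside the stamp, the noisy image sum stays close to the requested flux.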
The seven key physical parameters we used to characterize the galaxy image are its flux, the Sérsic index and radius, its orientation and ellipticity, and its position (x and y). In practice, the orientation and ellipticity are specified by the reduced-shear parameters (Kilbinger 2015)

g_1 = \frac{1-q}{1+q} \cos 2\phi, \qquad g_2 = \frac{1-q}{1+q} \sin 2\phi,

where q is the axis ratio describing the ellipticity and φ is the orientation (position angle) of the galaxy. The flux is the total number of photoelectrons in the image before adding noise, summed over all pixels, and it is proportional to the intensity of the image. The ranges of values for the seven parameters are shown in Table 1.
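A minimal sketch of this (q, φ) ↔ (g1, g2) conversion, assuming q denotes the minor-to-major axis ratio (function names are ours):

```python
import numpy as np

def shape_to_g1g2(q, phi):
    """Reduced-shear components from axis ratio q and position angle phi (radians):
    g = (1 - q) / (1 + q), g1 = g * cos(2 phi), g2 = g * sin(2 phi)."""
    g = (1.0 - q) / (1.0 + q)
    return g * np.cos(2.0 * phi), g * np.sin(2.0 * phi)

def g1g2_to_shape(g1, g2):
    """Inverse mapping back to (q, phi)."""
    g = np.hypot(g1, g2)
    return (1.0 - g) / (1.0 + g), 0.5 * np.arctan2(g2, g1)
```

The factor of 2φ reflects the spin-2 symmetry of an ellipse: rotating by 180° returns the same shape, so (g1, g2) is single-valued.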
The following additional parameters are used to generate the training and testing images:
1. nPixels = 64. Each galaxy image is 64 by 64 pixels.
2. PixelScale = 0.23. The pixel scale of the image is 0″.23 pixel⁻¹.
3. PSFBeta = 2, with PSFRe drawn randomly and uniformly between 0.5 and 1. The PSF has a Moffat profile, with a β parameter equal to 2 and a full width at half maximum (FWHM) that varies randomly between 0″.5 and 1″. The β predicted from atmospheric turbulence theory is 4.765 (Trujillo et al. 2001), but the values measured from astronomical images are often closer to 2. Indeed, the PSF of a ground-based telescope often departs from a Moffat profile, is not perfectly round, and may vary across the image. Throughout this work we disregard these complexities and use a Moffat profile with β = 2, with the caveat that future work applying such a neural net to real data may need to consider a more general form of PSF variation.
A total of 40,000 sets of physical parameters were randomly drawn from the ranges described in Table 1, and used to generate 40,000 galaxy images. These images were used to train and test the performance of the neural network. Examples of images obtained for various physical parameters are shown in Section 4.

Data Preprocessing
The Sérsic shape parameters (e.g., index and radius) are somewhat degenerate with the PSF FWHM, so the point-spread function (PSF) must be included in the input data. We preprocessed the galaxy image and a centered, normalized image of the PSF in six ways to form six-channel input data, as illustrated in Figure 1, where i is the raw noisy galaxy image, σ is the pixel-wise Gaussian noise level, and var is the variance across the image (including signal and noise).
Concatenating variously normalized data as input helps the CAE be more robust against changes in the noise level and encourages disentangling of flux from noise level. Since the PSF is nearly degenerate with the Sérsic radius, we input the PSF and PSF² to help disentangle the latent representation of the Sérsic radius from the PSF, and thus improve the model's physical parameter regression accuracy in the latent space. Mathematically speaking, the input data contain redundant information, but we find that feeding this redundancy to the network improves the training speed and accuracy.
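The exact six channels are specified in Figure 1; the stack below is only an illustrative guess at such a normalization scheme (every channel choice here is our assumption, not the paper's recipe):

```python
import numpy as np

def six_channel_input(img, psf, sigma):
    """Hypothetical six-channel stack: differently normalized copies of the
    noisy image plus the PSF and its square. The exact channels used in the
    paper are those shown in its Figure 1, not necessarily these."""
    var = img.var()                                  # signal + noise variance
    channels = [
        img / sigma,                                 # noise-normalized image
        img / np.sqrt(var),                          # variance-normalized image
        img / np.abs(img).sum(),                     # flux-normalized image
        np.full_like(img, sigma),                    # constant noise-level map
        psf,                                         # centered, normalized PSF
        psf ** 2,                                    # PSF squared
    ]
    return np.stack(channels)                        # shape (6, ny, nx)
```

The point of the sketch is the shape convention: each normalization is one channel, so a standard 2D CNN can consume the stack directly.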

Autoencoders
An autoencoder (Ballard 1987;Hinton & Salakhutdinov 2006;Baldi 2012) is a bottleneck-shaped feed-forward neural network that performs unsupervised learning. The goal of an autoencoder is to learn a compressed representation (encoding) of the unlabeled data (Hinton & Salakhutdinov 2006). An autoencoder contains two parts, the encoder and the decoder. The encoder encodes the high-dimensional input (e.g., pixel values in an image) into a low-dimensional representation (latent representation). Note that a linear version of an encoder is equivalent to principal component analysis. The decoder decodes the latent representation to reconstruct an approximation of the input with greatly reduced noise (denoising autoencoder) (Vincent et al. 2008).

CAE Architecture
In this work we use an architecture we call the conditional autoencoder (CAE). This is similar to the conditional variational autoencoder (CVAE) (Sohn et al. 2015), which conditions its mapping from the inputs to the latent space, and from the latent space to the outputs, on some additional parameter (for example an input label). The goal of our neural network is to output parameters and a reconstructed image, not distributions. That is, it lacks the variational property of a CVAE, so we remove "variational" from the name. However, our CAE is conditional on the image mean, which is passed directly to a layer following the convolutional layers, so we retain the word "conditional" in the name. In addition, our CAE encourages some of the latent variables to correspond to data labels, in this case, the Sérsic parameters of the galaxy images. As a result, the encoder part of the CAE can output photometric parameters of interest.
Our model is illustrated in Figure 1. The encoder encodes a galaxy image into a semantically meaningful latent space, which includes the physical parameters that characterize the galaxy. The decoder uses the extracted latent representation to reconstruct the noiseless galaxy image. Note that the ground truth for the decoder output is the noiseless galaxy image convolved with the PSF. The encoder has nine convolutional layers (Krizhevsky et al. 2012) followed by one fully connected layer, which has one extra neuron added to input the flux information; conditioning on the mean image flux improves the accuracy of the physical parameters in the latent space. The latent space dimension is 144, of which 7 dimensions are constrained to be physical parameters. The decoder is composed of one fully connected layer, followed by six transpose convolutional layers with stride 2 to upscale. Details are available on GitHub (https://github.com/JunYinDM/GalaxyGenerativeModel).
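A heavily abridged PyTorch sketch of this architecture follows. The layer counts and channel widths here are illustrative only (the paper uses nine convolutional and six transpose-convolutional layers; the exact model is in the GitHub repository); what the sketch shows is the conditioning on the mean flux and the split of the latent vector into physical and residual parts:

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Abridged conditional autoencoder sketch (not the paper's exact model)."""

    def __init__(self, n_latent=144, n_phys=7):
        super().__init__()
        self.n_phys = n_phys
        # Encoder: stride-2 convs collapsing the 6x64x64 input (3 of 9 layers shown).
        self.enc = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
        )
        # Fully connected layer with one extra input neuron for the mean flux.
        self.fc_enc = nn.Linear(128 * 8 * 8 + 1, n_latent)
        # Decoder: FC, then stride-2 transpose convs (3 of 6 layers shown).
        self.fc_dec = nn.Linear(n_latent, 128 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),               # 32 -> 64
        )

    def forward(self, x, mean_flux):
        h = self.enc(x)
        z = self.fc_enc(torch.cat([h, mean_flux[:, None]], dim=1))
        phys = z[:, :self.n_phys]        # latents constrained to Sersic parameters
        recon = self.dec(self.fc_dec(z).view(-1, 128, 8, 8))
        return recon, phys, z
```

Reading off the encoder, the 7 constrained latents are the photometric measurements; the remaining latents carry whatever else the decoder needs to rebuild the image.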

Loss Function
The loss function contains three terms: galaxy image reconstruction, L1 regression on the physical parameters, and the L1-norm of the remaining latent parameters. The L1-norm is used to promote sparsity and avoid overfitting. We use \hat{y}^\mu \in \mathbb{R}^{64 \times 64} to denote the μ-th reconstructed galaxy image and y^{\mu*} to denote the input galaxy image without noise. The latent parameters, z^\mu \in \mathbb{R}^{n_{\rm Latent}}, consist of two parts: z_P^\mu corresponds to the physical latent parameters, and the rest is denoted z_R^\mu. We also denote the ground truth of the physical latent parameters as z_P^{\mu*}. The loss function in terms of these variables is

\mathcal{L} = \sum_\mu \left[ \alpha^\mu \left\| \hat{y}^\mu - y^{\mu*} \right\|_2^2 + \beta \left\| z_P^\mu - z_P^{\mu*} \right\|_1 + \lambda \left\| z_R^\mu \right\|_1 \right].

The total image variance contains a contribution from Poisson noise (the sum of the pixel values, \sum_i y_i^\mu) and variance from all other sources (sky background, readout electronics, etc.), σ². We define the weight coefficient α^\mu to be the inverse of this variance, \alpha^\mu = \left( \sum_i y_i^\mu + \sigma^2 \right)^{-1}. Weighting the galaxy reconstruction loss by the inverse of the image total counts and the Gaussian noise level gives the NN a relatively low error tolerance for reconstructing low-SNR images and a high error tolerance for reconstructing high-SNR images. The physical parameter regression loss constrains some of the latent parameters, forcing 7 of them to contain the physical parameters that describe a galaxy. During training, we alternate the loss function between L1 and L2 losses to further improve the result (Zhao et al. 2016).
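A direct NumPy transcription of this per-image loss might look as follows (variable names are ours; this shows the L2 variant of the reconstruction term, whereas the paper alternates between L1 and L2 during training):

```python
import numpy as np

def cae_loss(y_hat, y_true, z_phys, z_phys_true, z_rest,
             sigma, beta=2.0, lam=1.0):
    """Per-image CAE loss: inverse-variance-weighted reconstruction term
    plus L1 regression on the physical latents and L1 sparsity on the rest."""
    alpha = 1.0 / (y_true.sum() + sigma ** 2)      # Poisson + Gaussian variance
    recon = alpha * np.sum((y_hat - y_true) ** 2)  # weighted reconstruction
    regress = beta * np.sum(np.abs(z_phys - z_phys_true))
    sparsity = lam * np.sum(np.abs(z_rest))
    return recon + regress + sparsity
```

The α weight makes the reconstruction term roughly scale-free: bright images are allowed proportionally larger absolute residuals than faint ones.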
We construct the NN with PyTorch (Paszke et al. 2019). We train the NN with the Adam optimizer (Kingma & Ba 2014) with an initial learning rate of 0.01, and divide the learning rate by 10 at epochs 30, 100, 500, and 800. We used early stopping, and the training takes about 10 GPU hr on an NVIDIA Tesla V100. We first set the scaling factors β and λ small, at 10⁻³, and trained the network for 2 hr to mainly optimize the galaxy reconstruction result. Then we increased β and λ gradually to β = 2 and λ = 1 and trained the network for another 8 hr to optimize both the galaxy reconstruction and the physical parameter regression results.

Results
The CAE model achieves high performance in reconstructing galaxy images while extracting their physical parameters in the latent space. The optimal number of non-zero latent dimensions is found to be 34 by cross-validation. The physical parameter regression results are shown in Figure 2, and the correlations between the predicted physical parameters and the true values are summarized in Table 2. Examples of galaxy reconstruction are shown in Figure 3. We evaluate our results by comparing them with the HSC photometry pipeline performance and with the statistical limit given by the Cramér-Rao bound (CRB).
As stated in the HSC photometry pipeline paper, the pipeline "achieves a 13% photometric precision at i ∼ 20.0 mag and a 18% precision at i ∼ 25.0 in the i band (corresponding to statistical scatters of ∼0.15 and ∼0.22 mag respectively)" for synthetic galaxies with single-Sérsic profiles injected into HSC images (Huang et al. 2018). Thus the root-mean-squared (rms) error is 0.22 for galaxies with SNR = 10 and 0.08 for galaxies with SNR = 1000. Our CAE model achieves an rms of 0.18 for galaxies with SNR equal to 10 and 0.05 for galaxies with SNR equal to 100, and therefore outperforms the HSC photometry pipeline (Bosch et al. 2018), at least in terms of flux estimates. These comparisons are also illustrated in Figure 4. Note that Huang et al. (2018) evaluated galaxies that were injected into an HSC image, a more realistic and complicated setting, which may explain why their rms error is greater than ours.
This raises the question: how well should we expect to be able to estimate flux and the other parameters? The information content of the data (in the Gaussian case) is expressed by the Fisher information matrix,

F_{jk} = E_D \left[ \frac{\partial \ln \mathcal{L}}{\partial \theta_j} \frac{\partial \ln \mathcal{L}}{\partial \theta_k} \right],

where \ln \mathcal{L}(D | \theta) is the log-likelihood of observing data D given parameters θ and model X(θ), in our case a sum over image pixels i, and the expectation E_D averages over realizations of the data. We assume uncorrelated noise between pixels. The inverse of this matrix represents the curvature of the log-likelihood surface near the peak, and provides a theoretical lower bound on the uncertainty of the estimate of θ. This lower bound is known as the Cramér-Rao bound (CRB), which for an unbiased estimator \hat{\theta} is given by

\mathrm{Var}(\hat{\theta}_j) \geq \left( F^{-1} \right)_{jj}.

Even when we only have mock data, we can estimate the CRB by substituting a model for D. Let X(θ) be a Sérsic model image evaluated for parameters θ. To evaluate the CRB at a specific point in parameter space, θ₀, we evaluate the model there, X(θ₀), and take the partial derivatives of the log-likelihood with respect to θ, which we approximate with a finite-difference method. The θ values used to calculate the CRB are shown in Table 3. We used these parameters to make 1000 galaxy image realizations with Gaussian noise, then extracted the physical parameters with our trained model. We compared the CRB with the network performance for galaxy images with SNR equal to 30 and 60. The results are summarized in Table 3 and illustrated in Figure 5. Our prediction errors are close to the CRB for all parameters except the Sérsic index, indicating that the neural network is using nearly all the relevant information in the image. The CRB is the minimum variance bound for an unbiased estimator; for the Sérsic index, the rms is smaller than the CRB, but the estimate is biased.
(Figure 3 caption: Galaxy reconstruction results. First row: galaxy image without noise; note that the noiseless image is not an input for either training or testing. Second row: galaxy image with noise; this is the CAE input image. Third row: galaxy reconstructed by the decoder. Fourth row: the difference between the reconstructed image and the original noiseless galaxy. The difference is small compared to the input image (by an order of magnitude) and has an rms usually below the noise level.)
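The finite-difference CRB estimate described above can be sketched for a generic image model with uncorrelated Gaussian pixel noise (for Gaussian noise, F_jk reduces to the sum over pixels of ∂X_i/∂θ_j ∂X_i/∂θ_k / σ²; function names are ours):

```python
import numpy as np

def fisher_matrix(model, theta0, sigma, eps=1e-4):
    """Fisher matrix for a Gaussian-noise image model via central differences:
    F_jk = sum_i (dX_i/dtheta_j) * (dX_i/dtheta_k) / sigma^2."""
    theta0 = np.asarray(theta0, dtype=float)
    grads = []
    for j in range(theta0.size):
        dt = np.zeros_like(theta0)
        dt[j] = eps
        grads.append((model(theta0 + dt) - model(theta0 - dt)).ravel() / (2 * eps))
    G = np.array(grads)                 # (n_params, n_pixels) derivative matrix
    return G @ G.T / sigma ** 2

def cramer_rao_bound(model, theta0, sigma):
    """Per-parameter CRB: sqrt of the diagonal of the inverse Fisher matrix."""
    return np.sqrt(np.diag(np.linalg.inv(fisher_matrix(model, theta0, sigma))))
```

For a model that is linear in one amplitude parameter, X = A·t, this reduces to the familiar matched-filter result σ/√(Σ t²), a useful check.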

Results on Galaxies in CANDELS
We tested our algorithm on real galaxy images from the CANDELS data set, which is based on the CANDELS bulge/disk decomposition catalog and images. We chose this data set because HST images have high resolution and accurate flux, radius, and shape parameters. Thus we could add a known PSF and noise to the galaxy images for model training, and use known labels for model evaluation.
We use data taken in the F160W filter (1.6 μm, roughly H band), with zero-point 25.94 for H_F160W. We discard irregular galaxies, since a Sérsic index cannot be assigned to them. The remaining data set has about 800 spheroid and disk galaxies. The original pixel size was 0″.08; we rebinned the images to a pixel size of 0″.16 and then smoothed with a Gaussian PSF with FWHM 0″.5. These numbers were chosen to be similar to the simulated data used in the original neural net training. We used transfer learning to fine-tune the CAE model: we retrained the CAE on CANDELS galaxy images starting from the pretrained weights obtained on the simulated galaxies. In the retraining, we kept the extra 137 latent dimensions, along with the same regularization as before; the number of non-zero latents remained at 34. It took about 10 minutes of GPU time (200 epochs) for the loss to converge. The model achieves high performance in predicting flux and radius, as shown in Figure 6. The mean error for predicting flux is 494 and the standard error is 5110. The mean error for predicting radius is 0″.015 and the standard error is 0″.14. For the Sérsic index, the CANDELS data set assigns the label n = 4 for spheroid galaxies and n = 1 for disk galaxies; in reality, galaxies can have intermediate Sérsic indices, and the CAE model predicts a continuous value. The results for Sérsic index prediction are shown in Figure 7. The CAE model also reconstructs denoised galaxy images with residuals typically <10%, as shown in Figure 8. The difference between the CAE-reconstructed galaxy image and the input is indistinguishable from the noise map, as desired, and the difference between the reconstructed galaxy image and the noiseless galaxy is small. This test on real CANDELS data suggests that the type of CAE developed here may be applicable to a wide variety of galaxy observations, after a small amount of retraining.
(Table 3 note: the CAE column gives the prediction errors, calculated as prediction minus true value, and the prediction uncertainty. The uncertainty is close to the CRB in most cases; for the Sérsic index, the rms is smaller than the CRB, but the estimate is biased.)
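The rebinning-and-smoothing step can be sketched as a 2×2 block sum followed by a Gaussian blur. The FWHM-to-σ conversion (σ = FWHM/2.355) assumes a Gaussian PSF; function names are ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def rebin2x2(img):
    """Sum 2x2 pixel blocks: e.g., a 0.08"/pix HST image -> 0.16"/pix."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

def degrade(img, fwhm_arcsec=0.5, pixscale=0.16):
    """Rebin, then smooth with a Gaussian PSF of the given FWHM (arcsec)."""
    sigma_pix = fwhm_arcsec / pixscale / 2.355   # FWHM -> Gaussian sigma, in pixels
    return gaussian_filter(rebin2x2(img), sigma_pix)
```

Summing (rather than averaging) the 2×2 blocks preserves the total flux, which matters because flux is one of the regression labels.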

Conclusion
In this paper, we developed an ML model that estimates Sérsic parameters from simulated galaxy images with a size and noise level representative of large ground-based surveys. With a view to extending this work to deblending galaxy images in the future, we chose an approach (a conditional autoencoder) that outputs both a reconstructed image and the galaxy parameters. We find that the image reconstruction is acceptable (residuals at the ∼10% level, see Figure 3) and the parameter estimates are reliable. The bias is approximately 0.5σ for Sérsic n, and less for the other parameters. We compare the uncertainty for each parameter to the theoretical minimum (the Cramér-Rao bound) and find good agreement (Figure 5). This performance compares favorably to the published performance of a cutting-edge conventional pipeline from 2018 (Huang et al. 2018). We note that that work injected galaxies into images with realistic backgrounds, a more challenging task than we undertake in this paper. A fair comparison would require running on the same data.
The application of the CAE to simulated data is only a first step. In order to demonstrate that this approach will work on real galaxies, we applied it to a sample of galaxies observed by HST from the CANDELS survey (Section 4.1). These galaxy images are very sharp and low-noise compared to the simulated training data, so they play the role of noiseless, "ground truth" images in the comparison. We then smooth and add noise to make them similar to the simulated training data. Real galaxies deviate from a simple Srsic profile of course, so a modest amount of retraining of the CAE is necessary. We show that the reconstruction and parameter estimates for the CANDELS images are of similar quality to those for simulated images.
In the future, we plan to develop this CAE into a galaxy deblender that takes an image of blended galaxies as an input, and outputs photometric parameters and a denoised image for each galaxy.
It is a pleasure to acknowledge Ben Johnson and Sandro Tacchella who provided useful discussions. J.E.Y. is partially supported by the Harvard Data Science Initiative. D.J.E. is supported by U.S. Department of Energy grant DE-SC0013718, by a JWST/NIRCam contract to the University of Arizona, NAS5-02015, by NASA ROSES grant 12-EUCLID12-0004, and as a Simons Foundation Investigator. D.P.F. is partially supported by NSF grant AST-1614941.