Unsupervised learning and data clustering for the construction of Galaxy Catalogs in the Dark Energy Survey

Large scale astronomical surveys continue to increase their depth and scale, providing new opportunities to observe large numbers of celestial objects with ever increasing precision. At the same time, the sheer scale of ongoing and future surveys poses formidable challenges for the classification of astronomical objects. Pioneering efforts on this front include the citizen science approach adopted by the Sloan Digital Sky Survey (SDSS). These SDSS datasets have been used recently to train neural network models to classify galaxies in the Dark Energy Survey (DES) that overlap the footprint of both surveys. While this represents a significant step to classify unlabelled images of astrophysical objects in DES, the key issue still remains, i.e., the classification of unlabelled DES galaxies that have not been observed in previous surveys. To start addressing this timely and pressing matter, we demonstrate that knowledge from deep learning algorithms trained with real-object images can be transferred to classify elliptical and spiral galaxies that overlap both SDSS and DES surveys, achieving a state-of-the-art accuracy of 99.6%. More importantly, to initiate the characterization of unlabelled DES galaxies that have not been observed in previous surveys, we demonstrate that our neural network model can also be used for unsupervised clustering, grouping together unlabelled DES galaxies into spiral and elliptical types. We showcase the application of this novel approach by classifying over ten thousand unlabelled DES galaxies into spiral and elliptical classes. We conclude by showing that unsupervised clustering can be combined with recursive training to start creating large-scale DES galaxy catalogs in preparation for the Large Synoptic Survey Telescope era.


I. INTRODUCTION
Large scale astronomical surveys provide key insights into the large scale structure of the Universe, its geometry and evolution in cosmic time. As the depth and scale of these surveys continue to increase in years to come, they will push back the frontiers of our understanding of dark matter and dark energy [1][2][3][4].
One of the best observational probes to study the expansion of the universe, and thereby of dark energy, is the observation of distant Type Ia supernovae. By observing thousands of these celestial objects, it may be possible to map the distance-redshift curve over a wide range of redshifts with unprecedented accuracy [5].
In addition to this purely electromagnetic probe, the combination of gravitational wave observations with large scale galaxy catalogs has enabled the first gravitational wave standard-siren measurement of the Hubble constant [6]. This approach, envisioned by Schutz [7], has also been explored without assuming the existence of an electromagnetic counterpart [8]. This latter approach now opens up the way to using a large number of ground-based gravitational wave observations of binary black hole mergers, which may not have an electromagnetic counterpart, to enable precision gravitational wave cosmological measurements of the Hubble constant up to redshift z ≲ 1. Similar observations at higher redshifts may be possible in the context of supermassive black hole mergers with space-borne gravitational wave missions [9]. Since the gravitational wave observation of binary black hole mergers has now become a common occurrence [10], the next frontier to enable this science is the construction of galaxy catalogs at higher redshifts.
To realize this science, we need to address outstanding data science challenges regarding the sheer scale of ongoing and future surveys. For instance, the Sloan Digital Sky Survey (SDSS) [11] observed hundreds of thousands of galaxies, which were classified through a remarkably successful citizen science program. Follow-up analyses using deep learning to classify these celestial objects based on their morphology have also reported excellent results [12].
In the context of the Dark Energy Survey (DES) [13], we now have a unique opportunity to innovate by creating new types of signal processing tools that are adequate to classify hundreds of millions of unlabelled galaxies. For instance, in [14] the authors designed a neural network model, and trained it with SDSS galaxy images that overlap the footprint of both SDSS and DES, reporting classification accuracies of ∼ 95%.
In this article, we present novel methods to further advance the goal of creating large-scale galaxy catalogs, using DES as a driver for these studies.
• We demonstrate that knowledge from deep learning algorithms, trained for real-world object recognition, can be transferred to classify SDSS galaxies into spiral and elliptical classes with state-of-the-art accuracies of 99.8%.
• We use the aforementioned Deep Transfer Learning SDSS-seed neural network model to demonstrate that unlabelled DES galaxies that overlap the footprint of SDSS are correctly classified into spiral and elliptical classes with state-of-the-art accuracies of 99.6%.
• We introduce the first application of unsupervised learning and data clustering to classify over ten thousand unlabelled DES galaxies, which have not been observed in previous surveys, into spiral and elliptical classes.
arXiv:1812.02183v1 [astro-ph.IM] 5 Dec 2018
The method we introduce herein can be combined with recursive training to produce large-scale galaxy catalogs in DES, i.e., once unlabelled DES galaxies are clustered into spiral and elliptical classes, these new labelled datasets can be used to retrain the original deep learning algorithm, boosting its accuracy and robustness to classify unlabeled DES galaxies in bulk in new regions of parameter space, as we demonstrate in Section III.
To address the case of high-redshift galaxy classification, one may use data augmentation to prepare datasets that resemble galaxies at higher redshifts by contaminating them with noise anomalies. Thereafter, one can design and train a neural network that first denoises images, and upon identifying potential candidate galaxies, clusters them according to their features [15]. This approach lays the foundations to exploit transfer learning, unsupervised clustering and recursive training to produce large-scale galaxy catalogs in the Large Synoptic Survey Telescope (LSST) [16].
This paper is organized as follows. Section II presents the approach followed to curate the datasets and deep learning algorithms designed and trained for our analyses. In Section III, we demonstrate the applicability of our methods to classify galaxies in SDSS, galaxies that overlap SDSS and DES, and finally, the applicability of our approach to correctly classify thousands of unlabelled DES galaxies through unsupervised clustering. Finally, Section IV summarizes our findings and future directions of work.

II. METHODS
In this section we describe the SDSS and DES datasets used to train and test our deep learning algorithms. We also describe the design and construction of the neural network models used for unsupervised learning and data clustering to classify unlabelled DES galaxy images.

A. Data Curation for SDSS and DES
We use a subset of SDSS Data Release (DR) 7 images for which we have high confidence classifications through the Galaxy Zoo project. We then divide these images into three orthogonal datasets for training, validation and testing. The validation set is used to monitor the accuracy and loss when training and fine-tuning our deep neural network. The test set is carefully constructed so that each image lies in both the SDSS and the DES footprint. This latter dataset was obtained from the DES DR1 data release. We have labelled these image datasets High Probability (HP) Test Sets, and there are two versions, one for each survey, i.e., HP SDSS and HP DES.
Furthermore, we created a second test set consisting of almost all galaxies that lie in both SDSS and DES footprints, this time without imposing any threshold on the Galaxy Zoo classification confidence. These datasets are labelled Full Overlap (FO) Test Sets, and again there are two versions, i.e., FO SDSS and FO DES.
The properties of these datasets are summarized in Table I, while their probability distributions are presented in Fig. 1. A sample of the training SDSS dataset, and the HP Test set images are presented in the top and bottom panels of Fig. 2, respectively. Note that the probability cutoffs differ between the two classes so as to obtain similar numbers of spiral and elliptical SDSS galaxies for training.

SDSS Dataset
We used the de-biased probabilities for elliptical and combined spiral classes described in Table 2 of [17] to create labels for the two classes of our training and test sets.After selecting the OBJIDs from Table 2 based on the probability thresholds of 0.985 and 0.926 for spirals and ellipticals respectively, we submit SQL queries to the SDSS Skyserver [18] to obtain g, r and z-band images and metadata from the PhotoObj table.
Thereafter, each galaxy is 'cut-out' from the downloaded telescope fits files for each band.
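As an illustration, the threshold-based labelling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in this work: the row layout `(objid, p_spiral, p_elliptical)` and the function names are hypothetical, and whether the cuts are inclusive is an assumption. Only the two threshold values come from the text.

```python
# Thresholds on the de-biased Galaxy Zoo probabilities quoted in the text.
SPIRAL_THRESHOLD = 0.985
ELLIPTICAL_THRESHOLD = 0.926


def assign_label(p_spiral, p_elliptical):
    """Return 'spiral', 'elliptical', or None for a low-confidence galaxy.

    Inclusive comparisons (>=) are an assumption of this sketch.
    """
    if p_spiral >= SPIRAL_THRESHOLD:
        return "spiral"
    if p_elliptical >= ELLIPTICAL_THRESHOLD:
        return "elliptical"
    return None


def select_high_confidence(catalog):
    """Map OBJID -> label for the rows that pass one of the thresholds.

    `catalog` is a hypothetical iterable of (objid, p_spiral, p_elliptical).
    """
    labels = {}
    for objid, p_spiral, p_elliptical in catalog:
        label = assign_label(p_spiral, p_elliptical)
        if label is not None:
            labels[objid] = label
    return labels
```

The resulting OBJIDs would then be the ones passed on to the image queries; galaxies that pass neither cut are simply left out of the high-confidence sets.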
Bearing in mind that the neural network model we are using for transfer learning (the Xception model [20], pre-trained with the 2014 ImageNet dataset [19]) was originally trained with images of size 224 × 224 × 3, we have resized all the galaxy sub-images to be 224 × 224 pixels using the scikit-image library [21], and then stacked the three filters together to create a color image of size 224 × 224 × 3. Finally, these sub-images are mean subtracted and normalized to convert the pixel values to the -1 to 1 range centered around 0, following best practices of neural network training [22]. These curated datasets serve as the input tensor into our deep neural network model.
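The stacking and normalization step can be illustrated with a short NumPy sketch. It assumes that the three band images have already been resized to a common shape and hold 8-bit pixel values in [0, 255]; both are assumptions made for illustration, and the function name is hypothetical. The linear map used here is one common way to reach the -1 to 1 range and is not necessarily the exact normalization applied in this work.

```python
import numpy as np


def stack_and_scale(band_a, band_b, band_c):
    """Stack three 2-D band images into an H x W x 3 array scaled to [-1, 1].

    Assumes equal-shaped inputs with pixel values in [0, 255].
    """
    image = np.stack([band_a, band_b, band_c], axis=-1).astype(np.float32)
    # Linear map: 0 -> -1, 127.5 -> 0, 255 -> +1.
    return image / 127.5 - 1.0
```

For three 224 × 224 band images this returns an array of shape (224, 224, 3), which matches the input size quoted above.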
We developed all these scripts, to download and preprocess data, as an open source Python software stack. To facilitate and streamline these tasks at scale, we incorporated Message Passing Interface (MPI) [23] to exploit multiple nodes on supercomputers for fast parallel computation. In our case, the data extraction and curation was done using the Blue Waters Supercomputer [24].

DES Dataset
The same steps are repeated to first select the DES DR1 metadata and images from the NCSA DESaccess web [25], and then to cut-out, preprocess and stack the filters together to create a Lupton RGB image of size 224 × 224 × 3 using the Astropy branch [26]. Additionally, the Astropy method match_to_catalog_sky is used to crossmatch DES and SDSS catalogues to within 1 arcsec. Finally, we pick a random sample of ∼ 10,000 bright DES galaxies to quantify the clustering performance of our neural network model.

For the classification problem we do transfer learning starting with the Xception model [20], which has been pre-trained with the ImageNet [19] dataset. We choose this neural network model because it outperforms other state-of-the-art neural network models, including Inception-v3 [27], ResNet-152 [28] and VGG16 [29]. More importantly, we carried out several experiments using all these architectures and found that Xception exhibits better performance on our validation and testing galaxy datasets. The deep learning APIs used are Keras [30] and TensorFlow [31].
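The 1 arcsec crossmatch between the DES and SDSS catalogues mentioned above can be illustrated without any external dependency. The sketch below is a brute-force stand-in written for clarity, not the Astropy routine itself: it computes great-circle separations with the haversine formula and keeps, for each DES source, the nearest SDSS source if it lies within the tolerance. The function names and the `(ra_deg, dec_deg)` tuple layout are assumptions.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; all inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def crossmatch(des, sdss, max_sep_arcsec=1.0):
    """Nearest-neighbour match of DES sources to SDSS sources.

    `des` and `sdss` are lists of (ra_deg, dec_deg). Returns a list of
    (des_index, sdss_index, separation_arcsec) for matches within the
    tolerance. The cost is O(N * M), so this is only for small catalogues.
    """
    matches = []
    for i, (ra, dec) in enumerate(des):
        best_j, best_sep = None, float("inf")
        for j, (ra_s, dec_s) in enumerate(sdss):
            sep = angular_separation_arcsec(ra, dec, ra_s, dec_s)
            if sep < best_sep:
                best_j, best_sep = j, sep
        if best_j is not None and best_sep <= max_sep_arcsec:
            matches.append((i, best_j, best_sep))
    return matches
```

For catalogues of realistic size, a tree-based nearest-neighbour search such as the one behind Astropy's `SkyCoord.match_to_catalog_sky` is the practical choice; the loop above only makes the matching criterion explicit.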
For training, we first extract the bottleneck features of our training set for one or two epochs and feed them into a few custom-defined fully connected layers added at the end of the pre-trained model (see Figure 6 in Appendix A). Then we progressively unfreeze the earlier layers of the whole network and fine tune their weights for a few epochs of training. The rationale behind this approach is that the earlier layers of a trained network are ...

We train the network using Tesla P100 GPUs on XSEDE (Bridges) [33]. The training process for the dataset of 36,500 images is completed within 3 hours. We use categorical cross entropy as the loss function together with the ADAM optimizer [34]. To avoid over-fitting, we monitor both training and validation losses, add a dropout rate of 70% between our fully connected layers, and also use early stopping, i.e., we stop training once the validation loss stops decreasing. Additionally, we use data augmentation and a learning rate scheduler, i.e., we reduce the learning rate when the training loss stops decreasing, to do a more fine-grained search of the loss function's minima. For data augmentation we use random flips, rotations, zooms and shifts (see Figure 7 in Appendix B). After training, all the weights are frozen and saved, and inference on about 10,000 test images is completed within 10 minutes using a single Tesla P100 GPU.
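The two loss-monitoring rules described above (early stopping on the validation loss, and learning rate reduction when the training loss plateaus) can be made concrete with a framework-independent sketch. It replays a recorded sequence of per-epoch losses rather than training a model. The patience values, the reduction factor, the initial learning rate and the function name are all assumptions chosen for illustration; the text does not state the values used.

```python
def replay_training(train_losses, val_losses, lr=1e-3,
                    lr_factor=0.5, lr_patience=2, stop_patience=3):
    """Apply early stopping and plateau-based learning rate reduction.

    Returns (epochs_run, final_learning_rate). All default hyperparameters
    are illustrative assumptions.
    """
    best_train = best_val = float("inf")
    train_wait = val_wait = 0
    epochs_run = 0
    for train_loss, val_loss in zip(train_losses, val_losses):
        epochs_run += 1
        # Learning rate schedule: shrink lr when training loss stops improving.
        if train_loss < best_train:
            best_train, train_wait = train_loss, 0
        else:
            train_wait += 1
            if train_wait >= lr_patience:
                lr *= lr_factor
                train_wait = 0
        # Early stopping: halt when validation loss stops improving.
        if val_loss < best_val:
            best_val, val_wait = val_loss, 0
        else:
            val_wait += 1
            if val_wait >= stop_patience:
                break
    return epochs_run, lr
```

Keras offers comparable callbacks (`EarlyStopping` and `ReduceLROnPlateau`); whether those were the ones used here is not stated in the text.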
The last layer of the network has two softmax nodes, which provide the output probability that the input image belongs to each class. While these probabilities can be directly tested for crossmatched DES sets by comparing to the SDSS-Galaxy Zoo probabilities, for the rest of the unlabelled DES images this is not possible. Given that for large-scale galaxy catalogs it would be infeasible to inspect individual images to determine what class they belong to, we propose to use neural networks as feature extractors. In practice, we can use the nodes of the second-to-last layer of the neural network to determine what combination of nodes is activated for each galaxy type. In this approach, the activation vectors of this layer would form two distinct clusters, one for each galaxy type, in a 1024-D space.
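As a reminder of how a two-node softmax layer turns raw scores into class probabilities, consider the following minimal sketch. The class ordering in `CLASSES` and the function names are assumptions for illustration only.

```python
import math

# The ordering of the two output nodes is an assumption of this sketch.
CLASSES = ("spiral", "elliptical")


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (class_name, probability) for the most probable class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]
```

The probability returned by `predict` is the quantity that can be used as a confidence score when ranking predictions.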
In order to visualize these 1024-D clusters, we embed them into a 3-D parameter space using the sklearn library implementation of t-Distributed Stochastic Neighbor Embedding (t-SNE) [35]. For HP SDSS and HP DES test sets, we label the points using the ground-truth label of each galaxy, and find that the points neatly cluster into two groups with accuracies ≳ 99%.
For unlabelled DES sets, we find again that two distinct clusters are formed. Based on the accuracy of the test set, we heuristically know that these clusters have accuracies ≳ 96% for the top-half most confident predictions. One can then pick the high confidence predictions from each cluster, and assign them the corresponding galaxy label, thereby creating newly labelled DES galaxy datasets.

FIG. 3: Confusion matrices that indicate the accuracy of our neural network model for classification tasks on various test sets. Note that in all cases our deep transfer learning model reports accuracies ≳ 96%.

Having quantified the accuracy of our neural network model on a DES test set that overlaps the SDSS footprint, we now use our model as a feature extractor by feeding it bright, unlabelled DES galaxies that do not overlap the SDSS footprint. We use our neural network model to quantify the probability that these images represent either spiral or elliptical galaxies. A random sample of high confidence predictions is shown in Figure 8 in Appendix C. We test the robustness of these predictions by clustering all these unlabelled galaxies in a 1024-D parameter space, using the information that the neural network model extracts from their morphology in 3 different bands, and visualizing the result in 3-D using t-SNE, as shown in Figure 4. The results presented in Figure 4 indicate that the neural network model has extracted the necessary information from the training dataset to enable t-SNE to clearly identify two distinct classes of galaxies. A scientific visualization of this clustering algorithm for the FO DES test set is presented in [36].
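The idea of picking the most confident predictions from each class, so that they can serve as newly labelled data, can be sketched as follows. The input layout `(galaxy_id, predicted_label, confidence)` and the function name are hypothetical, and the confidence is assumed to be the softmax probability of the predicted class.

```python
def top_confident_per_class(predictions, k):
    """Select the k most confident predictions for each predicted class.

    `predictions` is a list of (galaxy_id, predicted_label, confidence).
    Returns a dict mapping each label to a list of galaxy ids, ordered
    from most to least confident.
    """
    by_class = {}
    for galaxy_id, label, confidence in predictions:
        by_class.setdefault(label, []).append((confidence, galaxy_id))
    selected = {}
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        selected[label] = [galaxy_id for _, galaxy_id in items[:k]]
    return selected
```

With `k=1000` this would correspond to keeping the top 1000 galaxies of each class, which is the kind of selection that feeds a further round of training.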
Recursive training
Having labelled about 10,000 DES galaxies through unsupervised clustering, we pick the top 1000 spiral and top 1000 elliptical galaxies. We then add them to our original SDSS training dataset, and use deep transfer learning again to re-train the neural network model. The top- and bottom-left panels in Figure 5 show the initial (0th recursion) accuracy of our classifier, and the accuracy attained once the newly labelled DES images are added to the SDSS training dataset (1st recursion). We notice that the classification accuracy for the FO SDSS and DES test sets improves by up to 2%. In particular, we notice that the classification accuracy for the FO DES test set is now boosted up to 98% when 50% of the dataset is considered. This is rather significant given that this newly labelled DES dataset represents ∼ 5% of the original SDSS training dataset.

The F1 score shown in the top- and bottom-right panels of Figure 5 is a single-number statistical evaluation metric that measures the accuracy of binary classification by taking the harmonic mean of precision and recall. It varies between its worst performance value of 0 and best performance value of 1, and is given by

$$F_1 = 2 \, \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

For binary classification, precision is the number of true positives divided by the total number of predicted positives, i.e., true positives plus false positives. Similarly, recall is the number of true positives divided by the total number of actual positives, i.e., true positives plus false negatives. We noticed that the F1 score also improves when we add new DES images to the SDSS training dataset.
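For concreteness, precision, recall and the F1 score, i.e., twice the product of precision and recall divided by their sum, can be computed from the entries of a binary confusion matrix as in the sketch below. The function name and the convention of returning 0 when a denominator vanishes are assumptions of this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 90 true positives, 10 false positives and 30 false negatives give a precision of 0.9, a recall of 0.75 and an F1 score of 9/11 ≈ 0.818.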
This novel approach provides us with the means to enhance SDSS galaxy classification, as shown in the top left panel of Figure 5. More importantly, it provides a way forward to gradually replace SDSS galaxy images in the training dataset that we need to construct DES galaxy catalogs at scale. A DES-only image training dataset will better capture the nature of images observed by DES, and would also enable us to better use data augmentations to model the effects of noise, making our neural network model more robust when classifying galaxies at higher redshift, or galaxies that are contaminated by various sources of noise.

IV. CONCLUSION
We have presented the first application of deep transfer learning for the classification of DES galaxies that overlap the footprint of the SDSS survey, achieving state-of-the-art accuracies ≳ 99.6%. We have also introduced the use of unsupervised clustering for the classification of DES galaxies that had not been observed in previous surveys, and had thereby remained unlabelled.
We have demonstrated that unsupervised clustering provides a meaningful classification of DES galaxies using as raw information morphological features abstracted from DES images in three different filters. To get insights into the inner workings of our clustering algorithm, we have presented a scientific visualization of the clustering of the FO DES test set, which is available at [36]. Through this visualization we have found that seemingly incorrect labels provided by our neural network model are often actually correct. This issue appears to stem from inaccurate human classifications in our SDSS training dataset.
Finally, we have shown that newly labelled DES datasets can be used to do recursive training, providing the means to gradually replace SDSS images we have used in our training dataset. This method will enable the creation of DES-only image datasets to train, validate and test neural network models for the creation of large-scale DES galaxy catalogs, which are needed for immediate gravitational wave standard-siren measurements of the Hubble constant, and will provide input data to create galaxy catalogs in the Large Synoptic Survey Telescope era. The scalability of this algorithm, and the modest computational resources required for these analyses, make it well suited to future analyses of this nature.

Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Científico e Tecnológico and the Ministério da Ciência, Tecnologia e Inovação, the Deutsche Forschungsgemeinschaft and the Collaborating Institutions in the Dark Energy Survey.
The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas-Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenössische Technische Hochschule (ETH) Zürich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciències de l'Espai (IEEC/CSIC), the Institut de Física d'Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig-Maximilians Universität München and the associated Excellence Cluster Universe, the University of Michigan, the National Optical Astronomy Observatory, the University of Nottingham, The Ohio State University, the OzDES Membership Consortium, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory,

FIG. 1: Violin plots of Galaxy Zoo probability distributions for galaxies in each dataset.

FIG. 2: Top panels: labelled images of the SDSS training set. Bottom panels: sample of galaxies from SDSS-DR7 and the corresponding crossmatched galaxies from DES DR1.

FIG. 4: t-SNE visualization of the clustering of HP SDSS and DES test sets, and of the unlabelled DES set.

TABLE I: Summary of each dataset.
As shown in Table II, our Xception deep transfer learning neural network model attains accuracies ≳ 99% for the HP SDSS and DES test sets. Sorting the FO test sets by the highest output probabilities, our neural network model reaches accuracies ≳ 96% using 50% of the dataset. The confusion matrices are shown in Figure 3.
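The evaluation in which a test set is sorted by output probability and the accuracy is computed on the most confident fraction can be sketched as follows. The tuple layout `(confidence, predicted_label, true_label)`, the function name and the rounding rule for the number of retained galaxies are assumptions of this illustration.

```python
def accuracy_on_top_fraction(predictions, fraction):
    """Accuracy over the most confident `fraction` of the predictions.

    `predictions` is a non-empty list of
    (confidence, predicted_label, true_label) tuples.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    n_keep = max(1, int(round(fraction * len(ranked))))
    kept = ranked[:n_keep]
    correct = sum(1 for _, predicted, true in kept if predicted == true)
    return correct / n_keep
```

Calling this function with `fraction=0.5` corresponds to the "50% of the dataset" setting quoted above, while `fraction=1.0` gives the plain accuracy over the whole test set.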

TABLE II: Classification accuracy for each test dataset.