Galaxy classification: A machine learning analysis of GAMA catalogue data

We present a machine learning analysis of five labelled galaxy catalogues from the Galaxy And Mass Assembly (GAMA) survey: the SersicCatVIKING and SersicCatUKIDSS catalogues containing morphological features, the GaussFitSimple catalogue containing spectroscopic features, the MagPhys catalogue including physical parameters for galaxies, and the Lambdar catalogue, which contains photometric measurements. Extending work previously presented at the ESANN 2018 conference, we perform an analysis based on Generalized Relevance Matrix Learning Vector Quantization and Random Forests. We find that neither the data from the individual catalogues nor a combined dataset based on all five catalogues fully supports the visual-inspection-based galaxy classification scheme employed to categorise the galaxies. In particular, only one class, the Little Blue Spheroids, is consistently separable from the other classes. To aid further insight into the nature of the employed visual-based classification scheme with respect to physical and morphological features, we present the galaxy parameters that are discriminative for the achieved class distinctions.


Introduction
Telescope images of galaxies reveal a multitude of appearances, ranging from smooth elliptical galaxies, through disk-like galaxies with spiral arms, to more irregular shapes. The study of morphological galaxy classification plays an important role in astronomy: the frequency and spatial distribution of galaxy types provide valuable information for the understanding of galaxy formation and evolution [1,2].
The assignment of morphological classes to observed galaxies is a task which is commonly handled by astronomers. As manual labelling of galaxies is time consuming and expert-devised classification schemes may be subject to cognitive biases, machine learning techniques have great potential to advance astronomy by 1) investigating automatic classification strategies, and 2) evaluating to what extent existing classification schemes are supported by the observational data.
In this work, we extend a previous analysis [3] to make a contribution along both lines by analysing several galaxy catalogues which have been annotated using a recent classification scheme proposed by Kelvin et al. [4]. In our previous study, we assessed whether this scheme is consistent with a galaxy catalogue containing 42 astronomical parameters from the Galaxy And Mass Assembly (GAMA, [5]) by performing both an unsupervised and a supervised analysis with prototype-based methods. We assessed whether class structure can be recovered by a clustering of the data generated by the unsupervised Self-Organizing Map (SOM) [6], and investigated whether the morphological classification can be reproduced by Generalized Relevance Matrix Learning Vector Quantization (GMLVQ) [7], a powerful supervised prototype-based method [8] chosen for its capability to provide not only classification boundaries and class-representative prototypes, but also feature relevances. Finding consistently negative results for the supervised and unsupervised methods, namely an intermediate classification accuracy of GMLVQ of around 73% and no clear-cut agreement between galaxy classes and SOM-clustering results, we concluded that the classification scheme is not fully supported by the considered galaxy catalogue. As discussed previously [3], the hypothesised misalignment between galaxy data and classification scheme could be explained by a lack of discriminative power of the employed classifiers or clustering methods, by mis-labellings of certain galaxies (a possibility already discussed in [9]), or by the absence of essential parameters in the data set. In this work, we address two of the mentioned aspects: we employ an additional established and flexible classifier, Random Forests [10], to collect evidence that the previously found moderate classification performance is not due to shortcomings of GMLVQ.
Furthermore, we address the potential incompleteness of the previously analysed dataset by performing another set of supervised analyses on several additional galaxy catalogues from the GAMA survey [11], which contain a multitude of additional photometric, spectroscopic and morphological measurements.
Despite the commonly quoted abundance of data in astronomy, well-accepted benchmark datasets are not readily available in the field of galaxy classification, and only a few works analysing GAMA catalogues with machine learning methods exist. In an analysis by Sreejith et al. [9], 10 [12]).
In agreement with our previous results and the analyses from the above-mentioned literature, we find that the employed classification scheme is not fully supported even when considering the additional catalogues and an alternative classifier. Interestingly, analogous to our previous work [3], the Little Blue Spheroids, a galaxy class newly introduced in [4], remain the most clearly pronounced class for the set of catalogues analysed in this work as well. We present the parameters that are the most relevant for the achieved class distinctions.
The paper is organised as follows: In Section 2 the analysed galaxy catalogues and their preprocessing are described. Section 3 outlines the employed classification methods, GMLVQ and Random Forests. Section 4 describes experimental setups and results. The work closes with a discussion in Section 5.
This paper constitutes an extension of our contribution to the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) 2018 [3]. Parts of the text have been taken over literally without explicit notice. This concerns, among others, parts of the introduction and the description of GMLVQ in Section 3.

Data
In this work we analyse data from five galaxy catalogues (Table 1) containing features which have been derived from spectroscopic and photometric observations, i.e. measurements of flux intensities in different wavelength bands from the Galaxy And Mass Assembly (GAMA) survey [11] for a sample of 1295 galaxies. As the catalogues contain information for different sets of galaxies, our data set consists of the set of galaxies for which a full set of features is available after balancing the relevant classes (cf. Section 2.6).
To determine this set, each catalogue is first cross-referenced with the galaxy sample analysed in our ESANN contribution [9,3]  dimension) and then discarding samples which contain missing values in any of the remaining feature dimensions.
Details of each catalogue as well as specific processing steps are delineated in the following paragraphs.

GaussFitSimple
The GaussFitSimple catalogue (GFS) [13]  In addition to these parameters the catalogue also contains meta-information concerning model fits and corresponding errors.
From the GaussFitSimple catalogue we select amplitudes (AMP_*) and sigma (SIG_*) of the Gaussian fit for each emission line, as well as calculated fluxes (*_FLUX) and equivalent widths (*_EW). Here and in the following, the asterisk * is a placeholder for the name of the corresponding emission line. We further include information about the continuum (CONT, GRAD) and the strength of the D4000 break, resulting in 59 selected features. We discard all samples for which a failure of the fitting procedure has been indicated (FITFAIL_*), and remove samples containing missing values in any of the feature dimensions.
The resulting sub-catalogue then contains 7430 galaxies with 59 emission line features.
We note that the classification performance on the full catalogue, which contains model fit information and errors / measurement uncertainties is comparable to the results achieved with the reduced catalogue containing 59 features (cf. Section 4). As the selected parameters allow for a more direct interpretation in terms of emission line strengths and therefore facilitate interpretation from the astronomical perspective, we consider the reduced catalogue in the following.

Lambdar
The Lambdar catalogue [14] contains flux measurements and uncertainties for 21 bands, as measured by the LAMBDAR software [14]. When cross-referencing with the catalogue analysed in our preceding study, 400 galaxies are missing from the Lambdar catalogue. These galaxies are removed from the

MagPhys
The MagPhys catalogue [18] contains physical parameters comprising information about stellar populations as well as parameters describing the interstellar medium in the galaxies. For each included galaxy, the parameters include, among others, star formation rates, star formation time-scales, information about star formation bursts, the masses of stars formed in the bursts, overall stellar ages and masses, metallicities, and information about dust in the interstellar medium and in stellar birth clouds. All MagPhys parameters have been derived from information provided in the Lambdar catalogue (Section 2.2) using the MAGPHYS program [18]. Due to missing values in the Lambdar catalogue, the MagPhys catalogue does not contain information for 400 of the galaxies analysed in our ESANN contribution [3]. Apart from these, there are no missing values, so that information from 177 MagPhys features is available for 7541 galaxies. However, after selecting the final sample (cf. Section 2.6) some parameters exhibit almost no variance over the considered samples: the six parameters fb17_percentile2_5, fb18_percentile2_5, fb17_percentile16, fb17_percentile50, fb17_percentile84 and fb18_percentile16 are largely constant, with at most 15 data points displaying deviations. We therefore remove these features, which results in a dimensionality of 171 for the final MagPhys sample.
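The removal of near-constant parameters described above can be sketched as follows. This is only an illustration, not the preprocessing code used for the paper: the function name `drop_near_constant`, the argument `max_deviating`, the toy column names, and the reading of the criterion as "at most 15 samples differ from the most frequent value" are all assumptions.

```python
import numpy as np

def drop_near_constant(X, names, max_deviating=15):
    """Drop columns in which at most `max_deviating` samples differ
    from the column's most frequent value (assumed criterion)."""
    keep = []
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        n_deviating = X.shape[0] - counts.max()
        keep.append(n_deviating > max_deviating)
    keep = np.array(keep, dtype=bool)
    return X[:, keep], [name for name, k in zip(names, keep) if k]

# Toy data (not GAMA): one informative column and one near-constant column.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=100), np.zeros(100)])
X[:5, 1] = 1.0  # only 5 samples deviate in the second column
X_reduced, kept = drop_near_constant(X, ["informative", "near_constant"])
print(kept, X_reduced.shape)
```

Applied to a 177-dimensional feature matrix in which six columns meet the criterion, such a filter would leave 171 columns, in line with the dimensionality quoted above.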
Information on the MagPhys parameter shorthand notation used in the remainder can be found in [19].

Sérsic Catalogues
Three different catalogues are available which contain parameters of single-Sérsic-component fits to the 2D surface brightness distribution of galaxies in different bands [20]. The single-Sérsic-component fits have been produced with the GALFIT program [21]. The catalogues contain a parameter, GALPLAN *, and GALR90 *, the radius containing 90% of total light, measured along the semi-major axis of the galaxy.

SersicCatSDSS
For the SersicCatSDSS catalogue [20], most samples from the cross-referenced catalogue [3,9] are discarded based on the PSFNUM and GALPLAN selection, and only 1672 samples remain. The SersicCatSDSS catalogue is therefore excluded from the analysis.

Classification Scheme
For each galaxy analysed in our ESANN contribution [3], a class label has been determined by astronomers following a visual-inspection-based classification scheme described by Kelvin et al. [4]. The scheme assigns galaxies to 9 classes: Ellipticals, Little Blue Spheroids, Early-type spirals, Early-type barred spirals, Intermediate-type spirals, Intermediate-type barred spirals, Late-type spirals & Irregulars, Artefacts and Stars (Table 2).
Table 2: Overview of galaxy classes in the dataset used to cross-reference the catalogues analysed in this paper. Shown are also the corresponding Hubble types, an established galaxy type descriptor in astronomy, and the class index that is used to identify classes in the remainder of the work. Gray highlights indicate the classes that are part of the final classification problems.
As barred spirals, artefacts and stars are highly under-represented in this sample, our subsequent analysis will focus on the substantial classes, namely classes 1, 2, 3, 5 and 7.

Sample selection
To ensure a fair comparison between the catalogues, our final dataset comprises the subsample of galaxies for which a full set of measurements is available, i.e. galaxies for which measurements are provided in each of the five considered catalogues. This is the case for 2117 galaxies. We then consider only the substantial classes 1, 2, 3, 5 and 7 and balance them so that the same number of samples is selected for each class. This number is 259, the size of class 2, the class with minimum cardinality, which results in a final sample of 5 × 259 = 1295 galaxies.
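The balancing step can be illustrated with a short sketch. It is not the authors' selection code: the function name `balance_classes`, the random seed and the toy class sizes are assumptions, and only the logic (restrict to the chosen classes, then draw as many samples per class as the smallest class contains) follows the description above.

```python
import random
from collections import defaultdict

def balance_classes(labels, classes, seed=0):
    """Return sorted sample indices with equally many samples per class,
    together with the per-class sample count."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if label in classes:
            by_class[label].append(idx)
    # The smallest class determines how many samples are drawn per class.
    n_per_class = min(len(by_class[c]) for c in classes)
    rng = random.Random(seed)
    selected = []
    for c in classes:
        selected.extend(rng.sample(by_class[c], n_per_class))
    return sorted(selected), n_per_class

# Toy labels; the class sizes are illustrative and not the GAMA counts.
labels = [1] * 40 + [2] * 10 + [3] * 25 + [4] * 3 + [5] * 30 + [7] * 12
indices, n = balance_classes(labels, classes=[1, 2, 3, 5, 7])
print(n, len(indices))
```

With the class sizes of the actual sample, the same procedure would draw 259 galaxies from each of the five classes.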

GMLVQ
Generalized Relevance Matrix LVQ (GMLVQ) [7,8] is an extension of Learning Vector Quantization (LVQ) [23]. LVQ is a supervised prototype-based method, in which prototypes are annotated with a class label. The prototypes are adapted based on the label information of the training data: if the best-matching unit (BMU), the prototype closest to the data point, is of the same class as a given data point, the prototype is moved towards the data point, while in the case of a BMU with an incorrect class label, the prototype is repelled. While LVQ assesses similarities between prototypes and data points using the Euclidean distance, GMLVQ learns a distance measure that is tailored to the data, allowing it to suppress noisy feature dimensions or to emphasise distinctive features and their pair-wise combinations. GMLVQ therefore considers a generalized distance
\[ d^{\Lambda}(w, \xi) = (\xi - w)^{\top} \Lambda \, (\xi - w), \]
where $\Lambda$ is an $n \times n$ positive semi-definite matrix, $\xi \in \mathbb{R}^n$ represents a feature vector and $w \in \mathbb{R}^n$ is one of $M$ prototypes. After optimisation, the diagonal of $\Lambda$ will encode the learned relevance of the feature dimensions, while the off-diagonal elements encode the relevances of pair-wise feature combinations. As empirically observed and theoretically studied [24,25], the relevance matrix after training is typically low rank and can be used, for instance, for visualisation of the data set (see Appendix A for an example).
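To make the role of the relevance matrix concrete, the following sketch evaluates a generalized distance of the quadratic form d(w, ξ) = (ξ − w)ᵀ Λ (ξ − w) and uses it for nearest-prototype classification. It assumes the parameterisation Λ = ΩᵀΩ, which is commonly used to keep Λ positive semi-definite [7]; the function names and the toy numbers are hypothetical and no training is performed.

```python
import numpy as np

def gmlvq_distance(xi, w, omega):
    """Generalized squared distance (xi - w)^T Lambda (xi - w),
    with Lambda = Omega^T Omega (assumed parameterisation)."""
    diff = omega @ (xi - w)
    return float(diff @ diff)

def nearest_prototype(xi, prototypes, proto_labels, omega):
    """Label of the prototype closest to xi under the generalized distance."""
    dists = [gmlvq_distance(xi, w, omega) for w in prototypes]
    return proto_labels[int(np.argmin(dists))]

# Toy setting: the second feature has zero relevance.
omega = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
Lambda = omega.T @ omega                  # diagonal holds the feature relevances
prototypes = np.array([[0.0, 0.0], [2.0, 10.0]])
proto_labels = [1, 2]
xi = np.array([1.5, 0.0])

predicted = nearest_prototype(xi, prototypes, proto_labels, omega)
euclidean = proto_labels[int(np.argmin(((prototypes - xi) ** 2).sum(axis=1)))]
print(predicted, euclidean, np.diag(Lambda))
```

In this toy case the Euclidean distance assigns the sample to the first prototype because of the large offset in the second feature, whereas the generalized distance ignores that feature and assigns it to the second prototype.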
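The parameters $\{w_i\}_{i=1}^{M}$ and $\Lambda$ are optimised based on a heuristic cost function, see [7],
\[ E_{\mathrm{GMLVQ}} = \sum_{i=1}^{P} \mu^{\Lambda}_i, \qquad \mu^{\Lambda}_i = \frac{d^{\Lambda}_J(\xi_i) - d^{\Lambda}_K(\xi_i)}{d^{\Lambda}_J(\xi_i) + d^{\Lambda}_K(\xi_i)}, \qquad (1) \]
where $P$ refers to the number of training samples, $d^{\Lambda}_J(\xi) = d^{\Lambda}(w_J, \xi)$ denotes the distance to the closest correctly labelled prototype $w_J$, and $d^{\Lambda}_K(\xi) = d^{\Lambda}(w_K, \xi)$ denotes the distance to the closest incorrect prototype $w_K$. If the closest prototype has an incorrect label, $d^{\Lambda}_K(\xi_i)$ will be smaller than $d^{\Lambda}_J(\xi_i)$, hence the corresponding $\mu^{\Lambda}_i$ is positive. Minimisation of $E_{\mathrm{GMLVQ}}$ will therefore favour the correctness of nearest prototype classification. In a stochastic gradient descent procedure based on a single example the update reads
\[ w_{J,K} \leftarrow w_{J,K} - \eta_w \frac{\partial \mu^{\Lambda}_i}{\partial w_{J,K}}, \qquad \Lambda \leftarrow \Lambda - \eta_{\Lambda} \frac{\partial \mu^{\Lambda}_i}{\partial \Lambda}, \qquad (2) \]
with learning rates $\eta_w$ and $\eta_{\Lambda}$. Derivations and full update rules can be found in [7]. In a batch gradient descent version [26], updates of the form (2) are summed over all training samples.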
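The cost function can be evaluated with a few lines of NumPy. The sketch below assumes the plain sum of the relative distance differences μ_i = (d_J − d_K) / (d_J + d_K), i.e. without an additional monotonic transfer function, and assumes that every class has at least one prototype. The function name `mu_values` and the toy data are illustrative; gradient updates are not implemented here.

```python
import numpy as np

def mu_values(X, y, prototypes, proto_labels, Lambda):
    """Relative distance differences mu_i = (d_J - d_K) / (d_J + d_K)."""
    proto_labels = np.asarray(proto_labels)
    diffs = X[:, None, :] - prototypes[None, :, :]           # shape (P, M, n)
    d = np.einsum("pmi,ij,pmj->pm", diffs, Lambda, diffs)    # shape (P, M)
    same = proto_labels[None, :] == np.asarray(y)[:, None]   # shape (P, M)
    d_J = np.where(same, d, np.inf).min(axis=1)   # closest correct prototype
    d_K = np.where(same, np.inf, d).min(axis=1)   # closest incorrect prototype
    return (d_J - d_K) / (d_J + d_K)

X = np.array([[0.5, 0.0], [3.5, 0.0], [3.0, 0.0]])
y = [1, 2, 1]                        # the third sample lies closer to the wrong prototype
prototypes = np.array([[0.0, 0.0], [4.0, 0.0]])
Lambda = np.eye(2) / 2.0             # isotropic relevance matrix with unit trace
mu = mu_values(X, y, prototypes, [1, 2], Lambda)
cost = mu.sum()
print(mu, cost)
```

Each μ_i lies between -1 and 1; it is negative for correctly classified samples and positive for the misclassified third sample, so lowering the sum pushes samples towards correct nearest-prototype classification.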

Random Forests
Random Forests (RF) [10] is a well-known classification and regression method that employs an ensemble of randomised Decision Trees [27]. In randomised Decision Trees, a subset of features is chosen randomly at each node.
Considering only the selected features, decision thresholds are determined based on the best attainable split between classes. To combine the classifications of each tree in the ensemble, i.e. to determine the output of the Random Forest, different methods can be employed. In the scikit-learn implementation used in our experiments [28,29] the final classification output is obtained by averaging the probabilistic prediction of each tree.
Details on the set-up of the experiments for RF as well as for GMLVQ can be found in Section 4.1.
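The averaging of per-tree probabilistic predictions can be checked directly in scikit-learn. The sketch below uses synthetic stand-in data generated with `make_classification` rather than GAMA features, and the choice of 50 trees is arbitrary; it compares a manual average over `forest.estimators_` with the output of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not GAMA): 5 classes, 20 features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand ...
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_],
                     axis=0)
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]

# ... and compare with the forest's own output.
assert np.allclose(tree_probs, forest.predict_proba(X))
assert np.array_equal(manual_pred, forest.predict(X))
print(tree_probs.shape)
```

The predicted class is the one with the highest averaged probability, which differs from a hard majority vote over the individual trees.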

Experiments
In our experiments, we assess relevances of features and discriminability between classes by training and evaluating GMLVQ for each of the five preprocessed catalogues described in Section 2. As found in previous work [3], class 2, the Little Blue Spheroids (LBS), was particularly well distinguishable. We therefore perform experiments for both the full 5-class problem, in which we try to distinguish between galaxy classes 1, 2, 3, 5 and 7 (cf. Table 2), and a 2-class problem in which the LBS are classified against galaxies from the other four classes. In addition to the single-catalogue experiments, we also assess feature relevances and discriminability between classes for a concatenation of all catalogues, to account for possible synergies between features from different catalogues.
To allow for interpretation in the light of other classifiers, we perform the same experiments with the widely used Random Forests (RF) classifier [10] as a baseline.

Setup
We train and evaluate GMLVQ on the galaxy catalogue data using a publicly available implementation [26]. As the GMLVQ cost function is implicitly biased towards classes with larger numbers of samples, we train and evaluate the classifier on size-balanced random subsets of the five classes. For our experiments, we specify one prototype per class and run the algorithm for 100 batch gradient steps with step size adaptation as realised in [26] with default parameter settings. We validate the algorithm by performing a class-balanced repeated random sub-sampling validation (see e.g. [30] for validation methods). The remaining settings and validation procedure remain identical to the 5-class problem.
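A class-balanced repeated random sub-sampling validation can be sketched as follows. The hold-out fraction of 10%, the number of 10 repetitions and the synthetic data are assumptions made for illustration only, and a simple nearest-class-mean classifier stands in for GMLVQ so that the example stays self-contained.

```python
import numpy as np

def balanced_split(y, test_fraction, rng):
    """Split indices so that every class contributes the same share to the test set."""
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_test = max(1, int(round(test_fraction * len(idx))))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

def nearest_mean_fit_predict(X_train, y_train, X_test):
    """Stand-in classifier (nearest class mean); replace with GMLVQ or RF."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# Synthetic, class-balanced toy data: 3 classes with 60 samples each.
y = np.repeat([0, 1, 2], 60)
X = rng.normal(size=(180, 4)) + 3.0 * y[:, None]

accuracies = []
for _ in range(10):  # the number of repetitions is arbitrary here
    train, test = balanced_split(y, test_fraction=0.1, rng=rng)
    y_pred = nearest_mean_fit_predict(X[train], y[train], X[test])
    accuracies.append(np.mean(y_pred == y[test]))

print(f"mean accuracy {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```

Reporting the mean and spread over the repetitions gives an impression of how strongly the result depends on the particular random split.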

Random Forests
We execute experiments employing Random Forests analogous to the GMLVQ experiments, i.e. the classifier is trained on class-balanced random subsets of the data and validated using repeated random sub-sampling validation. Experiments are performed using a publicly available scikit-learn implementation [28,29] with default settings.
samples are now classified as belonging to class 2, where this overlap was only 10% for the data analysed in our ESANN contribution [3]. This is also reflected in the 2-class problem when distinguishing the LBS from the other classes. Another notable increase in overlap is that between class 5 and 7, where the misclassification rate of class 5 galaxies as class 7 galaxies is increased from 8% to 18%.

Combined catalogues
Combining all catalogues would result in a very high-dimensional classification problem, thereby rendering the resulting relevance profiles difficult to inter-

Random Forests baseline results
The classification accuracies for Random Forests for the individual and combined catalogues are displayed in Figure 1c side-by-side with the GMLVQ results. For all catalogues, applying the Random Forest classifier results in comparable, though slightly better, classification accuracies.

Discussion & Conclusion
The results presented above suggest that there may be inconsistencies in the investigated morphological classification scheme: Analogous to our previous findings [3], it has proven difficult to distinguish galaxy types using two powerful and flexible classifiers, GMLVQ and Random Forests. In all GMLVQ analyses of the individual as well as of the combined catalogues, class 1 (Ellipticals) and 3 (Early-type spirals) are particularly difficult to differentiate. Class 7 (Late-type spirals & Irregulars) is frequently misclassified as class 5 (Intermediate-type spirals) and with a similar frequency as class 2 (LBS), while class 2 is consistently detected with the highest sensitivity among all classes.
The difficulty of training a successful classifier was also observed in [9], where class-wise averaged accuracies are around 75%. As mentioned in our earlier contribution [3], possible explanations for poor classification performance may be the lack of discriminative power of the employed classifiers or mis-labellings of certain galaxies [9]. A possible indication for the latter case may be that samples from class 7 (Late-type spirals & Irregulars) are often misclassified as class 5 (Intermediate-type spirals) and class 2 (LBS). This indicates that the feature representations of the galaxies in question share more properties with the named classes, and it is not unlikely that in the hand-labelling process an Intermediate-type spiral is occasionally misclassified as class 7 (e.g. confused with a Late-type spiral), or that an LBS is classified as class 7 (an Irregular). In the former case, i.e. a lack of discriminative power, employing even more flexible classifiers, e.g. GMLVQ with local relevance matrices [7], may improve classification performance. In the latter case, if mis-labellings are restricted to "neighbouring" classes in an assumed underlying class ordering (e.g. when considering class 5 adjacent to class 7, or class 1 (Ellipticals) as adjacent to class 3 (Early-type spirals)), ordinal classification may provide further insights [32,33].
Despite trying to address the issue of essential parameters being not contained in the dataset analysed in [3]  that classification performance is aided by these physical parameters as well.
Further insight into the role of features in the context of necessary and dispensable features may be obtained by studying feature relevance bounds along the lines of [35].

Conclusions.
We have presented an analysis of five galaxy catalogues using Random Forests and GMLVQ, a prototype-based classifier. Analogous to results obtained in preceding work on a lower-dimensional dataset, we conclude that even when considering a multitude of additional galaxy descriptors, the visual-based classification scheme used to label the galaxy sample remains not fully supported by the available data. Taking into account that perceptual and conceptual biases likely play non-negligible roles in the creation and application of galaxy classification schemes, further data-driven analyses might help provide novel insights regarding the true underlying grouping of galaxies.
Acknowledgements. GAMA is a joint European-Australasian project based around a spectroscopic campaign using the Anglo-Australian Telescope. The GAMA input catalogue is
The rightmost column of each figure contrasts the eigenvalue spectra of Λ and the data covariance matrix which forms the basis for PCA. While Λ is an n × n matrix, the steeply declining eigenvalue spectra for each dataset illustrate the low-dimensional subspace which GMLVQ operates in after learning [24,25]. In particular, for the 5-class problem, Λ spans an approximately 3-dimensional subspace, while for the 2-class problem the subspace is essentially one-dimensional. The low-rank relevance matrices can therefore be thought of as performing a GMLVQ-intrinsic dimensionality reduction.
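The eigenvalue spectrum of a relevance matrix and the resulting low-dimensional projection can be inspected as in the following sketch. The matrix used here is a randomly generated rank-3 example of the assumed form Λ = ΩᵀΩ with unit trace, not a matrix learned from the GAMA catalogues, and the tolerance used to count non-zero eigenvalues is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Hypothetical low-rank relevance matrix Lambda = Omega^T Omega with rank 3.
omega = rng.normal(size=(3, n))
Lambda = omega.T @ omega
Lambda /= np.trace(Lambda)                 # normalise so the relevances sum to 1

# Eigen-decomposition of the symmetric matrix, sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(Lambda)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
effective_rank = int(np.sum(eigvals > 1e-10))

# Project toy data onto the two leading eigenvectors for visualisation.
X = rng.normal(size=(50, n))
projection = X @ eigvecs[:, :2]
print(effective_rank, projection.shape)
```

A steeply declining spectrum, here with only three non-zero eigenvalues, means that distances are effectively computed in the subspace spanned by the leading eigenvectors.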