Systematic Labeling Bias in Galaxy Morphologies


Published 2018 November 28 © 2018. The American Astronomical Society. All rights reserved.
Citation: Guillermo Cabrera-Vives et al 2018 AJ 156 284. DOI: 10.3847/1538-3881/aae9f4


1538-3881/156/6/284

Abstract

We present a metric to quantify systematic labeling bias in galaxy morphology data sets stemming from the quality of the labeled data. This labeling bias is independent of labeling errors and requires knowledge about the intrinsic properties of the data with respect to the observed properties. We conduct a relative comparison of label bias for different low-redshift galaxy morphology data sets. We show our metric is able to recover previous de-biasing procedures that use redshift as the biasing parameter. By using the image resolution instead, we find biases that have not been addressed. We find that the morphologies based on supervised machine learning trained over features such as colors, shape, and concentration show significantly less bias than morphologies based on expert or citizen-science classifiers. This result holds even when there is underlying bias present in the training sets used in the supervised machine learning process. We use catalog simulations to validate our bias metric and show how to bin the multi-dimensional intrinsic and observed galaxy properties used in the bias quantification. Our approach is designed to work on any other labeled multi-dimensional data set, and the code is publicly available (https://github.com/guille-c/labeling_bias).


1. Introduction

For more than a century, astronomers have been working to understand galaxy properties and evolution from their morphology. The seminal example is the Hubble sequence (Hubble 1926), which first classified galaxies into ellipticals, spirals, barred spirals, and irregulars. Galaxy morphologies have been shown to correlate with other intrinsic properties such as color, brightness, maximum rotation velocity, and gas content (Dressler 1980). From these properties, it is possible to infer important physical properties such as stellar population fraction, surface star density, total mass, and gas-to-star conversion rates (see Odewahn et al. 2002, and references therein).

For some time, visual classifications played the dominant role in galaxy morphologies. Classifications have been done by expert astronomers (de Vaucouleurs et al. 1976, 1991; Bundy et al. 2005; Fukugita et al. 2007; Schawinski et al. 2007; Nair & Abraham 2010; Kartaltepe et al. 2015) as well as non-expert citizen scientists through crowdsourcing systems such as Galaxy Zoo (Lintott et al. 2008; Bamford et al. 2009; Lintott et al. 2011; Willett et al. 2013; Simmons et al. 2017; Willett et al. 2017). With the advent of large, high-quality survey data such as those from the Sloan Digital Sky Survey (York & SDSS Collaboration 2000) and CANDELS (Grogin 2011), we are beginning to see more galaxy data sets whose morphologies are classified by machine learning, using a variety of methods (e.g., Ball et al. 2004; Scarlata et al. 2007; Tasca et al. 2009; Gauci et al. 2010; Huertas-Company et al. 2011; Dieleman et al. 2015; Huertas-Company et al. 2015). Many of the current machine learning classification techniques fall into the category of supervised learning and thus require training data sets, usually based on visually classified morphologies. Examples of unsupervised machine learning classifications can be found in Naim et al. (1997), Edwards & Gaber (2013), Kramer et al. (2013), Shamir et al. (2013), and Schutter & Shamir (2015).

When visually classifying galaxies according to their morphologies, the resulting labels will be biased in terms of observable parameters. Low-resolution and dim galaxy images will be biased toward smoother types, because the human annotator in charge of labeling the images will not be able to see the fine structure of these objects. Bias in galaxy morphology catalogs has been extensively studied by the Galaxy Zoo team (Lintott et al. 2008). In Bamford et al. (2009) and Willett et al. (2013), a bias correction term was applied to morphology probabilities by assuming that the morphological fraction does not evolve over the redshift within bins of fixed galaxy physical size and luminosity. For Galaxy Zoo: Hubble morphologies (Willett et al. 2017) artificially redshifted images have been used to quantify this bias. A different way of addressing the problem is through a machine learning approach, simultaneously learning a classification model, estimating the intrinsic biases in the ground truth, and providing new de-biased labels (Cabrera et al. 2014; Bootkrajang 2016).

In this paper, we present a metric for measuring this labeling bias in morphological classification data sets, and we compare low-redshift morphological catalogs of spiral/elliptical galaxies from experts (Fukugita et al. 2007; Nair & Abraham 2010), non-experts (Lintott et al. 2011), and machine learning (Huertas-Company et al. 2011). We publicly release the code for measuring labeling bias and for simulating multi-dimensional labeling bias. This code can be used not only by the galaxy evolution community but also by anyone interested in measuring how biased their catalogs are in terms of observable parameters.

This paper is organized as follows. In Section 2, we develop a statistical measure of labeling bias based on the fraction of objects in terms of their intrinsic and observable properties. Our metric is based upon the assertion that the fractions of labels are fixed within bins on the intrinsic properties. We then quantify variations in labeled fractions from the estimated intrinsic fraction as a function of observed properties. In Section 3, we describe the data sets to be used and how we created simulated, biased galaxy morphology data sets. The bias-variance trade-off of our estimators has to be taken into account; this is explained in Section 4, where we also describe the methodology used to address this issue. In Section 5, we measure the biases for different data sets and show that even "expert labels" are often biased in terms of observed quantities like apparent size. In Section 6, we describe the main conclusions of this work.

2. Classification Bias

In real data it may be very hard to obtain the high-quality true classification labels ${y}_{i}$, which we will call the ground truth or gold standard. However, one can always make an estimate of this ground truth ${\hat{y}}_{i}$. In supervised and semi-supervised machine learning, this is usually accomplished through human annotators. In terms of galaxy morphology, the estimated labels stem from visual inspection of galaxy images. These visually determined morphologies are sometimes used directly in scientific analyses. Sometimes, they are used explicitly to train classification algorithms (e.g., supervised learning). Sometimes they are used implicitly to test such algorithms or used in conjunction with unlabeled data (e.g., semi-supervised learning). However, the galaxies are always convolved with the point-spread function (PSF) of the telescope, which makes it difficult to visually (or even computationally) resolve the spiral features in small and faint galaxies. For galaxies, this means that in the estimated labels, spirals can be misclassified as ellipticals. This labeling bias is more important when the PSF is close to the angular size of the galaxies, particularly for classifications based on ground-based imaging, such as Galaxy Zoo. As noted in Bamford et al. (2009) and Cabrera et al. (2014), this bias is neither statistical nor inherent to the visual classifiers, but a direct consequence of the quality of the data.

There are many steps that go into the visual classification of the morphologies of galaxies. While we expect classifiers to notice that the light profile is steeper for ellipticals than for disk-like spirals, classifiers also use color and spatial feature identification during their classification process. Depending on the filters used or the resolution of the galaxy image, it is possible to confuse one type of morphology with another. Worse, these mislabelings can be consistent among different human classifiers, leading to a high degree of statistical confidence in the wrong classification label.

In Figure 1 (right panel), we show a spiral galaxy that was classified as elliptical with high confidence in the Galaxy Zoo sample, which is based on ground-based imaging from the Sloan Digital Sky Survey DR7 (Abazajian et al. 2009). On the left, we show a higher-resolution view of this same galaxy from the Hubble Space Telescope, in which one can clearly identify spiral arms. In this example, the spiral arms are washed out by the convolution of the ground-based PSF, rendering their structure undetectable to the human classifiers. It is the projected intrinsic physical scale of the underlying features relative to the PSF that drives the misclassifications.


Figure 1. Spiral galaxy biased classification. Left: a spiral galaxy with good resolution taken from low earth orbit. Notice the spiral arms. Right: the same spiral galaxy, except at worse resolution and taken from the ground through the Earth's atmosphere. Notice that the arms are no longer discernible. This spiral galaxy was classified by >95% of annotators as being an elliptical, even though higher-quality data prove that it is a spiral.


In order to measure the amount of bias in different labeled data sets, we follow Bamford et al. (2009) and use the fractions of objects of each class as a function of the observable parameters that may bias our labels. An example of such parameters for galaxy morphologies is the resolution of the galaxies: well-resolved galaxies are unlikely to be mislabeled, while poorly resolved galaxies are more likely to be. We would expect the fractions of an unbiased data set not to depend on these observable parameters. At the same time, we should also consider the intrinsic parameters on which the real fractions of labels depend.

Consider a set of intrinsic properties (e.g., physical size, luminosity, or redshift) ${\boldsymbol{\beta }}=\{{\beta }_{1},\cdots ,{\beta }_{{n}_{\beta }}\}$ on which we define ${N}_{{ \mathcal B }}$ multi-dimensional bins ${{ \mathcal B }}_{q}$. Given a set of K labels (e.g., K = 2 for spirals and ellipticals), in each bin ${{ \mathcal B }}_{q}$, we calculate the intrinsic class fraction of objects with each label as ${f}_{k,q}$. For typical galaxy morphology data sets, we define ${{\boldsymbol{\beta }}}_{i}\,=({R}_{i},{M}_{i},{z}_{i})$, where ${R}_{i}$ is the physical radius (in kpc), ${M}_{i}$ is the absolute magnitude, and ${z}_{i}$ is the redshift for object i. In other words, given a fixed bin q in galaxy physical size, luminosity, and redshift, ${f}_{k=\mathrm{spiral},q}$ defines the intrinsic fraction of spirals compared to the total number of galaxies in bin q.

We then consider the set of observed properties of the objects (e.g., angular size). We define the set of properties ${\boldsymbol{\alpha }}={\{{\alpha }_{j}\}}_{j=1}^{{n}_{\alpha }}$ and, within each multi-dimensional bin ${{ \mathcal B }}_{q}$, create one-dimensional bins ${{ \mathcal A }}_{j,l,q}$ on each observed property. Here j defines which property and l defines the range of the bin for that property. For typical galaxy morphological data sets, we define ${{\boldsymbol{\alpha }}}_{i}=({r}_{i}/{\mathrm{PSF}}_{i})$, where ${r}_{i}$ is the angular size and ${\mathrm{PSF}}_{i}$ is the estimated size of the PSF at the galaxy location in the same units as its angular size.

Note the intrinsic properties are treated in multi-dimensional bins, ${{ \mathcal B }}_{q}$, whereas within each of those bins, the observed properties are treated in individual bins, ${{ \mathcal A }}_{j,l,q}$. This is because our aim is to study the biases with respect to their observed individual properties and so we require at least two bins ($l\in \{1,2\}$) for each observational property. Figure 2 shows a diagram explaining binning in the intrinsic and observable parameters. We start by defining bins ${{ \mathcal B }}_{q}$ in the intrinsic parameters using a kd-tree (see Section 4). For each of these multi-dimensional bins, we bin again in terms of the observable parameters and calculate the fraction of objects in each of these bins for every class.


Figure 2. Binning on the intrinsic and observed object properties. Top left: we first create multi-dimensional bins, ${{ \mathcal B }}_{q}$, based on the intrinsic properties, such as absolute magnitude (M), physical size (R), and redshift (z). Bottom left: these bin edges are defined using a kd-tree, as explained in Section 4. Right: for the subset of galaxies within each intrinsic bin ${{ \mathcal B }}_{q}$, we measure the fraction of labeled objects fj,l,q,k for each class k as a function of each observable αj. We define equal-sized one-dimensional (1D) bins on the observed properties ${{ \mathcal A }}_{j,l,q}$, where l runs over these bins and we require at least two bins. For an unbiased data set, the fractions within each bin ${{ \mathcal B }}_{q}$ should not change in terms of the observables. When labeling bias is present, the fractions of objects labeled by humans will depend on the observable parameters. We calculate the deviation of fractions from the intrinsic class fraction (Equation (2)) and then define the total labeling bias by summing over all of the properties as in Equation (3).


We then calculate the observed class fraction

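$${f}_{j,l,q,k}=\frac{1}{{N}_{{{ \mathcal A }}_{j,l,q}}}\sum _{i\in {{ \mathcal A }}_{j,l,q}\cap {{ \mathcal B }}_{q}}{\delta }_{{\hat{y}}_{i},k}\qquad (1)$$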

where ${N}_{{{ \mathcal A }}_{j,l,q}}$ is the total number of objects with the observed property ${\alpha }_{j}$ in bin ${{ \mathcal A }}_{j,l,q}$, and ${\delta }_{{\hat{y}}_{i},k}$ is the Kronecker delta, equal to 1 when the estimated classification ${\hat{y}}_{i}$ of galaxy i is class k and 0 otherwise. The right-hand side sums over all galaxies that are simultaneously in the observed single-property bin ${{ \mathcal A }}_{j,l,q}$ and the intrinsic-property multi-dimensional bin ${{ \mathcal B }}_{q}$.

For a given classification k and intrinsic property bin ${{ \mathcal B }}_{q}$, we calculate the Euclidean (2-norm) difference between the observed class fraction ${f}_{j,l,q,k}$ and the intrinsic class fraction ${f}_{k,q}$ and sum over all the ${N}_{{{ \mathcal A }}_{j,q}}$ bins ${{ \mathcal A }}_{j,l,q}$ for the observed property ${\alpha }_{j}$:

Equation (2)

Equation (2) should be ∼0 for large N and when there is no difference between the intrinsic and observed class fractions, i.e., when the classifications are unbiased with respect to an observable.

We can extend this to all classes and intrinsic and observed properties as

Equation (3)

where K is the number of classes (two for the case of ellipticals versus spirals). We term Equation (3) the classification bias; it quantifies the difference between the observational class fractions and the intrinsic class fractions.
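The displayed forms of Equations (2) and (3) do not appear in this text. As a reading aid, one form consistent with the verbal definitions above is sketched below. The averaging normalizations (the factors $1/{N}_{{\mathcal A}_{j,q}}$ and $1/(K\,{N}_{\mathcal B}\,{n}_{\alpha})$) and the outer square root are assumptions, not the published expressions.

```latex
\sigma_{j,k,q}^{2} \;=\; \frac{1}{N_{\mathcal{A}_{j,q}}}
  \sum_{l=1}^{N_{\mathcal{A}_{j,q}}} \left( f_{j,l,q,k} - f_{k,q} \right)^{2},
\qquad
L \;=\; \sqrt{ \frac{1}{K \, N_{\mathcal{B}} \, n_{\alpha}}
  \sum_{k=1}^{K} \sum_{q=1}^{N_{\mathcal{B}}} \sum_{j=1}^{n_{\alpha}}
  \sigma_{j,k,q}^{2} }
```

Under this reading, L is a root-mean-square deviation of the observed fractions from the intrinsic ones, so it vanishes when the labeled fractions do not depend on any observable.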

We note that the intrinsic class fraction ${f}_{k,q}$ can vary for any data set. For instance, a data set designed to represent ellipticals might have an inherently lower spiral fraction than a broader morphological catalog containing spirals, ellipticals, and irregulars. Alternatively, one might be interested in comparing classification algorithms over a wide range of classes and data sets. If so, care has to be taken that the distributions of the intrinsic parameters are similar, so that selection effects do not influence ${f}_{k,q}$ and, in turn, the value of L.

One would hope that the fraction of labels within bins of intrinsic properties, ${f}_{k,q}$, could in principle be measured using an unbiased ("gold standard") data set or perhaps a subset of the data itself. It is also possible that ${f}_{k,q}$ could be predicted from theory (e.g., Genel et al. 2014). Here, we take a conservative approach and assume that all observed morphological data sets have some level of bias. We make an estimate ${\hat{f}}_{k,q}$ by using the observed class fraction ${f}_{j,l,q,k}$ for the bin l in observed property j that is likely to have the least bias. For example, if we are calculating ${\sigma }_{j,k,q}$ for ${\alpha }_{j}=r/{\sigma }_{\mathrm{PSF}}$, then we calculate ${\hat{f}}_{k,q}$ for the bin that includes the largest values of $r/{\sigma }_{\mathrm{PSF}}$, since it should contain the least-biased classifications.
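The estimation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the released code: it assumes a single observable (so ${n}_{\alpha}=1$), observable bins ordered by increasing $r/{\sigma}_{\mathrm{PSF}}$ so that the last bin supplies ${\hat{f}}_{k,q}$, and a mean-square normalization for the deviations. The function names `class_fractions` and `labeling_bias` are hypothetical.

```python
from math import sqrt


def class_fractions(labels, classes):
    """Fraction of each class among the labels of one bin (Equation (1))."""
    n = len(labels)
    return {k: sum(1 for y in labels if y == k) / n for k in classes}


def labeling_bias(intrinsic_bins, classes):
    """RMS deviation of observed fractions from the least-biased bin.

    intrinsic_bins: list over q; each entry is a list over l of label lists,
    ordered by increasing observable (e.g., r/sigma_PSF), so the LAST
    observable bin is used as the estimate of the intrinsic fraction.
    Assumes one observable and a mean-square normalization (assumptions).
    """
    total, count = 0.0, 0
    for obs_bins in intrinsic_bins:
        fractions = [class_fractions(b, classes) for b in obs_bins]
        f_hat = fractions[-1]  # best-resolved bin as intrinsic estimate
        for k in classes:
            sigma2 = sum((f[k] - f_hat[k]) ** 2 for f in fractions) / len(fractions)
            total += sigma2
            count += 1
    return sqrt(total / count)


# One intrinsic bin, two observable bins each.
unbiased = [[["S", "E"], ["S", "E"]]]   # same fractions in both bins
biased = [[["E", "E"], ["S", "E"]]]     # spirals missing at low resolution
print(labeling_bias(unbiased, ("S", "E")), labeling_bias(biased, ("S", "E")))
```

Because the reference bin is itself one of the observable bins, its own term contributes zero; only the remaining bins can raise the value.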

Figure 3 shows an example of binning in intrinsic and observable parameters for Galaxy Zoo data. Here, we build the kd-tree splitting the data in terms of z, R, and M, creating a three-dimensional (3D) partition of the data. For the data falling in each of the intrinsic bins, we calculate the fraction of spiral and elliptical galaxies as a function of the observable parameter $r/{\sigma }_{\mathrm{PSF}}$. As $r/{\sigma }_{\mathrm{PSF}}$ decreases (smaller objects), the fraction of spiral galaxies decreases and the fraction of elliptical galaxies increases. In other words, smaller spiral galaxies are mistaken for ellipticals. In order to calculate our bias metric (Equation (3)), we need the intrinsic class fractions ${f}_{k,q}$. The least-biased bins in the observable parameters are the ones with the largest $r/{\sigma }_{\mathrm{PSF}}$, which we take as our estimate for the intrinsic class fractions. Figure 4 shows the fractions of spiral and elliptical galaxies in terms of the observable parameter $r/{\sigma }_{\mathrm{PSF}}$ for $2^{6}$ bins in intrinsic parameters. Independently of the bin in terms of z, R, and M, the fraction of spirals increases with $r/{\sigma }_{\mathrm{PSF}}$, while the fraction of ellipticals decreases. In order to calculate the data set bias, we use as intrinsic class fractions ${f}_{k,q}$ the fractions in the bin with the highest $r/{\sigma }_{\mathrm{PSF}}$, denoted by a dot in Figure 4.


Figure 3. Binning example for Galaxy Zoo biased data. The kd-tree splits the data in terms of the intrinsic properties z (center top and bottom), R, and M (solid line rectangles). For each of these three-dimensional bins, we calculate the fractions of objects in terms of the observable parameter r/σPSF. As the size of the galaxies diminishes, the fraction of spiral galaxies (dotted lines) decreases, and the fraction of ellipticals (dashed lines) increases. The least-biased bins in observable parameters are the ones with the highest r/σPSF, represented by a dot in the plots. We use these lowest bias bins as our estimation for the intrinsic fractions ${\hat{f}}_{k,q}$.


Figure 4. Fractions in terms of observable parameters for Galaxy Zoo biased data using $2^{6}$ bins in intrinsic parameters. As the angular size of the galaxies diminishes, the fraction of observed spiral galaxies (dotted lines) decreases, and the fraction of ellipticals (dashed lines) increases due to observational bias. The least-biased bins in observable parameters are the ones with the highest $r/{\sigma }_{\mathrm{PSF}}$, represented by a dot in the plots, which we use as our estimate for the intrinsic fractions ${\hat{f}}_{k,q}$.


3. Data Sets

In this section, we describe the data we use in our experiments. All data use the r band from SDSS (Abazajian et al. 2009), and we adopt the nine-year WMAP cosmology (Hinshaw et al. 2013) as implemented in astropy (The Astropy Collaboration et al. 2018).
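To make the role of the cosmology concrete, the sketch below converts observed quantities (angular radius, apparent magnitude, redshift) into the intrinsic ones used later (physical radius R in kpc and absolute magnitude M). It is a standard-library approximation, not the authors' pipeline: it assumes a flat ΛCDM model with H0 = 69.32 km/s/Mpc and Ωm = 0.2865 as stand-ins for the nine-year WMAP values, neglects radiation, and applies no K-correction. All function names are hypothetical. In practice the `astropy.cosmology` module would be the natural tool.

```python
from math import sqrt, log10, pi

C_KM_S = 299792.458   # speed of light [km/s]
H0 = 69.32            # assumed Hubble constant [km/s/Mpc]
OM0 = 0.2865          # assumed matter density; flat universe, radiation ignored


def comoving_distance_mpc(z, steps=1000):
    """Line-of-sight comoving distance via Simpson's rule (steps must be even)."""
    def inv_e(zp):
        return 1.0 / sqrt(OM0 * (1.0 + zp) ** 3 + (1.0 - OM0))
    h = z / steps
    s = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * inv_e(i * h)
    return (C_KM_S / H0) * s * h / 3.0


def physical_radius_kpc(r_arcsec, z):
    """Physical size R from angular size r via the angular diameter distance."""
    d_a = comoving_distance_mpc(z) / (1.0 + z)            # [Mpc]
    return r_arcsec * (pi / (180.0 * 3600.0)) * d_a * 1000.0


def absolute_magnitude(m, z):
    """Absolute magnitude M from apparent magnitude m (no K-correction)."""
    d_l_pc = comoving_distance_mpc(z) * (1.0 + z) * 1.0e6  # [pc]
    return m - 5.0 * log10(d_l_pc / 10.0)


print(physical_radius_kpc(1.0, 0.1), absolute_magnitude(16.0, 0.1))
```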

3.1. Eyeball Classifications

Fukugita et al. (2007) (hereafter F07) have visually classified 2275 galaxies, each by three experts. They defined a morphological index T such that T = 0, 1, 2, 3, 4, 5, 6 for E, S0, Sa, Sb, Sc, Sd, Im, respectively. In order to measure their bias, we focus on just the elliptical (+S0) galaxies (N = 941) having 0 ≤ T < 2, and the spirals (N = 902) having 2 ≤ T ≤ 5, since the other data sets we compare to only use these two classes. We cross-match these data to the SDSS DR7 to obtain their apparent magnitudes (Petrosian r-band), their apparent sizes (Petrosian r-band radii), and their redshifts.

We also use expert labels from Nair & Abraham (2010, NA10 hereafter), who have visually classified 14,034 spectroscopically targeted galaxies from the SDSS. They report T-Types as well as other morphological features such as bars, rings, lenses, and tails, among others. As with the F07 sample, we focus on elliptical (+S0) galaxies (N = 6276) having −5 ≤ TT < 1 and spirals (N = 7640) having 1 ≤ TT ≤ 8, where TT are their T-Types.

3.2. Galaxy Zoo

We use the Galaxy Zoo 1 data release (Lintott et al. 2011) and their sample with spectra in SDSS, which contains classifications for 667,944 galaxies achieved by crowdsourcing. We define two subsets of the Galaxy Zoo 1: (a) the original biased morphologies (hereafter GZB) and (b) the "de-biased" morphologies (hereafter GZD). The de-biasing procedure used is described in detail in Bamford et al. (2009) and Lintott et al. (2011). Briefly, they assumed that the morphological fraction within bins of fixed galaxy physical size and luminosity does not evolve over the redshift of their data. From that assumption, a bias correction term was estimated in bins of physical size and luminosity and then applied to the original spiral and elliptical classification probabilities. Their algorithm helped motivate our approach to quantify classification bias as described in Section 2.

We cross-match the Galaxy Zoo catalog to the SDSS DR7 to obtain the observed properties, including each galaxy's point-spread function (PSF, determined over the SDSS field). We used the SDSS field-specific psfWidth_r parameter as an estimate of the FWHM for a Gaussian PSF at the location of each galaxy. When a galaxy belongs to more than one field, we used the classification and properties from the field with the smallest PSF.
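The two bookkeeping steps in this paragraph can be sketched as follows. The row layout (keys `id`, `r`, `psf_fwhm`) is hypothetical, and converting the FWHM to a Gaussian σ via FWHM = 2√(2 ln 2) σ is an assumption about how $r/{\sigma}_{\mathrm{PSF}}$ is formed; the text only states that psfWidth_r is treated as a Gaussian FWHM.

```python
from math import log, sqrt

# For a Gaussian profile, FWHM = 2 * sqrt(2 * ln 2) * sigma.
FWHM_TO_SIGMA = 1.0 / (2.0 * sqrt(2.0 * log(2.0)))


def best_observations(rows):
    """Keep, for each galaxy id, the row observed with the smallest PSF."""
    best = {}
    for row in rows:
        gid = row["id"]
        if gid not in best or row["psf_fwhm"] < best[gid]["psf_fwhm"]:
            best[gid] = row
    return list(best.values())


def resolution(row):
    """Angular radius in units of the Gaussian PSF sigma (same angular units)."""
    return row["r"] / (row["psf_fwhm"] * FWHM_TO_SIGMA)


rows = [
    {"id": 1, "r": 3.0, "psf_fwhm": 1.4},
    {"id": 1, "r": 3.1, "psf_fwhm": 1.2},   # same galaxy, sharper field
    {"id": 2, "r": 5.0, "psf_fwhm": 1.5},
]
kept = best_observations(rows)
print([(row["id"], round(resolution(row), 2)) for row in kept])
```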

3.3. Supervised Learned Morphologies

Huertas-Company et al. (2011, hereafter HC11) used a support vector machine (SVM) classification model trained over the data set from Fukugita et al. (2007). The HC11 morphologies are probability densities, and so we defined elliptical (+S0s) galaxies as having a probability of being early-type P(Early) ≥ p and spiral galaxies having P(Spiral) ≥ p, where p takes values of 0.5 and 0.8. As with the previous data sets, we cross-match the HC11 data to the SDSS DR7 to ensure that all galaxies in our data sets have the same observed properties and that there are no duplicates.
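The thresholding rule can be written as a small helper. This is only an illustration of the rule as stated; the function name, the argument names, and the choice to leave galaxies unlabeled when neither probability reaches p are assumptions.

```python
def threshold_label(p_early, p_spiral, p=0.8):
    """Return 'E', 'S', or None when neither probability reaches threshold p."""
    if p_early >= p:
        return "E"
    if p_spiral >= p:
        return "S"
    return None  # ambiguous object, excluded from the two-class comparison


print(threshold_label(0.9, 0.1), threshold_label(0.3, 0.7, p=0.5), threshold_label(0.6, 0.4))
```

A stricter threshold (p = 0.8) trades sample size for label purity, which matters later because the number of objects per bin affects the bias estimate.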

3.4. Simulated Morphology Catalogs

In order to assess the validity of our method, we created a simulated catalog following the Galaxy Zoo 1 distribution of parameters. We used a kernel density estimation (see Hastie et al. 2009, and references therein) with a Gaussian kernel to estimate the joint distribution of angular Petrosian radius r, apparent Petrosian magnitude m, redshift z, PSF, and de-biased probabilities, using 100,000 galaxies randomly chosen from GZ1. Using these parameters, we calculate their physical Petrosian radius R, absolute Petrosian magnitude M, and ${r}_{\mathrm{PSF}}=r/{\sigma }_{\mathrm{PSF}}$. We consider ${r}_{\mathrm{PSF}}$ and m as the biasing parameters, so we artificially created this bias by changing the labels from spirals to ellipticals with a Gaussian probability depending on these parameters:

Equation (4)

where the probability of modifying a label from S to E depends on a biasing parameter θ, which controls the amount of bias in the data set. The higher the value of θ, the larger the amount of bias. Notice that this added bias is normalized in terms of ${r}_{\mathrm{PSF}}$ and m by using their median values ${\bar{r}}_{\mathrm{PSF}}$ and $\bar{m}$.
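The label-flipping step can be sketched as below. Since the explicit form of Equation (4) is not reproduced in this text, `flip_probability` is a hypothetical stand-in chosen only to share the stated properties: it is Gaussian-shaped, vanishes at θ = 0, grows with θ, and is normalized by the medians of ${r}_{\mathrm{PSF}}$ and m. It should not be read as the published expression.

```python
import random
from math import exp
from statistics import median


def flip_probability(r_psf, m, theta, r_med, m_med):
    """Hypothetical stand-in for Equation (4): 0 at theta = 0, and larger for
    poorly resolved (small r_psf) and faint (large m) galaxies."""
    x = (r_med / r_psf) ** 2 + (m / m_med) ** 2
    return 1.0 - exp(-0.5 * theta ** 2 * x)


def add_label_bias(galaxies, theta, seed=0):
    """Return a list of labels in which some spirals are relabeled 'E'."""
    rng = random.Random(seed)
    r_med = median(g["r_psf"] for g in galaxies)
    m_med = median(g["m"] for g in galaxies)
    labels = []
    for g in galaxies:
        p = flip_probability(g["r_psf"], g["m"], theta, r_med, m_med)
        flipped = g["label"] == "S" and rng.random() < p
        labels.append("E" if flipped else g["label"])
    return labels


catalog = [
    {"r_psf": 2.0, "m": 17.0, "label": "S"},
    {"r_psf": 8.0, "m": 15.0, "label": "S"},
    {"r_psf": 4.0, "m": 16.0, "label": "E"},
]
print(add_label_bias(catalog, theta=0.0), add_label_bias(catalog, theta=1.0))
```

Only spirals are ever changed, so the simulated bias is one-directional, matching the description of labels moving from S to E.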

4. Impact of Sampling over the Estimator

Equation (3) is a statistical measure of the classification bias for any data set with K classes and requires bin definitions on the observed properties ${\alpha }_{j}$ and multi-dimensional binning of the intrinsic properties ${\beta }_{j}$. In this section, we use the simulated morphology catalogs, which have varying degrees of bias, to examine the effects of how the bins are defined.

We bin the intrinsic properties of the data using kd-trees. A kd-tree is a data structure for storing a finite set of points from a k-dimensional space. It was examined in detail by Bentley (1975) and Friedman et al. (1977). kd-trees have the benefit of dividing the data into bins for optimal querying performance. They are well characterized in the literature and numerous libraries exist to build such trees. The total number of bins in these trees is $2^{n}$, where n is the height of the tree. As n increases, the bins get smaller, causing the number of points inside each bin to get smaller too. The dimension of our kd-tree depends on the number of intrinsic properties ${n}_{\beta }$ we are examining. In this effort, we use the absolute magnitude, the physical size, and the redshift for our tree.
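A kd-tree partition of this kind can be sketched with recursive median cuts. The code below is a minimal stand-in, not the authors' implementation (which may rely on an existing library): it assumes points are tuples such as (M, R, z), cycles through the coordinates level by level, and returns the $2^{n}$ leaf bins for a tree of height n.

```python
def kd_bins(points, height, depth=0):
    """Split points into 2**height leaf bins by recursive median cuts,
    cycling through the coordinates (dimensions) at each level."""
    if height == 0:
        return [points]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_bins(ordered[:mid], height - 1, depth + 1)
            + kd_bins(ordered[mid:], height - 1, depth + 1))


# Toy 3D points standing in for (M, R, z).
points = [(i % 7, (i * 3) % 11, i) for i in range(64)]
bins = kd_bins(points, 3)
print(len(bins), [len(b) for b in bins])
```

Cutting at the median keeps the occupancy of the leaves balanced, which is what makes the number of objects per bin controllable through the tree height alone.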

For the observed properties, we need to build a grid defining the ranges on each of the ${\alpha }_{j}$ observational parameters (e.g., the resolution limits within the bin). We choose a simple linear binning procedure such that the number of observed galaxies in each bin is equal.
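One way to obtain bins with equal occupancy is to sort the galaxies by the observable and cut the sorted list into equal parts. The helper below is a sketch of that idea (the name `equal_count_bins` is hypothetical); it returns indices rather than values so the labels of the same galaxies can be looked up afterwards.

```python
def equal_count_bins(values, n_bins):
    """Return index lists that split values into n_bins bins of (nearly)
    equal occupancy, ordered from the smallest to the largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    return [order[(b * n) // n_bins:((b + 1) * n) // n_bins] for b in range(n_bins)]


r_over_psf = [5.0, 1.0, 3.0, 2.0, 4.0, 6.0]
print(equal_count_bins(r_over_psf, 2))
```

The last index list holds the best-resolved galaxies, i.e., the bin used to estimate the intrinsic fractions.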

Having defined the bins on the intrinsic and observed properties of each galaxy, as well as the morphological classifications, we examine the robustness of the labeling bias estimator, Equation (3).

4.1. Finite-sampling Bias and Variance

The number of 3D bins on the intrinsic properties, the number of 1D bins in the observable parameters, as well as the total number of objects in each of the bins, combine to impact L. Because real data sets have finite size, the trade-off between bias and variance of our estimators has to be taken into account when defining the binning strategy. We use the simulations to show the impact of the selected binning strategy on our bias metric. Our results are shown in Figure 5 where the left panel is for simulated galaxies following GZD probability distributions (θ = 0.0) and the right panel shows a simulated bias of θ = 1.0.


Figure 5. Sampling effect over L as calculated over the simulated data sets. Left: unbiased simulations. Right: bias θ = 1.0. Labels indicate the number of bins in the intrinsic parameters obtained from the kd-tree, and the number of bins in the observable parameter: ${N}_{{ \mathcal B }}\times {N}_{{{ \mathcal A }}_{j,q}}$. The variance over ${f}_{j,l,q,k}$ increases as we diminish the number of objects per bin, increasing the value of L. Independently of the binning strategy chosen, our metric obtains a higher value for the biased simulated data set.


First, consider a simple fixed binning scenario where we allow the number of galaxies per bin to vary. Figure 5 shows this effect over simulations for different binning strategies. One can see that L decreases as a function of the square root of the total number of galaxies in each bin. There is a point after which adding more galaxies to each bin does not reduce L significantly. We use the shape of this curve to define the optimal number of total galaxies per bin.
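The finite-sampling floor can be reproduced with a small Monte Carlo experiment. The sketch below draws labels with the same true spiral fraction in every observable bin, so any nonzero value comes purely from binomial scatter. It assumes two observable bins, a reference fraction taken from the last bin, and a mean-square normalization; under those assumptions the expected value is about $\sqrt{f(1-f)/N}$ for N objects per bin. The function name is hypothetical.

```python
import random
from math import sqrt


def unbiased_bias_metric(n_per_bin, n_obs_bins=2, n_trials=200, f_spiral=0.5, seed=1):
    """RMS deviation between observed spiral fractions and the fraction in a
    reference bin when every bin shares the same true fraction f_spiral."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        fracs = [sum(rng.random() < f_spiral for _ in range(n_per_bin)) / n_per_bin
                 for _ in range(n_obs_bins)]
        f_hat = fracs[-1]  # reference (least-biased) bin
        total += sum((f - f_hat) ** 2 for f in fracs) / n_obs_bins
    return sqrt(total / n_trials)


small, large = unbiased_bias_metric(25), unbiased_bias_metric(900)
print(small, large)
```

The value for 25 objects per bin comes out several times larger than for 900, illustrating why a data set with no labeling bias still yields a nonzero L when bins are sparsely populated.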

Next, consider a fixed number of objects per bin and a fixed number of bins on the intrinsic properties. As one decreases the number of 1D bins in the observable parameters, due to the bias-variance trade-off (see Hastie et al. 2009), there will be a corresponding decrease in the statistical variance for the estimate of the fractions fj,l,k,q, at the expense of increasing the statistical bias. The extreme case is a single bin with very low variance. However, as shown in Figure 2, a single bin in ${{ \mathcal A }}_{j,l,q}$ provides no useful information on the bias we are trying to measure: at least two bins in the observed properties are required in order to track observational bias. Regardless, the decrease in variance (simply due to fewer bins) simultaneously decreases the value of the labeling bias L. This is shown in Figure 5 by considering curves with the same number of intrinsic bins and noting that L is always lowest for the fewest number of observed bins.

On the other hand, if we fix the number of objects per bin as well as the number of bins on the observed properties, then by decreasing the number of bins on the intrinsic parameters, we lose information about the true object fractions, thus causing an increase in the bias of the estimator for the intrinsic class fraction ${\hat{f}}_{k,q}$. This produces an increase of the differences between the observational class fractions and the estimate for the intrinsic class fractions in Equation (2), increasing the value of L. This is shown in Figure 5 by considering curves with equal number of bins in the observables noting that L is always lowest for the highest number of intrinsic bins.

Figure 5 allows us to define a binning procedure for any data set. Notice how the number of objects per bin and the binning impact the value of our bias metric L for data sets with the same amount of simulated bias. Also notice that L is always higher for the data set with higher simulated bias θ for a given binning strategy, which suggests that any binning strategy helps evaluate differences in biases between data sets, as long as enough objects per bin are considered.

Since the value of L can vary as a function of the binning, we must be careful to use the same binning procedure when conducting relative comparisons of one or more data sets, even if the binning is not optimal for any specific data set. In practice, when comparing the labeling bias for different sets of data, the binning strategy is defined by the data set with the smallest number of objects.

4.2. Choosing Number of Bins for Real Data

For the data sets in this work, we consider binning strategies that split all parameters (observable and intrinsic) into the closest number of bins. Given this constraint, we then search for the maximum number of bins such that the running slope of L in Figure 5 is $<{10}^{-3}$ for the maximum number of objects per bin allowed by our data set size. Because the real data are noisy, we calculate the mean value of L over 20 bootstrapping sub-samples, using the same number of bins for each intrinsic and observable parameter. The kd-tree automatically defines the multi-dimensional binning on the intrinsic parameters.
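The bootstrap averaging can be sketched generically. The helper below resamples a catalog with replacement and reports the mean and standard deviation of any statistic; in the present context the statistic would be the bias metric L evaluated on the resampled catalog. Resampling with replacement at the original sample size is an assumption, as the text does not specify how the sub-samples are drawn, and the name `bootstrap` is hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap(statistic, sample, n_boot=20, seed=0):
    """Mean and standard deviation of statistic over bootstrap resamples
    (drawn with replacement, same size as the original sample)."""
    rng = random.Random(seed)
    values = [statistic([rng.choice(sample) for _ in sample]) for _ in range(n_boot)]
    return mean(values), stdev(values)


def spiral_fraction(labels):
    """Toy statistic: fraction of spirals among the labels."""
    return sum(1 for y in labels if y == "S") / len(labels)


labels = ["S", "E"] * 50
print(bootstrap(spiral_fraction, labels))
```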

Special care has to be taken when comparing two or more data sets of different sizes. On a larger data set, we may be able to use more bins and/or number of objects per bin, but when comparing it to a smaller data set, this sampling is not going to be feasible to use. In order to make a fair comparison, we need to sample in terms of the smaller data set, so that biases and variance over the distribution of fractions are comparable.

5. Bias for Galaxy Morphologies

Now that we have defined how to choose the binning in Equation (3), we measure the classification bias L for the different data sets defined in Section 3. In Section 5.1, we follow the approach proposed by Bamford et al. (2009) and use the redshift as a way to quantify the morphological bias. Then, in Section 5.2, we use $r/{\sigma }_{\mathrm{PSF}}$ as our biasing observable parameter and R, M, and z as intrinsic parameters.

5.1. Redshift as Biasing Parameter

We start by following the approach proposed by Bamford et al. (2009) and consider redshift as a biasing parameter and physical radius and absolute magnitude as intrinsic parameters. The smallest data set is F07 with 1843 spirals and ellipticals. From this, we used the technique described in Section 4.2 to determine the binning. We find the best finite-sampling bias levels for a maximum binning size of eight in the intrinsic parameters and two in the observed parameters. We then apply this binning scheme to all of the data sets to measure the classification bias using Equation (3). For the F07 data set, we obtain 115 galaxies per bin, so we fix this number for all of the other data sets. The data from F07 and NA10 contain only galaxies with m < 16, so in order for the comparison of biases between data sets to be fair, we consider galaxies with m < 16 in the GZ and HC11 data sets. Figure 6(a) shows the bias for different data sets. Notice that the standard deviation of L makes it hard to draw statistically significant conclusions on the difference between data sets.


Figure 6. Bias L for different data sets considering z as the unique observable parameter and R and M as intrinsic parameters, as proposed by Bamford et al. (2009). The number of galaxies considered for measuring the bias increases from (a) to (c). Error bars show the standard deviation over 100 bootstrapping samples. (a) Using 2³ bins in intrinsic parameters, two bins in observable parameters, and 115 galaxies per bin in order to match the total number of galaxies of Fukugita et al. (2007, hereafter F07). (b) Using 2⁴ bins in intrinsic parameters, four bins in observable parameters, and 217 galaxies per bin in order to match the total number of galaxies of Nair & Abraham (2010, hereafter NA10). (c) Using 2⁸ bins in intrinsic parameters, 16 bins in observable parameters, and 58 galaxies per bin in order to match the total number of galaxies in the Galaxy Zoo Biased (GZB) sample (Bamford et al. 2009). Note that the machine learning classifications of Huertas-Company et al. (2011, hereafter HC11) use the F07 classifications for training the Support Vector Machine.


If we exclude F07, the smallest data set is NA10, with 13,916 galaxies. For these data, and using the procedure from Section 4.2, we find a maximum binning size of 2⁴ in the intrinsic parameters and four bins in the observable parameters (217 galaxies per bin). Figure 6(b) shows the results of the bias for each data set. The error bars now allow us to interpret these results with higher significance. The bias for expert annotators of NA10 is similar to that from HC11, and both are smaller than the bias of Galaxy Zoo. Now, we can see that our metric starts to recover the de-biasing procedure from Bamford et al. (2009): the value of L is lower for GZD than for GZB.

If we exclude both F07 and NA10, we can consider galaxies with m > 16. The smallest data set is then the Galaxy Zoo Biased (GZB) sample with 237,963 galaxies, so we are able to use a larger number of bins and thus obtain better estimates of L. Using the method described in Section 4.2, we obtain 2⁸ bins for the intrinsic parameters and 16 bins for the observable parameters, from which we can use 58 objects per bin. In Figure 6(c), we show the labeling bias as defined by Equation (3) using this binning strategy for GZB, GZD, and HC11. Now, we can clearly recover the de-biasing procedure proposed by Bamford et al. (2009). The highest values of L are obtained over the GZB data set, while the GZD data set achieves a significantly lower L. This shows that our proposed metric is capable of measuring biases given an assumption of intrinsic and biasing parameters. Again, the lowest labeling bias is obtained for HC11.
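The numbers of galaxies per bin quoted in this subsection can be checked with simple arithmetic. The sketch below assumes that the intrinsic bin counts are powers of two (2^3, 2^4, and 2^8, consistent with the "eight" intrinsic bins quoted for F07) and that the number of objects per bin is the sample size divided by the total number of bins, rounded down. The function name is hypothetical.

```python
def galaxies_per_bin(n_galaxies, n_intrinsic_bins, n_observable_bins):
    """Largest whole number of galaxies that fits in every bin."""
    return n_galaxies // (n_intrinsic_bins * n_observable_bins)


# Sample sizes and binning quoted in the text for the three comparisons.
f07 = galaxies_per_bin(1843, 2**3, 2)
na10 = galaxies_per_bin(13916, 2**4, 4)
gzb = galaxies_per_bin(237963, 2**8, 16)
print(f07, na10, gzb)  # 115 217 58
```

Under these assumptions, the three results match the 115, 217, and 58 galaxies per bin quoted above.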

5.2. Apparent Radius as Biasing Parameter

Whereas Bamford et al. (2009) used redshift as the parameter with which to characterize and correct labeling bias, in this section we treat the apparent size as the parameter that governs bias. Relative to the PSF, it is the apparent size of a galaxy that determines whether spiral features are washed out and become undetectable. We then include redshift as an intrinsic parameter since we expect it to play a role in the underlying fraction of spirals and ellipticals, which we know to evolve over time (Buitrago et al. 2013; Huertas-Company et al. 2015; Cerulo et al. 2017). There is a concern that apparent size as an observable parameter is degenerate with the combination of the redshift and the physical size for any galaxy: a small nearby galaxy can have the same apparent size as a large and more distant galaxy. However, by also including the absolute magnitude as an intrinsic parameter, this degeneracy is broken. In other words, a small and a large galaxy with the same apparent size will never be in the same bin, since the small (and thus intrinsically dim) galaxy will appear in a different magnitude bin than the large (and intrinsically bright) galaxy.
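The degeneracy described above can be illustrated with a rough calculation. The sketch below assumes a low-redshift Hubble-law distance, d ≈ cz/H0 with H0 = 70 km/s/Mpc, and the small-angle approximation; the two example galaxies and the function name are hypothetical and are not taken from any of the catalogs used here.

```python
import math

C_KM_S = 299792.458        # speed of light in km/s
H0 = 70.0                  # assumed Hubble constant in km/s/Mpc
ARCSEC_PER_RAD = 206264.806


def apparent_radius_arcsec(radius_kpc, z):
    """Small-angle apparent radius using the low-z distance d = c z / H0."""
    distance_kpc = C_KM_S * z / H0 * 1000.0
    return radius_kpc / distance_kpc * ARCSEC_PER_RAD


# A small nearby galaxy and a galaxy twice as large at twice the redshift.
theta_small = apparent_radius_arcsec(radius_kpc=5.0, z=0.02)
theta_large = apparent_radius_arcsec(radius_kpc=10.0, z=0.04)
print(theta_small, theta_large)
```

Both galaxies subtend the same angle in this approximation, so the apparent size alone cannot separate them; their intrinsic parameters (R, z, and in general M) differ, which is what places them in different intrinsic bins.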

We start with the smallest data set, F07, with 1843 spirals and ellipticals. We find the best finite-sampling bias levels for a maximum of eight bins in the intrinsic parameters and two in the observed parameters, obtaining 115 galaxies per bin. Note that this is the minimum bin size we can apply due to the number of intrinsic and biasing (observed in this case) properties in the data. Recall that the data from F07 and NA10 contain only galaxies with m < 16, so again we consider only galaxies with m < 16 in the GZ and HC11 data sets. Figure 7(a) shows the biases under these assumptions for the different data sets. Again, due to the size of the standard deviation error bars of L, it is hard to draw statistically significant conclusions about the differences between the data sets.


Figure 7. Bias L for different data sets considering r/σPSF as the unique observable parameter and R, M, and z as intrinsic parameters. The number of galaxies considered for measuring the bias increases from (a) to (c). Error bars show the standard deviation over 100 bootstrapping samples. (a) Using 2³ bins in intrinsic parameters, two bins in observable parameters, and 115 galaxies per bin in order to match the total number of galaxies of Fukugita et al. (2007) (F07). (b) Using 2⁵ bins in intrinsic parameters, three bins in observable parameters, and 144 galaxies per bin in order to match the total number of galaxies of Nair & Abraham (2010) (NA10). (c) Using 2⁸ bins in intrinsic parameters, four bins in observable parameters, and 232 galaxies per bin in order to match the total number of galaxies in the Galaxy Zoo Biased (GZB) sample (Bamford et al. 2009). Note that the machine learning classifications of Huertas-Company et al. (2011, hereafter HC11) use the F07 classifications for training the Support Vector Machine.


If we exclude F07, we find a maximum binning size of 2⁵ in the intrinsic parameters and three bins in the observable parameters, with 144 galaxies per bin. Figure 7(b) shows the results of the bias for each data set. HC11 presents the lowest bias. Expert labels from NA10 are less biased than Galaxy Zoo. With this number of galaxies, there is no statistically significant difference between GZB and GZD for a given probability threshold.

If we exclude both F07 and NA10, we are able to consider galaxies with m > 16 and use 2⁸ bins for the intrinsic parameters and four bins for the biasing parameter r/σPSF, from which we obtain 232 objects per bin. In Figure 7(c), we show the labeling bias as defined by Equation (3) using this binning strategy for GZB, GZD, and HC11. The highest values of L are obtained over the GZB data sets, and the lowest labeling bias is obtained for HC11. With this amount of data, we notice that GZD with p > 0.5 shows a smaller amount of bias than GZB. At the same time, by choosing p > 0.8, the selected GZD data set is significantly more biased than the GZD data with p > 0.5 and closer to GZB for p > 0.5. In other words, it appears that the de-biasing procedure implemented in Bamford et al. (2009) for Galaxy Zoo classifications does not work when the vast majority of classifiers agree on the morphological type.

We explore this interesting result further in Figure 8, where we plot the bias L as a function of an increasing Galaxy Zoo classification probability threshold. For the biased sample, we see no clear trend. However, the de-biased sample shows a trend of increasing bias with increasing classification probability threshold.
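To make the role of the probability threshold concrete, the sketch below shows one way to turn per-galaxy classification probabilities into class labels for a given threshold. The function name, the label encoding, and the toy probabilities are assumptions for illustration and do not use the actual Galaxy Zoo catalog columns. Raising the threshold keeps only galaxies on which classifiers agree strongly, so the labeled sample shrinks as p increases.

```python
import numpy as np


def label_by_threshold(p_spiral, p_elliptical, threshold):
    """Return 1 for spiral, 0 for elliptical, -1 when neither exceeds the threshold.

    Assumes threshold >= 0.5 and p_spiral + p_elliptical <= 1, so that at most
    one class can pass the cut.
    """
    p_spiral = np.asarray(p_spiral, dtype=float)
    p_elliptical = np.asarray(p_elliptical, dtype=float)
    labels = np.full(p_spiral.shape, -1, dtype=int)
    labels[p_elliptical > threshold] = 0
    labels[p_spiral > threshold] = 1
    return labels


# Hypothetical classification probabilities for five galaxies.
p_sp = [0.90, 0.60, 0.30, 0.10, 0.50]
p_el = [0.05, 0.30, 0.60, 0.85, 0.50]
labels_05 = label_by_threshold(p_sp, p_el, threshold=0.5)
labels_08 = label_by_threshold(p_sp, p_el, threshold=0.8)
print(labels_05, labels_08)
```

The bias L would then be measured separately on the labeled subset obtained at each threshold.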


Figure 8. Bias L for the Galaxy Zoo Biased sample (GZB) and the Galaxy Zoo De-biased sample (GZD) vs. the probability threshold used to define the classes. Notice that the GZB bias does not significantly decrease with increasing probability threshold and that the GZD bias increases with increasing probability threshold. An explanation for these unexpected trends is discussed in the text.


We can explain Figure 8 in the following way. First, Bamford et al. (2009) use a statistical correction (their Equations (A3) and (A4)) that depends on both the raw classification probabilities and the intrinsic characteristics of the galaxy (e.g., absolute magnitude, physical Petrosian radius, redshift). This form of correction was chosen under the assumption that at high classification probabilities, no morphology adjustment should be applied since the labels would be correct (see Figure A9 of Bamford et al. 2009). Thus, the fact that the classification bias is closer to the one from GZB at high p in Figure 8 stems from the design of the classification adjustment formalism. Since the correction term approaches zero at high p, the sample reverts to the same level of bias inherent in the nominal biased sample.

What is perhaps surprising is that for the GZB sample, the level of bias does not decrease as the classifications reach higher levels of confidence (high p). Recall from the Introduction that our algorithm aims to quantify the presence of classification bias due to mislabeled data. We noted that such mislabeling error is not a statistical labeling error, but instead an intrinsic error related to the quality of data itself (see Figure 1). The high bias at p = 0.5 in the GZB sample is to be expected, especially for spirals, when the data quality is low or when the classifiers are non-expert. As noted earlier, it can be difficult to distinguish between spirals and ellipticals due to the data quality at low brightness or small apparent size. At p > 0.8, it should have been easy for classifiers to have identified morphologies since the classifications from different classifiers agree. This is likely to be true for spirals, but it is nearly impossible for the classifiers to separate ellipticals from spirals when the data quality is bad. When the data is bad, the classifications will always tend toward elliptical with high confidence. In other words, while it is almost certainly the case that p > 0.8 spiral galaxies are true spirals, p > 0.8 ellipticals are not always true ellipticals. Thus, the formalism to adjust classifications for ellipticals should not converge to the raw classification, even at high p.

An alternative approach to correct biased labels is to produce a set of simulated calibration images. These images are degraded versions of high-quality images, for which the ground truth labels can be accurately estimated. Galaxy Zoo: Hubble (Willett et al. 2017) labels such images through their interface, producing a set of biased labels with their corresponding ground truth labels. Their correction term allows high classification probabilities to be adjusted. Measuring biases on such corrected labels would be very interesting, but it is slightly out of the scope of this paper; here, we present a metric to assess biases and show an application to low-redshift galaxies. We plan to address biases for higher-redshift galaxies in the future, including Galaxy Zoo: Hubble.

The final question regarding Figure 7 is why the machine learning algorithms perform better than the training sets they used. Under perfect conditions, the learned classifications should recover any biases inherent to the input training sets. Recall that HC11 uses an SVM supervised machine learning algorithm that is trained on the F07 data set. However, since the bias in the F07 data is higher than in the HC11 data, we conclude that the supervised machine learning technique used by Huertas-Company et al. (2011) was able to mitigate the biases inherent in their training sets.

It is important to recognize that labeling bias mitigation can only occur if the "correct" choice of observed features is used in the machine learning training sets. The term "correct" simply means that the chosen observable parameters can in fact cause bias and that this bias can be removed with additional truth information. In other words, the bias caused by the apparent sizes can be fixed by leveraging information about the true sizes and absolute magnitudes. As a counterexample, one should be able to feed the SVM with an n-dimensional set of observed parameters that precisely recovers the training set classification (i.e., 100% accuracy with respect to the original training set). In this case, one would have the same level of bias in the SVM-trained classifications as in the eyeballed training set. In the particular case of HC11, they used features such as colors, shape, and concentration. These features correlate with morphological types independently of observational parameters, such as resolution. Therefore, the set of observed galaxy parameters chosen by HC11 enables the machine learning algorithm to minimize the biases of the Fukugita (F07) classifications. At the same time, a machine learning model trained over features such as colors will not be able to correctly identify the morphologies of outlier galaxies, such as red spirals or blue ellipticals. These galaxies may still be recognized by a human from a relatively high-quality image. Machine learning models and eyeball labels may therefore be used in a complementary way to obtain scientifically interesting outliers.

6. Conclusions

Observational parameters, such as resolution, can bias the procedure of human labeling of galaxies. We have developed a metric to assess systematic mislabeling of galaxy morphologies that incorporates information about the galaxy intrinsic parameters, such as their true sizes and absolute magnitudes. Our algorithm requires that the true (but unknown) fractions for the classes be constant when binned against their intrinsic parameters. We then quantify the mean deviation of the fraction of objects from the estimated intrinsic fraction in terms of their observational parameters.
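The idea summarized above can be sketched in a few lines. The snippet below is a simplified stand-in and not Equation (3) itself, whose exact form and normalization are not reproduced in this section: for every intrinsic bin it estimates the intrinsic spiral fraction, splits the bin into equal-count sub-bins of the observable, and returns the root mean square deviation of the sub-bin fractions from the intrinsic fraction. The function names, the toy data, and the bootstrap with fixed bin labels are all assumptions made for illustration.

```python
import numpy as np


def bias_sketch(is_spiral, intrinsic_bin, observable, n_obs_bins):
    """RMS deviation of per-observable-bin class fractions from the
    fraction estimated over the whole intrinsic bin (illustrative only)."""
    is_spiral = np.asarray(is_spiral, dtype=float)
    intrinsic_bin = np.asarray(intrinsic_bin)
    observable = np.asarray(observable, dtype=float)
    sq_dev = []
    for b in np.unique(intrinsic_bin):
        members = np.flatnonzero(intrinsic_bin == b)
        frac_intrinsic = is_spiral[members].mean()
        # Equal-count split of this intrinsic bin along the observable.
        order = members[np.argsort(observable[members], kind="stable")]
        for chunk in np.array_split(order, n_obs_bins):
            if len(chunk) > 0:
                sq_dev.append((is_spiral[chunk].mean() - frac_intrinsic) ** 2)
    return float(np.sqrt(np.mean(sq_dev)))


def bootstrap_bias(is_spiral, intrinsic_bin, observable, n_obs_bins, n_boot, rng):
    """Mean and standard deviation of the sketch metric over bootstrap resamples."""
    is_spiral = np.asarray(is_spiral, dtype=float)
    intrinsic_bin = np.asarray(intrinsic_bin)
    observable = np.asarray(observable, dtype=float)
    n = len(is_spiral)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        values.append(
            bias_sketch(is_spiral[idx], intrinsic_bin[idx], observable[idx], n_obs_bins)
        )
    return float(np.mean(values)), float(np.std(values))


# Toy catalog: labels that ignore the observable vs. labels that depend on it.
rng = np.random.default_rng(0)
n = 4000
intrinsic_bin = rng.integers(0, 8, size=n)
observable = rng.uniform(size=n)
unbiased = rng.uniform(size=n) < 0.5
biased = rng.uniform(size=n) < 0.2 + 0.6 * observable
mean_u, std_u = bootstrap_bias(unbiased, intrinsic_bin, observable, 4, 20, rng)
mean_b, std_b = bootstrap_bias(biased, intrinsic_bin, observable, 4, 20, rng)
print(mean_u, std_u, mean_b, std_b)
```

In this toy setting the labels whose spiral probability grows with the observable yield a larger value than the labels drawn independently of it, which is the behavior a labeling bias metric is meant to capture. The value for the unbiased labels is not zero because of finite sampling, which is why the binning and the number of objects per bin matter.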

We then conduct a relative comparison of labeling bias for expert, citizen-science, and machine learning-based galaxy classifications between spirals and ellipticals (+S0s). We find that, when enough data is provided, the bias in expert labels is lower than that in the citizen-science labels at a statistically significant level. We use our metric to recover the Galaxy Zoo de-biasing procedure, under the assumption that labels are biased in terms of the redshift. By using the resolution of the labeled images as the biasing parameter instead, we show that our metric is able to find biases that have not been addressed. These biases may be statistically corrected in the future in the same manner as in Galaxy Zoo. The classifications that use machine learning techniques show the lowest levels of bias, even when they are trained on biased "gold standards". We conclude that future large-scale morphological classification efforts should employ a combination of human classifications and machine learning in order to minimize labeling bias.

In this paper, we have focused on the problem of galaxy morphologies. However, our approach may be applied to any other labeled data set where intrinsic information can be inferred. We have made our code publicly available at https://github.com/guille-c/labeling_bias (footnote 5) so that it can be used by the galaxy evolution community or for any other classification problem.

We wish to thank Nancy Hitschfeld, Benjamín Bustos, Eduardo Vera, Jaime San Martín, Chris Smith, and Alfredo Zenteno for valuable discussion and supporting our project.

G.C.V. gratefully acknowledges financial support from CONICYT-Chile through its FONDECYT postdoctoral grant No. 3160747; CONICYT-Chile and NSF through the Programme of International Cooperation project DPI201400090; Basal Project PFB-03; the Ministry of Economy, Development, and Tourism's Millennium Science Initiative through grant IC120009, awarded to The Millennium Institute of Astrophysics (MAS). C.J.M. was supported by the National Science Foundation under grant No. 1256260. Powered@NLHPC: This research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02). Most of the table operations and plots were done using TOPCAT (Taylor 2005) and matplotlib (Hunter 2007). We used numpy (Oliphant 2006), scipy (Oliphant 2007), and pandas (McKinney 2010) for numerical computations. The kernel density estimation model was trained using scikit-learn (Pedregosa et al. 2011).

Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.

Based on observations made with the NASA/ESA Hubble Space Telescope, and obtained from the Hubble Legacy Archive, which is a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESA) and the Canadian Astronomy Data Centre (CADC/NRC/CSA).

Footnotes

  5. Licensed under the terms of the GNU General Public License v3.0.
