REDUCING THE DIMENSIONALITY OF DATA: LOCALLY LINEAR EMBEDDING OF SLOAN GALAXY SPECTRA

Published 2009 September 30 © 2009. The American Astronomical Society. All rights reserved.
Citation: Jake Vanderplas and Andrew Connolly 2009 AJ 138 1365. DOI: 10.1088/0004-6256/138/5/1365

ABSTRACT

We introduce locally linear embedding (LLE) to the astronomical community as a new classification technique, using Sloan Digital Sky Survey spectra as an example data set. LLE is a nonlinear dimensionality reduction technique that has been studied in the context of computer perception. We compare the performance of LLE to well-known spectral classification techniques, e.g., principal component analysis and line-ratio diagnostics. We find that LLE combines the strengths of both methods in a single, coherent technique, and leads to improved classification of emission-line spectra at a relatively small computational cost. We also present a data subsampling technique that preserves local information content, and proves effective for creating small, efficient training samples from large, high-dimensional data sets. Software used in this LLE-based classification is made available.

1. INTRODUCTION

With the development of large-scale astronomical photometric and spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS; York et al. 2000), the Two Micron All Sky Survey (2MASS; Skrutskie et al. 2006), the Panoramic Survey Telescope and Rapid Response System (PanSTARRS; Kaiser et al. 2002), and the Canada–France–Hawaii Telescope Legacy Survey (CFHTLS), the amount of data available to astronomers has increased dramatically. While these data sets enable much new science, we face the question of how to extract information efficiently from data sets that are many terabytes in size and that contain a large number of measured attributes. In other words, how do we classify objects in a physically meaningful way when the data span a large number of dimensions and we do not know a priori which of those dimensions are important?

The question of the classification of astronomical data remains open. We can, however, draw upon a number of successful classification schemes, such as Hubble's morphological classification of galaxies and Morgan's spectral classification of stars, and note the attributes they have in common. First, classifications must be simple: Hubble's initial classification into seven different types remains the dominant scheme used in astronomy despite numerous extensions based on subtypes (e.g., de Vaucouleurs 1959; van den Bergh 1976), concentration indices and asymmetry parameters (Abraham et al. 1994), clumpiness (Conselice 2003), Gini coefficients (Abraham et al. 2003), and neural networks (Lahav et al. 1996). Second, they must be physically meaningful: the Hubble morphological scheme relates to the dynamical history of the galaxy. Finally, the number of classes must account for the intrinsic uncertainty in the properties of the observed sources (i.e., we do not want to over-fit the data if the scatter in the classifications is large).

These principles can be applied to spectral data, which have the advantage of being easier to map to physical properties of sources than images. Continuum shapes, emission line ratios, and absorption indices can be used as proxies for star formation rate, stellar age, dust absorption, and metallicity (Heavens et al. 2000; Reichardt et al. 2001). As such, spectral classification, and in particular statistical techniques for spectral classification, is more broadly used than morphological classification. Examples of this include classifications based on color (Cramer & Maeder 1979; Kunzli et al. 1997), spectral indices (e.g., Lick indices; Gulati et al. 1999), model fitting (Bruzual & Charlot 2003), and more general nonparametric techniques (Connolly et al. 1995; Folkes et al. 1996; Yip et al. 2004a; Richards et al. 2009).

The difficulty that spectral classification poses comes from the inherent size and dimensionality of the data. For example, the SDSS spectroscopic survey comprises over 1.5 million spectra, each with 4000 spectral elements covering 3800 Å–9800 Å. Providing a simple and physical classification requires that we reduce the dimensionality of the data in a way that preserves as many of the physical correlations as possible. A number of techniques have been developed. One such approach that has met with success is the Karhunen–Loeve (K–L) transform, or principal component analysis (PCA). PCA is a linear projection of data onto a set of orthogonal basis functions (eigenvectors). It has been applied to galaxy spectra (Connolly et al. 1995; Folkes et al. 1996; Sodre & Cuevas 1997; Bromley et al. 1998; Galaz & de Lapparent 1998; Ronen et al. 1999; Folkes et al. 1999; Yip et al. 2004a; Budavári et al. 2009), QSOs (Francis et al. 1992; Boroson & Green 1992; Yip et al. 2004b) and stellar spectra (Singh et al. 1998; Bailer-Jones et al. 1998) as well as to measured attributes of stars and galaxies (Efstathiou & Fall 1984). The PCA projection for galaxy spectra has been shown to correlate with star formation rate (Connolly et al. 1995; Madgwick et al. 2003c) and is now actively used as a classification scheme in the SDSS and 2dF spectroscopic surveys.

As discussed in this paper, the linearity of the K–L transform, while providing eigenspectra that are relatively simple to interpret, has an underlying weakness: it cannot easily (nor compactly) express inherently nonlinear relations within the data (such as dust obscuration or the variation in spectral line widths). Spectra with a broad range of line widths require a large number of eigenspectra to capture their intrinsic variance, which often results in the continuum emission and emission lines being treated independently (Z. Györy et al. 2009, in preparation). In this paper, we consider a new approach to classification, using a dimensionality reduction technique that preserves the local structure within a data set: locally linear embedding (LLE; Roweis & Saul 2000). In Section 2, we discuss the properties of spectral classification using PCA and line ratios. In Sections 3 and 4, we introduce LLE and apply it to the classification of spectra from the SDSS. In Section 5, we consider some of the practical details related to the LLE procedure. In Section 6, we discuss the utility of LLE as a dimensionality reduction scheme as well as its performance relative to classical spectral classification techniques.

2. DIMENSIONALITY REDUCTION AND CLASSIFICATION OF SPECTRAL DATA

2.1. Principal Component Analysis

Over the last decade, the PCA or K–L classification has been applied to a wide range of spectral classification problems. From nonparametric classification (Connolly et al. 1995) to extracting star formation properties (Ferreras et al. 2006) to identifying supernovae within SDSS spectra (Madgwick et al. 2003b; Krughoff et al. 2009), the many applications of PCA have demonstrated the importance of dimensionality reduction in classification. This technique seeks to decompose a data set into a linear combination of a small number of principal components, such that each principal component describes the maximum possible variance:

$S(\lambda) = \sum_i a_i\, e_i(\lambda), \qquad (1)$

where S(λ) is the spectrum to be projected, ai are the expansion coefficients, and ei(λ) are the orthogonal eigenspectra.

Schematically, PCA is equivalent to finding a best-fit linear subspace to the entire data set, such that the variance of the data projected into this subspace is maximized. Yip et al. (2004a) present a robust PCA approach to spectral classification of the SDSS spectra. In it, they find that the information within the SDSS main galaxy sample can be almost completely characterized by the first dozen eigenmodes. This represents a reduction in data size by a factor of nearly 400 over the complete spectra. Furthermore, the coefficients of the first three eigencomponents correlate with physical properties of the galaxies such as star formation rate and post-starburst activity. In fact, PCA decomposition can lead to a single parameter description of galaxy spectra: the mixing angle ϕ between the first and second eigencoefficients. This mixing angle has been shown to correlate well with the galaxy spectral type (Connolly et al. 1995; Madgwick et al. 2003a) and can be used for an approximate separation between quiescent and emission-line galaxies.
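As an illustration of the decomposition in Equation (1) and of the mixing angle ϕ between the first two eigencoefficients, the computation can be sketched in a few lines of NumPy. This is a minimal sketch, not the pipeline used in this work; the function name is chosen for this example, and the random array stands in for real preprocessed spectra.

```python
import numpy as np

def pca_eigenspectra(spectra, n_components=3):
    """K-L decomposition of an (N, n_wavelength) array of spectra.

    Returns the mean spectrum, the leading eigenspectra e_i(lambda), and
    the expansion coefficients a_i of Equation (1) for each input spectrum.
    """
    mean = spectra.mean(axis=0)
    residuals = spectra - mean
    # SVD of the residual matrix gives the principal components directly:
    # the rows of Vt are orthonormal eigenspectra, ordered by variance.
    U, s, Vt = np.linalg.svd(residuals, full_matrices=False)
    eigenspectra = Vt[:n_components]
    coeffs = residuals @ eigenspectra.T   # a_i for each spectrum
    return mean, eigenspectra, coeffs

# Mixing angle between the first and second eigencoefficients, which
# correlates with spectral type (Connolly et al. 1995; Madgwick et al. 2003a).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(100, 1000))    # stand-in for real spectra
mean, eig, a = pca_eigenspectra(spectra)
phi = np.arctan2(a[:, 1], a[:, 0])
```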

For continuum emission, PCA has a proven record in representing the variation in the spectral properties of galaxies. It does not, however, perform well when reconstructing high-frequency structure within a spectrum (i.e., the distribution of emission lines, line ratios, and line widths). This is due to two effects: principal components preserve the total variance of the system, i.e., the overall color of the spectrum, and principal components express linear relations between components. High-frequency, or local, features do not contribute significantly to the total variance and are, therefore, not represented in the primary eigencomponents. Variations in line widths and, often, line ratios are also inherently nonlinear in nature (due to the impact of dust or variations in the galaxy mass) and are not compactly represented as a linear combination of orthogonal components. This is shown in Yip et al. (2004a), where the majority of galaxies can be represented by three eigenspectra but emission-line galaxies require up to eleven components to express their variance. PCA therefore performs poorly when distinguishing between emission-line galaxies, e.g., separating broad-line QSOs from narrow-line QSOs or star-forming galaxies.

2.2. Line-ratio Diagrams

Line-ratio diagrams, sometimes known as BPT plots (Baldwin et al. 1981) or Osterbrock diagrams (Osterbrock & De Robertis 1985), are one answer to the problems associated with PCA classification. Whereas PCA classifies on the basis of the total variance within a spectrum, which weights the continuum emission more than individual emission lines, line-ratio diagrams are sensitive to emission-line characteristics while ignoring continuum properties. This makes them appropriate for diagnostics that distinguish between galaxies with narrow emission lines. Line-ratio diagrams, however, do not account for continuum emission that might provide additional information on the physical state of a galaxy. They are ill-suited to classify nonemitting galaxies and galaxies with a low signal-to-noise ratio (S/N) for individual lines, and fail completely for galaxies with certain types of emission, e.g., broad-line QSOs (often defined as those with emission line FWHM larger than ∼1200 km s−1). Line-ratio diagrams, then, are useful for the classification of only a small subset of observed objects.

2.3. A Joint Approach to Spectral Classification

Clearly, PCA and line-ratio diagrams serve complementary purposes. PCA takes into account broad, low-frequency features, and can distinguish galaxies based on their continuum properties. Line-ratio diagrams, by design, depend only on the detailed features of emission lines, and therefore distinguish spectra based on their narrow, high-frequency features. If both types of information could be combined into a joint classification scheme, one that treats the spectrum as a whole rather than trying to isolate individual features, more insight may be gained into the physical state of a given object. This would be particularly helpful for automated classification within spectroscopic surveys, where a single, coherent technique could be used to identify objects that could then be analyzed further by specialized pipelines. Such a general classification must be sensitive to both low- and high-frequency spectral information, must account for nonlinear relationships between spectral properties, should be robust to outliers within the data, and should express the properties of a galaxy in terms of a small number of physically motivated parameters. In this paper, we consider LLE as a solution to these requirements.

3. LOCALLY LINEAR EMBEDDING

LLE is a nonlinear dimensionality reduction technique that seeks to find a low-dimensional projection of a higher dimensional data set that best preserves the geometry of local neighborhoods within the data (Roweis & Saul 2000). It can be thought of as a nonlinear generalization of PCA: rather than projecting onto a global linear hypersurface, the projection is to an arbitrary nonlinear surface constrained locally within the overall parameter space (see Figure 1). In this way, it is superficially similar to, e.g., nonlinear PCA and principal curves/manifolds (see Einbeck et al. 2007 for an introduction to and astronomical application of these techniques). These generalizations seek a lower dimensional projection that optimally represents the information contained in each point, such that the original data can be reconstructed from the projection. LLE, on the other hand, seeks a projection that optimally represents the relationship between nearby points. An accurate reconstruction of data is possible to an extent (see Section 5.4), but is not the primary goal of an LLE projection.

Figure 1. Canonical "S-curve" test set for nonlinear data reduction. Top: N = 2000 points randomly sampled from a bounded two-manifold (Equation (5)) embedded in three-space. Points are colored to highlight the connection of local neighborhoods. Bottom: the two-dimensional LLE projection of this manifold, with K = 15. Note that points that are clustered in the three-dimensional parameter space are also clustered in the two-dimensional embedding space. This shows that LLE reconstructs the desired embedding.

The LLE algorithm consists of two parts: first, a set of local weights is found that parametrize the linear reconstruction of each point from its nearest neighbors; second, a global minimization is performed to determine a lower dimensional analog of the data set that best preserves these local reconstruction weights.

3.1. Description of the LLE Algorithm

For notational conventions, refer to Appendix A. We initially consider a series of points in a high-dimensional space xi (in the case of the SDSS, each spectrum is represented by a point in a Din = 4000 dimensional space) such that the set of these points, X, is given by $\mathbf {X} = [\mathbf {x_0},\mathbf {x_1},\mathbf {x_2},\ldots, \mathbf {x_N}],\ \mathbf {x_i} \in \mathbb {R}^{D_{{\rm in}}}$. We map this to a lower dimensional space $\mathbf {Y} = [\mathbf {y_0},\mathbf {y_1},\mathbf {y_2},\ldots,\mathbf {y_N}],\ \mathbf {y_i} \in \mathbb {R}^{D_{\rm out}}$, with Din > Dout. For each point xi, let n(i) = [n(i)1, n(i)2, ..., n(i)K]T be the indices of its nearest neighbors, such that $\mathbf {x}_{n_j^{(i)}}$ is the jth nearest neighbor of xi. Also, let w(i) = [w(i)1, w(i)2, ..., w(i)K]T represent the reconstruction weights associated with the K nearest neighbors of xi.
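The neighbor index lists n(i) can be built efficiently with a KD-tree. A minimal sketch using SciPy (an implementation choice for this example, not a statement about the released software):

```python
import numpy as np
from scipy.spatial import cKDTree

def neighbor_indices(X, K):
    """Return an (N, K) array whose i-th row holds n(i): the indices of
    the K nearest neighbors of x_i (excluding x_i itself), ordered by
    increasing distance."""
    tree = cKDTree(X)
    # Query K+1 neighbors, since the closest "neighbor" of each point
    # is the point itself; drop that first column.
    _, idx = tree.query(X, k=K + 1)
    return idx[:, 1:]
```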

The key assumption is that each point xi lies near a locally linear, low-dimensional manifold, such that a tangent hyperplane is locally a good fit. If this is the case, a point can be accurately represented as a linear combination of its K nearest neighbors, with $K > D_{\rm out}$. The error in reconstruction can be thought of as the distance from this manifold to the point in question. A convenient form of this error is the reconstruction cost function

$\mathcal{E}_1^{(i)}(\mathbf{w}^{(i)}) = \Big| \mathbf{x}_i - \sum_{j=1}^{K} w_j^{(i)}\, \mathbf{x}_{n_j^{(i)}} \Big|^2. \qquad (2)$

By minimizing $\mathcal {E}_1^{(i)}(\mathbf {w}^{(i)})$ subject to the constraint

$\sum_{j=1}^{K} w_j^{(i)} = 1, \qquad (3)$

we find the optimized local mapping. This local mapping has two important properties: first, it is invariant under scaling, translation, and rotation; second, it encodes the relationship between points of the local neighborhood in a way that is agnostic to the dimensionality of the parameter space; i.e., it equally well encodes the properties of the neighborhood in observational parameter space, as well as the properties of the neighborhood as projected onto the local tangent hyperplane. These two facts lie at the core of the LLE algorithm.
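In practice, minimizing the cost of Equation (2) under the sum-to-one constraint of Equation (3) reduces to a linear solve against the local Gram matrix of the neighborhood. A sketch of this step (the function name and the regularization constant are illustrative choices; some regularization is standard when K exceeds the local dimensionality):

```python
import numpy as np

def reconstruction_weights(x_i, neighbors, reg=1e-3):
    """Minimize Equation (2) subject to Equation (3) for a single point.

    x_i       : (D,) point to reconstruct
    neighbors : (K, D) coordinates of its K nearest neighbors
    reg       : Tikhonov regularization, needed when K > D or when the
                local Gram matrix is otherwise singular
    """
    Z = neighbors - x_i                      # shift neighborhood to origin
    C = Z @ Z.T                              # local (K, K) Gram matrix
    C = C + reg * np.trace(C) * np.eye(len(C))  # regularize for stability
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                       # enforce Equation (3)
```

For a point lying midway between two neighbors, the recovered weights are (0.5, 0.5), as expected.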

Once the weights w(i) are determined for each point xi, the same weights are used to determine the projected vectors yi that minimize the global cost function:

$\mathcal{E}_2(\mathbf{Y}) = \sum_{i=1}^{N} \Big| \mathbf{y}_i - \sum_{j=1}^{K} w_j^{(i)}\, \mathbf{y}_{n_j^{(i)}} \Big|^2. \qquad (4)$

Note the symmetry inherent in Equations (2) and (4). Intuitively, the fact that the input data X and the projected data Y optimize these linear forms shows that the local neighborhoods of X and Y have similar properties.
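In the standard formulation of Roweis & Saul (2000), the minimizer of the global cost function is obtained from the eigenvectors of M = (I − W)ᵀ(I − W) with the smallest eigenvalues, where row i of W holds the weights w(i) in the columns of its neighbors; the bottom (constant) eigenvector is discarded. A dense-matrix sketch (real implementations exploit the sparsity of W):

```python
import numpy as np

def embed(W, d_out):
    """Minimize the global cost of Equation (4) for a given (N, N) weight
    matrix W whose rows sum to one.

    The eigenvector of M = (I - W)^T (I - W) with eigenvalue zero is the
    constant vector and is discarded; the next d_out eigenvectors form
    the embedded coordinates Y.
    """
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)                 # symmetric by construction
    vals, vecs = np.linalg.eigh(M)          # ascending eigenvalues
    return vecs[:, 1:d_out + 1]             # drop the constant mode
```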

Computationally, each of these steps can be implemented efficiently using optimized linear algebra methods (see, for example, Roweis & Saul 2000; de Ridder & Duin 2002). We discuss algorithmic considerations in Appendices B–D.

3.2. Dimensionality Reduction: A Simple Example

We provide a simple and classical example of the performance of the algorithm in Figure 1. The first plot shows a particularly simple nonlinear test set: the "S-curve," with N = 2000 data points in three dimensions uniformly selected from the two-dimensional bounded manifold described by

$\mathbf{x}(t, s) = \big(\sin t,\ s,\ \mathrm{sign}(t)\,[\cos t - 1]\big), \quad t \in [-3\pi/2,\, 3\pi/2],\ s \in [0, 2]. \qquad (5)$

A linear dimensionality reduction such as PCA would not recover the inherently nonlinear shape of the two-dimensional manifold: it would find three principal axes within the data. An ideal nonlinear reduction would, in effect, unwrap this manifold and map it to a flat plane that preserves local clustering, i.e., the colors would remain unmixed.

The second plot in Figure 1 shows the two-dimensional LLE projection applied to these data, with K = 15. It is clear that LLE, in this case, successfully recovers the desired two-dimensional embedding. Points of similar colors (which are clustered within the higher dimensional space) are clustered in the projected space. As discussed in Section 5.2, the effectiveness of this projection will depend on the size of the region that we consider local (i.e., the number of nearest neighbors) as well as the assumed dimensionality of the embedded manifold.
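This experiment is easy to reproduce with modern tools. The sketch below uses scikit-learn, which postdates the software described in this paper, and standard LLE rather than any robust variant:

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# Sample N = 2000 points from the S-curve manifold, as in Figure 1.
X, color = make_s_curve(n_samples=2000, random_state=0)

# Project from three dimensions to two with K = 15 nearest neighbors.
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2, random_state=0)
Y = lle.fit_transform(X)
# Plotting Y colored by `color` reproduces the unrolled manifold:
# neighborhoods in 3-space remain neighborhoods in the 2-D embedding.
```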

The above test data densely sample a simply connected manifold. In the case of sparsely sampled, or nonconnected manifolds (e.g., due to missing data and survey detection limits), LLE is less successful at recovering the lower dimensional projection. This issue can be addressed using various modifications to the algorithm (see, for example, Donoho & Grimes 2003; Chang & Yeung 2006; Zhang & Wang 2007), some of which are implemented in the publicly available source code for this work (see Appendix E).

4. AN APPLICATION OF LLE TO SPECTRAL CLASSIFICATION

As a concrete example of the utility of LLE for astronomical classification, we apply LLE to spectra taken from the SDSS DR7 data release (Abazajian et al. 2009). A description of the SDSS can be found in York et al. (2000) with the underlying imaging survey described by Gunn et al. (1998) and the photometric system by Fukugita et al. (1996). The subsample used in this analysis was selected using the technique described in Section 5.3.

This subsample comprises 8711 total spectra from DR7, with 6930 from the main galaxy sample (Strauss et al. 2002) and 1781 from the QSO sample (Schneider et al. 2008), with redshifts z < 0.36. All data have been shifted to a common rest frame, covering a spectral range of 3830 Å–9200 Å, resampled to 1000 logarithmically spaced wavelength bins, corrected for significant sky absorption, and normalized to a constant total flux. Emission-line and absorption-line equivalent widths have been extracted from the DR7 FITS headers. Using these data, we further subdivide these spectra based on the automated classifications of the SDSS spectroscopic pipeline. Of the objects with hydrogen emission line strengths greater than three times the spectral noise, 888 objects are classified as broad emission-line QSOs (where "broad" means a line width of >1200 km s−1), 893 are classified as narrow emission-line QSOs, and 6068 as emission-line galaxies (emission-line galaxies and narrow-line QSOs are distinguished using the [N ii]/Hα line-ratio diagnostic from Kewley et al. 2001). Of the remaining galaxies, 523 are classified as absorption-line galaxies (Balmer absorption strengths greater than 3σ) and 339 as quiescent galaxies (Balmer emission strengths less than 3σ).
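The preprocessing described above (rest-frame shift, logarithmic resampling, normalization to constant total flux) might be sketched as follows; the use of linear interpolation and unit-sum normalization are illustrative assumptions, not necessarily the choices of the actual pipeline:

```python
import numpy as np

def preprocess(wavelength, flux, z, n_bins=1000,
               lam_min=3830.0, lam_max=9200.0):
    """Shift a spectrum to the rest frame, resample onto n_bins
    logarithmically spaced wavelength bins spanning 3830-9200 Angstroms,
    and normalize to unit total flux."""
    rest_lam = wavelength / (1.0 + z)
    grid = np.logspace(np.log10(lam_min), np.log10(lam_max), n_bins)
    resampled = np.interp(grid, rest_lam, flux)   # linear interpolation
    return grid, resampled / resampled.sum()
```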

Figure 2 shows the LLE projection of the SDSS spectra from the original 1000 dimensional space to a three-dimensional subspace. To minimize the effect of outliers, we use a robust variant of LLE (see Section 5.1). We plot the position of all of the sources within this subspace and color code the points based on the above divisions. Red points are broad-line QSOs, orange points represent narrow-line QSOs, green points are emission-line galaxies, and light and dark blue points are galaxies with continuum emission and strong absorption line systems, respectively.

Figure 2. Robust LLE projection of the data subsample from 1000 down to three dimensions. Broad-line QSOs are identified by spectral classification within the SDSS pipeline, narrow-line QSOs and emission galaxies are distinguished using the [N ii]/Hα line-ratio diagnostic from Kewley et al. (2001), and emission or absorption refers to galaxies with line strengths at three times the continuum noise. The dashed boxes indicate the "broad-line QSO region" and the "emission galaxy region" discussed in Section 4.2.

The axes of an LLE projection are defined only up to a scaling and a global rotation. The interpretation of the projection comes, therefore, from the relationship between the points rather than from the axes themselves. In Figure 2, it is clear that there are a few well-defined branches in the resulting diagram that separate the sources based on their spectral properties. QSOs (broad and narrow line), emission-line galaxies, and continuum galaxies separate into three basic clusters or classifications. Within these clusters, broad- and narrow-line QSOs occupy separate parts of the subspace; emission-line galaxies are distinct from absorption-line systems, while the strong absorption-line systems closely track the subspace of the continuum galaxies. The positions of galaxies with low emission- and absorption-line strengths converge to a common region of the three-dimensional space, which is to be expected, as galaxies do not represent a series of discrete classes but rather a continuum of properties.

For the remainder of the paper, all analyses will be done based on the two-dimensional manifold defined by the first and third components (i.e., the larger plot in Figure 2). This subprojection gives maximal discrimination between the QSO and emission galaxy branches (for further discussion of this choice of projection, see Section 5.2).

4.1. Interpretation of the LLE Tracks

The distance of a galaxy along any one of the branches within Figure 2 correlates well with its physical characteristics (as reflected in its spectrum). Figures 3–7 show progressions along various regions, along with the mean spectrum of each labeled region. When calculating the mean, we exclude excessively noisy objects, identified as those with flux F(λ) satisfying $\int |F(\lambda)|\,d\lambda > 1.5 \int F(\lambda)\,d\lambda$. Because the spectra are preprocessed with a normalization, the few spectra satisfying this inequality would otherwise contribute disproportionately to the mean spectrum in each region.
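This noise criterion translates directly into code; the sketch below approximates the integrals with a simple quadrature over the wavelength grid:

```python
import numpy as np

def is_noisy(wavelength, flux):
    """Flag a spectrum whose integrated absolute flux exceeds 1.5 times
    its integrated flux -- the criterion used above to exclude noisy
    objects from the mean spectra."""
    dlam = np.gradient(wavelength)           # local bin widths
    return np.sum(np.abs(flux) * dlam) > 1.5 * np.sum(flux * dlam)
```

A strictly positive spectrum can never be flagged, since its absolute and signed integrals coincide; a spectrum oscillating around zero is flagged.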

Figure 3. Progression of emission galaxy spectra. Each spectrum represents the mean of the spectra within the labeled region. Gray shading indicates the standard deviation of the mean. See caption of Figure 2 for the description of colors.

Figure 3 shows the progression of spectra along the emission galaxy track. To do this, we calculate the mean spectrum for galaxies contained within the regions associated with the bounding boxes E1–E5. This sequence tracks not only increasing line strengths but also changing line ratios (see Figure 12 later).

Figure 4 shows the progression of spectra along the broad-line QSO track. We find that, as we move from Q1 to Q5, the spectral type of the QSO evolves from a Seyfert 1.9 to a classic broad-line QSO. As would be expected for such a transition, the N ii emission line strength decreases, while the Hβ line strength increases as the spectra become progressively dominated by the accretion disk emission. Associated with these emission-line properties, we find that the continuum slope of the QSO spectra is bluer for the Q5 class than for Q1. This not only supports the classification of the spectra from Seyfert 1.9 to broad-line active galactic nucleus (AGN), it also demonstrates how LLE jointly uses the information contained within the continuum and line emission.

Figure 4. Progression of broad-line QSO spectra. Each spectrum represents the mean of the spectra within the labeled region. Gray shading indicates the standard deviation of the mean. See caption of Figure 2 for the description of colors.

Figure 5 shows the progression of spectra along the narrow-line QSO track. As with the emission-line galaxy track, the narrow QSO track traces increasing emission from N1 to N5. N1–N3 have narrow Hβ and Hα, with line ratios that indicate power-law ionization, and are spectrally consistent with LINERs. N3–N5 have increasingly broad Hα, which suggests classification as Seyfert 1.9–2.0. Line-ratio diagnostics confirm this (see Figure 12 later).

Figure 5. Progression of narrow-line QSO spectra. Each spectrum represents the mean of the spectra within the labeled region. Gray shading indicates the standard deviation of the mean. See caption of Figure 2 for the description of colors.

In Figure 6, we consider the progression of spectra across the different spectral groupings (from broad-line QSOs to galaxies with narrow emission lines) as opposed to along an individual track. The X3 sources are, as before, broad-line QSOs with strong Hβ emission. As we transition to X2, the width of the emission line decreases until X2 has the spectral characteristics of a Seyfert type 1.5. By X1, the spectral properties are consistent with an H ii region or narrow emission-line galaxy (for wavelengths <5500 Å) while showing evidence for an AGN component in the broad-line Hα. This one-parameter sequence from broad-line QSO to H ii region spectra (including regions where the populations are mixed) is described very naturally by this LLE projection.

Figure 6. Progression from broad to narrow QSO spectra. Each spectrum represents the mean of the spectra within the labeled region. Gray shading indicates the standard deviation of the mean. See caption of Figure 2 for the description of colors.

The last subsample we consider is the quiescent galaxies in Figure 7. The transition from G1 to G5 tracks the same spectral progression from quiescence to active star formation as found using the PCA analysis of Yip et al. (2004a). In terms of spectral properties, galaxies classified as G1 correspond to the E/S0 spectral type from Kennicutt (1998) and galaxies in G5 correspond to the Sbc/Sc Kennicutt spectrum. Along this progression, the emission-line strengths of Hα and N ii increase while the strength of the Mg iii absorption line decreases. The depth of the break at 4000 Å also decreases with increasing index number. Comparisons with the results of Yip et al. (2004a) indicate that this trajectory maps directly to the star formation rate within the galaxy (see Figure 11 later).

Figure 7. Progression of quiescent galaxy spectra. Each spectrum represents the mean of the spectra within the labeled region. Gray shading indicates the standard deviation of the mean. See caption of Figure 2 for the description of colors.

From these initial comparisons, it is clear that by projecting onto a three-dimensional space using LLE, we can differentiate between line-emitting galaxies and those dominated by continuum emission. We can also distinguish between sources as a function of their emission-line characteristics. As such, LLE encapsulates both the line emission and continuum emission properties in its classification scheme. That LLE can express the variation in continuum properties of galaxies when projecting from 1000 to 3 dimensions is not altogether surprising as the PCA analyses of Connolly et al. (1995), Yip et al. (2004a), and others, have demonstrated that quiescent galaxies occupy only a small region of the available parameter space and that the dominant direction in this space is governed by the star formation rate. For emission-line galaxies, and in particular QSOs, standard PCA approaches require between 10 and 50 components to express the variation in line strengths as a function of galaxy type. This is due to the inherently nonlinear nature of the reconstruction of emission lines.

The fact that LLE can be used to separate both broad and narrow emission lines into separate groups or families, that the transition between narrow and broad line spectra is smooth, and that the position of a galaxy within each of these groups is dependent on the individual line strengths demonstrates the ability of LLE to concisely represent inherently nonlinear systems in an intuitive manner.

4.2. Separation of Spectral Types

LLE is useful in that it maps a high-dimensional data set into a lower dimensional parameter space. Within this projected space, many familiar classification schemes can be applied (see Hastie et al. 2001 for a thorough introduction to the topic). In the case of SDSS spectra, LLE succeeds in mapping physically similar spectra to the same region of parameter space. More importantly, physically different spectra are mapped to distinct, well-separated regions, leading to the possibility of a very intuitive classification scheme.

Because of this, we can quickly locate objects that are misclassified by both the SDSS pipeline and the line-diagnostic plots. Figure 8 displays a few examples of these. M1 is classified by the SDSS pipeline as a broad-line QSO, but falls on the emission-line galaxy track. This misclassification is likely due to contamination of the Hα line by the neighboring nitrogen features, leading to an erroneously large line width. M2 is not recognized as an emitting object, because the Hα feature is missing. In fact, the entire narrow track on which M2 lies consists of objects that look like emission galaxies/type-2 Seyferts, with the Hα line missing due to saturation of the lines, coincident sky lines, and bad pixels. LLE isolates these points on their own track of outliers. M3 and M4 fall on the broad-line QSO track, but are not recognized as broad emitters by the SDSS pipeline, probably due to high background noise. The LLE projection correctly groups these with other strong emitters. Because the LLE projection is entirely neighborhood-based, any given point will have similar properties to nearby points in the projection space.

Figure 8. Sample of objects with discrepant classifications. See caption of Figure 2 for the description of colors.

Next, we compare the classification based on LLE with the classification by more traditional methods, e.g., emission line widths, line strengths, and line ratios. This requires a choice of classification technique for points in the LLE projection space. For illustrative purposes, we create a simple classification scheme and define two well-separated regions that correspond to physically different objects. The "broad-line QSO region" is defined as all points on the broad-line track with LLE component three greater than 0.02. The "emission galaxy region" is defined as all points on the emission galaxy track with LLE component one greater than 0.015 (see Figure 2).
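A sketch of this illustrative two-region labeling, assuming an array Y whose columns hold LLE components one and three (the axes of Figure 2); for brevity, the additional requirement that a point lie on the corresponding track is omitted:

```python
import numpy as np

def classify_regions(Y, qso_cut=0.02, emission_cut=0.015):
    """Assign the two illustrative classes of this section.

    Y[:, 0] holds LLE component one and Y[:, 1] holds component three.
    The 0.02 and 0.015 thresholds are the cuts quoted in the text; in the
    real data the two tracks are well separated, so the order of the
    assignments below does not matter in practice.
    """
    labels = np.full(len(Y), "unclassified", dtype=object)
    labels[Y[:, 1] > qso_cut] = "broad-line QSO region"
    labels[Y[:, 0] > emission_cut] = "emission galaxy region"
    return labels
```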

Out of the 624 galaxies in the broad-line QSO region, 31 were classified by SDSS as having narrow line emission, and 11 were classified as quiescent. On visual inspection, 29 of the 31 and 1 of the 11 were found to show broad emission features, with the rest of the spectra too noisy to classify accurately. The SDSS pipeline, therefore, misclassifies about 5% of its broad-line sources as narrow line.

Out of the 675 galaxies in the emission galaxy region, 45 have line ratios on the narrow-line QSO side of the Kewley et al. (2001) cutoff, two have Hα emission broader than 1200 km s−1, and four have prominent emission features not detected by the SDSS pipeline.

Combining the above tallies, we can estimate how successfully LLE clusters physically similar spectra. From a subsample that by design contains a high fraction of abnormal spectra (see Section 5.3), less than one percent of the spectra were physically dissimilar from the average population in each region. These errors are primarily due to very noisy spectra. For other cases, e.g., the 45 spectra where classifications based on LLE and Kewley et al. (2001) line diagnostics disagree, it is not immediately clear which classification is correct, if any. The majority of these sources are transitional, falling near the dividing boundary of the line-ratio diagram. This dividing line is an arbitrary analytic separation between two regions of a certain subset of the parameter space. We note, however, that the LLE-based classification requires no a priori training: we neither preselect sources based on known properties nor fine-tune the parametrization of the classification, unlike the optimized SDSS pipelines.

Figure 9 shows a sample of some significant outliers identified with LLE. O1–O3 are outliers due to their high noise levels. O4 and O6 appear to be sky spectra that were misclassified based on emission features. O5 appears to be a QSO with its Hα line missing, possibly due to atmospheric absorption. O7 has an artificial feature likely due to faulty sky-noise subtraction. In all of these cases, the spectrum's status as an outlier based on LLE reflects its unusual spectral characteristics.

Figure 9. Sample of outliers from the LLE projection. See caption of Figure 2 for the description of colors.

5. CONSIDERATIONS WHEN APPLYING LLE

5.1. Robust LLE: Dealing with Noisy and Missing Data

As with any classification scheme, it is important that we can account for missing and noisy data in a robust manner. Because LLE is a locally weighted projection, outliers (even outliers in a single pixel) can distort the definition of the local neighborhood and bias any resulting classification scheme.

In practice it is difficult to treat this problem completely as it essentially replaces the least squares analysis of each neighborhood with a total least squares/errors in variables analysis for each of the N points. These cannot be solved in a closed form, and require a computationally expensive iterative process (see van Huffel & Vandewalle 1991 for a detailed exploration of the total least squares problem).

We can, however, follow the methodology of Chang & Yeung (2006) and address the problem of outliers using a robust LLE (RLLE) technique. In RLLE, a reliability score for each data point is determined based on its efficacy in predicting the location of the points in its neighborhood. Outliers will have two general properties: first, they will not be members of a large number of local neighborhoods; second, within the neighborhoods in which they appear, they will be far from the best-fit hyperplane. With this in mind, reliability scores are calculated using the following method:

  • 1.  
    For each point, an iteratively reweighted least squares analysis (Holland & Welsch 1977) is performed to determine optimal weights for a weighted PCA reconstruction of the point from its neighbors. Schematically, this is akin to finding the best-fit hyperplane, weighting points based on their distance from that hyperplane, and repeating until the process converges. As a result of this process, each point in a local neighborhood is assigned a local weight that quantifies its contribution to the local tangent space.
  • 2.  
    Each point may be a member of many neighborhoods. The reliability score for a point is determined by taking the sum of its local weights from each neighborhood of which it is a member.

Thus, a higher score is given to points that are members of many neighborhoods and to points that contribute more to the reconstruction within each neighborhood. Outliers can then be identified by their low reliability scores.
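
The reliability-score calculation above can be sketched in a few lines of numpy. This is an illustrative simplification rather than the full Chang & Yeung (2006) algorithm: the function name, the Cauchy-type down-weighting, and the median-based scale estimate in the IRLS loop below are assumptions chosen for brevity.

```python
import numpy as np

def reliability_scores(X, k=10, n_dims=2, n_iter=10):
    """Simplified RLLE-style reliability scores. For each k-neighborhood
    an iteratively reweighted PCA plane is fit; each member's final weight
    measures its contribution to the local tangent space. A point's
    reliability is the sum of its weights over all neighborhoods
    containing it, so points in few neighborhoods, or far from the local
    best-fit hyperplanes, score low."""
    N = X.shape[0]
    # brute-force k-nearest-neighbor search (excluding self)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    scores = np.zeros(N)
    for i in range(N):
        P = X[nbrs[i]]                      # points in this neighborhood
        w = np.ones(k)
        for _ in range(n_iter):
            mu = (w[:, None] * P).sum(0) / w.sum()
            C = (w[:, None] * (P - mu)).T @ (P - mu)
            evals, evecs = np.linalg.eigh(C)
            V = evecs[:, -n_dims:]          # weighted principal subspace
            resid = (P - mu) - (P - mu) @ V @ V.T
            r = np.sqrt((resid ** 2).sum(1))
            s = np.median(r) + 1e-12        # robust scale estimate
            w = 1.0 / (1.0 + (r / s) ** 2)  # down-weight distant points
        scores[nbrs[i]] += w                # accumulate local weights
    return scores
```

A gross outlier never appears in other points' neighborhoods and so accumulates no weight, giving it the minimal score.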

In Figure 10, we compare the performance of LLE and RLLE on the S-curve data set from Section 3.2. In this case, we assign 15% of the points within the data set to be randomly selected from the full parameter space. The lower left panel shows the application of the standard LLE approach. As is clear from this figure, outliers can change the local weights of the nearest neighbors and, thereby, distort the resulting lower dimensional manifold, causing the points of different colors to be heavily mixed. Effectively, the random noise within the point distribution fills in enough of the three-dimensional space that LLE cannot extract the lower dimensional manifold. For RLLE, as long as the number of random points is small relative to the number of nearest neighbors, we can isolate and exclude the contaminating points from the local weights. Figure 10 shows that we can recover the underlying two-dimensional manifold in the presence of noise and outliers through this approach.

Figure 10. Top: N = 2000 points randomly sampled from a bounded two-manifold embedded in three-space, with 15% of the points random outliers sampled from the entire parameter space. Bottom left: the two-dimensional LLE projection of this manifold, with k = 15. Note the mixing of the colors: the outliers obscure the two-dimensional manifold to the point that LLE cannot recover it. Bottom right: the two-dimensional RLLE projection of this manifold, with k = 15, constructed using 80% of the points with the highest reliability scores. RLLE recovers the desired projection.

5.2. Choosing Optimal Parameters

There are three free parameters that must be chosen for LLE: K, the number of nearest neighbors, Dout, the output dimension, and r, the regularization parameter. K and Dout will be discussed here, while r, which depends on details of the implementation, is addressed in Appendix B.

Often in PCA, instead of specifying an output dimension Dout, one chooses to specify a percentage of the sample variance that should be preserved. In LLE, dimensionality can be determined in an analogous way for each local neighborhood (de Ridder & Duin 2002). In the computation of an LLE projection, a variant of the covariance matrix is computed for each local neighborhood. This matrix differs from the covariance matrix used in PCA by only a translation and scaling, neither of which affects the relative magnitudes of the eigenvalues. Taking the lead from PCA, we can specify a percentage of the local variance that we would like to conserve, and thereby find the minimum number of dimensions required locally, simply by doing an eigenanalysis of the local covariance matrix. Taking the mean of the N local dimensionality determinations gives an estimate of the optimal value of Dout for the global projection. For our SDSS analysis, the first three components contain an average of 75% of the local variance. The first ten components contain 92% of the local variance, significantly less than the 99.9% of total variance within the first ten PCA eigenspectra of Yip et al. (2004a). This discrepancy is likely due to noise and intrinsic variation at the local level, which contributes more to the small, local neighborhoods that LLE probes than to the overall variance that PCA probes. Also, our sampling technique (Section 5.3) chooses sources based on dissimilarity, leading to more intrinsic variation in the subsample.
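
The local dimensionality estimate described above can be sketched with numpy. This is an illustrative reimplementation of the de Ridder & Duin (2002) idea, with a brute-force neighbor search and a hypothetical function name; it is not the code used in this work.

```python
import numpy as np

def local_dimensionality(X, k=10, var_frac=0.9):
    """For each k-neighborhood, count the eigenvalues of the local
    covariance matrix needed to retain a fraction `var_frac` of the
    local variance, then average over all N points to estimate the
    optimal global output dimension."""
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    dims = np.empty(N, dtype=int)
    for i in range(N):
        G = X[nbrs[i]] - X[i]                        # neighbors relative to x_i
        evals = np.linalg.eigvalsh(G @ G.T)[::-1]    # descending eigenvalues
        frac = np.cumsum(evals) / evals.sum()        # cumulative local variance
        dims[i] = np.searchsorted(frac, var_frac) + 1
    return dims.mean()
```

For data lying on a two-dimensional surface embedded in a higher-dimensional space, the mean local dimensionality comes out close to two.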

Choosing the number of nearest neighbors is more difficult. In fact, K need not be constant across all neighborhoods. It could be specified based on local properties of the manifold or the density of sources at a given point. The literature on LLE offers no consensus on how to choose the optimal value of K. Too large a value, and the manifold is oversampled, destroying the assumption of local linearity, and leading to an inherently larger dimensional space. Too low, and the manifold is undersampled, which leads to a loss of information on the relationships between neighborhoods and more sensitivity to the presence of outliers and noise. In this work, we take an empirical approach and try to estimate a series of heuristics that can be applied to the spectral data. We take a range of values for K (10 < K < 30) and determine our ability to maximally distinguish between galaxy populations in the resulting LLE classifications. In the case of galaxy spectra, we choose the K that maximizes the opening angle between the QSO and emission-line galaxy sequences, measured between any two of the three leading projection dimensions. We find that a value of K = 23 provides the maximal discrimination but that values of K from 20 to 27 are comparable.

It should be noted that the final LLE projection is unique only up to a global translation and rotation. Because of this, the exact nature of the projection is highly dependent on the subset of data chosen, as well as the above-mentioned input parameters. What is invariant, however, is the relationship between each point and its neighbors. In particular, an outlying point in one projection will be an outlier in another projection. This is what makes LLE useful: it allows one to easily visualize the relationships between points in a high-dimensional parameter space.

5.3. Fast LLE: Efficient Sampling Strategies

The LLE algorithm, even when optimized, is computationally expensive. The main bottleneck is the neighbor search, which, for a single point, is $\mathcal {O}[N D_{{\rm in}}]$. Using brute force for all N points is, at worst, $\mathcal {O}[N^2 D_{{\rm in}}]$. While more advanced tree-based algorithms can significantly improve on this, it remains a computational hurdle, especially for high-dimensional data sets (see Berchtold et al. 1998 for a review of fast nearest-neighbor algorithms).
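
For modest input dimensions, a spatial tree reduces the per-query cost dramatically. A brief sketch using scipy's cKDTree, shown here purely for illustration (one of several possible tree structures; as noted above, tree methods degrade as the dimensionality grows):

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once, then answer each neighbor query in roughly
# logarithmic time rather than scanning all N points.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
tree = cKDTree(X)
# 15 neighbors of the first five points; since the query points are in
# the data, each query returns the point itself first, at distance zero.
dists, idx = tree.query(X[:5], k=16)
```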

The second bottleneck is the computation of the global projection, which involves determining the bottom Dout + 1 eigenvectors of an [N × N] sparse matrix (see Appendix C). By using iterative techniques such as Arnoldi/Lanczos decomposition, it is possible to determine these eigenvectors without performing a full matrix diagonalization (see Wright & Trefethen 2001 for one well-tested optimized algorithm).

Even fully optimized, it is not feasible to learn the LLE projection from the full SDSS spectral sample, which would amount to hundreds of thousands of points in a 4000-dimensional data space. What is more, a random subsample of these data will be too sparsely populated in some regions of space (e.g., for strong line emitters that account for <0.01% of a random sample of spectra from the SDSS), and too densely populated in others (e.g., around L* galaxies). In this case, LLE will not sufficiently probe the entire structure of the manifold. For this reason, a strategy is needed to define a subsample that best spans the entire parameter space, without overpopulating the most probable neighborhoods.

To do this, we follow de Ridder & Duin (2002) and appeal to the fundamental assumption of local linearity on the manifold. For a given point xi, the first step of LLE is to compute the weights required to construct it from its K neighbors. The procedure outlined in Section 5.2 for determining the intrinsic dimensionality can be used to find the local reconstruction error: i.e., the percentage of total variance not captured in the Dout-dimensional projection. Points for which this local reconstruction error is small add very little information to the overall projection. They can be discarded without much effect. Points for which this reconstruction error is large are not well approximated by a linear combination of neighbors, and, therefore, contain information not present in the local best-fit hyperplane. If they are thrown out, the projected manifold will change significantly.

With this in mind, the following strategy was used to reduce the ∼170,000 unique spectra to a manageable 9000:

  • 1.  
    Divide the sample into groups of 2000 spectra each.
  • 2.  
    For each group:
    • (a)  
      find the eigenvalues of the local covariance matrix of each point and create a cumulative distribution based on the local reconstruction error; and
    • (b)  
      randomly select 20% of the points based on this cumulative distribution, such that those with the largest reconstruction error are preferentially selected.
  • 3.  
    Merge every five subsamples into a new sample of 2000 spectra.
  • 4.  
    Repeat this procedure until the total subsample is of the desired size.
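
A minimal numpy sketch of one selection pass (steps 2(a) and 2(b) above), assuming the local reconstruction error is taken as the fraction of neighborhood variance outside the leading n_dims eigenvectors; the group splitting and merging of steps 1, 3, and 4 are omitted, and the function name is hypothetical.

```python
import numpy as np

def error_weighted_subsample(X, n_keep, k=10, n_dims=2, rng=None):
    """Draw n_keep points with probability proportional to their local
    reconstruction error, so that points poorly described by their
    neighborhood hyperplane are preferentially retained."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]

    err = np.empty(N)
    for i in range(N):
        G = X[nbrs[i]] - X[i]                        # neighbors relative to x_i
        evals = np.linalg.eigvalsh(G @ G.T)[::-1]    # descending eigenvalues
        err[i] = evals[n_dims:].sum() / evals.sum()  # unexplained variance
    err = np.clip(err, 0.0, None) + 1e-12            # guard flat neighborhoods

    return rng.choice(N, size=n_keep, replace=False, p=err / err.sum())
```

Points that sit off the dominant local hyperplane carry nearly all of the selection probability, which is the desired behavior.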

5.4. Applying LLE to New Data Sets

Once the projection is determined for a given training sample, it is straightforward to determine the projection of new data onto the training surface. For each new point, the K nearest neighbors are determined within the training data. Weights are determined by minimizing the form of Equation (2). These weights are then used within Equation (4) to determine the projection of the point. Note that this second step amounts to nothing more than a weighted sum of the projected neighbors. This is a key point in the application of LLE: once a projection is defined for a representative training sample, the computational cost of the classification of a set of test-points is largely due to a K-nearest-neighbor search within the training set. This can be done optimally in $\mathcal {O}[D_{{\rm in}}\log N]$.

Reconstructing unknown data from projected coordinates requires the reverse of the above procedure. Weights are determined by minimizing the corollary of Equation (2) within the projected space, and these weights are used to reconstruct the point in the original data space.
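
The forward projection of a new point can be sketched as follows. The function name and the trace-based regularization (following Appendix B) are illustrative choices, not the DimReduce interface.

```python
import numpy as np

def project_new_point(x, X_train, Y_train, k=10, reg=1e-3):
    """Project a new point onto an existing LLE embedding: find its k
    nearest training points, solve the constrained least-squares weight
    problem, then take the same weighted sum of the neighbors' projected
    coordinates."""
    d2 = ((X_train - x) ** 2).sum(1)
    nbrs = np.argsort(d2)[:k]
    G = X_train[nbrs] - x                   # neighbors relative to x
    C = G @ G.T                             # local covariance matrix
    C += reg * np.trace(C) * np.eye(k)      # regularize (cf. Appendix B)
    w = np.linalg.solve(C, np.ones(k))      # solve C w = 1 ...
    w /= w.sum()                            # ... then rescale to sum to 1
    return w @ Y_train[nbrs]
```

As a sanity check, if the training embedding happens to be an exact linear map of the data, the projected point lands (to within the regularization bias) where that linear map would place it.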

6. DISCUSSION

As shown in Section 4, LLE provides an efficient and automated way of classifying high-dimensional data (in our case galaxy spectra) that can account for inherent nonlinearities within these data. For spectroscopic observations, LLE can jointly classify spectra based on their emission line and continuum properties, which makes it well suited to the task of automated classification for large spectroscopic surveys. In the following section, we compare the output from the LLE classifications with those from standard spectral classification techniques and discuss how LLE might be adopted for the next generation of spectroscopic surveys.

Figure 11 shows the PCA mixing-angle plot using the publicly available Yip et al. (2004a) SDSS galaxy eigenspectra (see Figure 8 in that work). The average spectral sequences from Figures 3–7 are overplotted for comparison. As discussed by Yip et al. (2004a), ϕKL correlates with a spectral type (see also Connolly et al. 1995) or star formation rate. It is, however, not well suited to distinguishing between various types of line emission: broad-line QSOs, narrow-line QSOs, and emission-line galaxies overlap the same region of the projection space. This is a reflection of the fact that PCA is sensitive to continuum information rather than to line emission. Accounting for line emission in a PCA framework requires extracting the emission lines and applying PCA to the line equivalent widths (Z. Györy et al. 2009, in preparation).

Figure 11. Principal component mixing-angle diagram for the SDSS spectra subsample. The labeled tracks correspond to average spectra plotted in Figures 3–7. See caption of Figure 2 for the description of colors.

Figure 12 shows the typical line ratios used to distinguish between star-forming galaxies and narrow-line QSOs. The dividing lines between the populations are due to Kewley et al. (2001). The tracks on the plot correspond to the average spectra in Figures 3 and 5. (Note that line-ratio diagrams do not meaningfully classify broad-line QSOs, which uniformly occupy the entire space of the line-ratio plots.) Kewley et al. (2001) use stellar synthesis models to show that emission characteristics of the objects below the dividing line can be attributed to ionization due to young stellar populations, with the ionization parameter q increasing toward the upper left. This gives a physical interpretation of the emission galaxy (E) sequence of Figure 3: increasing star formation from E1 to E5.

Figure 12. Line-ratio diagram of the objects in the subsample that display 3σ emission with a width <1200 km s−1. Points above the cutoff are considered narrow-line QSOs, while points below the cutoff are considered emission-line galaxies. Note that broad-line QSOs are not meaningfully classified by these diagrams and are omitted. Lines show the boundaries between narrow-line QSOs and star-forming galaxies given by Kewley et al. (2001). The tracks labeled E and N are the average spectra from Figures 3 and 5, respectively.

The region above the dividing line, therefore, is due to ionization that cannot be attributed to star formation. Ho et al. (1997) show empirically that this region can be subdivided into transitional objects near the dividing line: LINERs that lie near N1–N2 and Seyfert-2 galaxies that lie near N3–N5. The narrow-line QSO (N) sequence of Figure 5 traces this transition.

The above discussion points to the usefulness of LLE for automatic classification, without requiring the extraction of line parameters from the spectroscopic data. As with most classification schemes, the bulk of the work happens in the training of the LLE. Once the projection parameters (Section 5.2) and training subsample (Section 5.3) are chosen, the resulting projection provides the basis for a unique, repeatable classification scheme, with computational cost dominated by a K-nearest-neighbor search within the training sample (Section 5.4). As such, LLE can provide an accurate initial classification and anomaly detection step for any automated spectroscopic pipeline, requiring only the extraction of spectroscopic redshifts.

LLE is not limited to spectroscopic information. In principle, LLE can be used to visualize any type of data, even nonhomogeneous data sets, e.g., a combination of fluxes, redshifts, and angular size. Care must be taken, though, to make sure that nonhomogeneous dimensions contribute equally to the nearest-neighbor search. This can be assured through a modification of the distance metric for the nearest-neighbor search such that the distance between point x and point y is

$D(\mathbf{x}, \mathbf{y}) = \bigg[ \sum_{i=1}^{D_{\rm in}} w_i^2 \, (x_i - y_i)^2 \bigg]^{1/2},$ (6)

where wi is a weight associated with measurement dimension i. For example, if the measurements in question are fluxes that range from 0 to 1000 and redshifts that range from 0 to 1, wi could be set to reflect the parameter range: wi = [max(xi) − min(xi)]−1. This normalization of the data prior to the LLE step is common to any dimensionality reduction technique, including PCA.
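
A minimal sketch of this range-based weighting, assuming the weighted Euclidean metric above with each wi set from the observed range of dimension i; the function name is illustrative.

```python
import numpy as np

def normalized_distances(X):
    """Range-normalize heterogeneous measurement dimensions before the
    nearest-neighbor search: each dimension i gets weight
    w_i = 1 / (max_i - min_i), so fluxes spanning 0-1000 and redshifts
    spanning 0-1 contribute comparably to the distance."""
    w = 1.0 / (X.max(0) - X.min(0))
    Xw = X * w                               # apply per-dimension weights
    d2 = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2)                       # full pairwise distance matrix
```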

The computational cost of learning the LLE required that we develop a subsampling technique that preserves the complexity of the underlying data (i.e., as opposed to a random sampling that would be dominated by the typical sources within a population). Neighborhood-based subsampling is important in two key ways: it yields a subsample that fills the entire observational parameter space, and it samples the data uniformly throughout that space. Furthermore, the local nature of the process means that subtle nonlinear effects are not obscured. This subsampling technique, described in Section 5.3, is broadly applicable regardless of the classification scheme being employed. It could be applied to any problem that requires sampling of a large data set, such as spectroscopic follow-up of optical or radio-selected galaxies or spectroscopic observations of galaxies for training galaxy photometric redshifts.

7. CONCLUSIONS

The fast classification of large data sets is an increasingly pressing problem in astronomy. With this paper, we show LLE to be a powerful tool in the classification of SDSS spectra that is sensitive to both low-frequency and high-frequency information. As such, it is an improvement over standard techniques such as PCA/KL decomposition and line-ratio diagnostics. LLE is still in its infancy, and more understanding is needed to take into account measurement uncertainties, missing data, and undersampled manifolds. We discuss the limitations of LLE due to the computational cost of operating on large data sets and how these can be addressed through intelligent sampling of the training data. Our sampling technique is based on local information content and yields a training subsample that preserves the nonlinear traits of the sample as a whole. As a dimensionality reduction technique, LLE is both fast and accurate, and sensitive to all the information contained in measurement data. Because of this, it has the potential to become a useful step in the efficient classification of large astronomical data sets.

LLE software used in this paper is publicly available at http://ssg.astro.washington.edu/software (see Appendix E).

We thank S. Anderson, S. Krughoff, A. Becker, and T. Budavári for helpful comments and conversations. We also thank our referee, C. Bailer-Jones, for several helpful suggestions. A. J. C. is partially supported through NSF award 0851007. J. T. V. is partially supported through NASA award NNX07AH07G and NSF award 0851007.

APPENDIX A: NOTATIONAL CONVENTIONS

Throughout this paper, matrices are denoted by bold capital letters (X, Y, etc.), vectors are given by bold lowercase letters (x, y, etc.), while scalar values are in normal-face type (X, y, λ, etc.). Subscripts reference the individual elements of matrices, such that Xij is the element in the ith row and jth column of matrix X. The ith column vector of matrix X is given by xi. Note the potential for confusion with this convention: (xj)i = Xij. To limit this source of confusion, individual elements of a matrix will always be referenced via the latter formulation. I is the square identity matrix, such that Iij = δij, where δij is the Kronecker delta function. 0 is the zero matrix, such that 0ij = 0 for all i and j. The sizes of these two matrices should be clear from the context in which they are used. The derivations in the following sections are based on those found in Roweis & Saul (2000) and de Ridder & Duin (2002), with the addition of a few clarifying details.

APPENDIX B: DETERMINING THE OPTIMAL WEIGHTS

The optimal weights w(i) are determined by minimizing the reconstruction error given in Equation (2):

$\mathcal{E}^{(i)}\big(\mathbf{w}^{(i)}\big) = \bigg| \mathbf{x}_i - \sum_j w^{(i)}_j \mathbf{x}_j \bigg|^2.$ (B1)

This can be straightforwardly solved in a closed form, using the method of Lagrangian multipliers. Consider a neighborhood about the point xi, with weights wj that satisfy

$\sum_j w^{(i)}_j = 1.$ (B2)

This constraint is useful because it imposes translational invariance to the projection. Using this condition, the above sum can be rewritten as

$\mathcal{E}^{(i)}\big(\mathbf{w}^{(i)}\big) = \bigg| \sum_j w^{(i)}_j \big(\mathbf{x}_i - \mathbf{x}_j\big) \bigg|^2.$ (B3)

Now, defining the local covariance matrix

$C^{(i)}_{jk} = \big(\mathbf{x}_i - \mathbf{x}_j\big) \cdot \big(\mathbf{x}_i - \mathbf{x}_k\big),$ (B4)

we find

$\mathcal{E}^{(i)}\big(\mathbf{w}^{(i)}\big) = \sum_{j,k} w^{(i)}_j w^{(i)}_k C^{(i)}_{jk} = \mathbf{w}^{(i)T} \mathbf{C}^{(i)} \mathbf{w}^{(i)}.$ (B5)

We can minimize this by applying a Lagrange multiplier λi to enforce the constraint given in Equation (B2):

$\mathcal{E}^{(i)}_1\big(\mathbf{w}^{(i)}\big) = \mathbf{w}^{(i)T} \mathbf{C}^{(i)} \mathbf{w}^{(i)} - \lambda_i \big(\mathbf{\vec{1}}^T \mathbf{w}^{(i)} - 1\big).$ (B6)

Minimizing $\mathcal {E}_1^{(i)}(\mathbf {w}^{(i)})$ with respect to w(i) gives the condition of minimization:

$2\, \mathbf{C}^{(i)} \mathbf{w}^{(i)} = \lambda_i \mathbf{\vec{1}},$ (B7)

where $\mathbf {\vec{1}} = [1,1,...,1]^T$. Defining R(i) to be the inverse of the local covariance matrix C(i), we are left with

$\mathbf{w}^{(i)} = \frac{\lambda_i}{2}\, \mathbf{R}^{(i)} \mathbf{\vec{1}}.$ (B8)

Now imposing the constraint (B2) leads to

$\lambda_i = \frac{2}{\mathbf{\vec{1}}^T \mathbf{R}^{(i)} \mathbf{\vec{1}}}.$ (B9)

The evaluation of w(i) with Equations (B8) and (B9) requires explicit inversion of the local covariance matrix. In practice, a more efficient method of calculation is to solve a variation of Equation (B7):

$\mathbf{C}^{(i)} \mathbf{w}^{(i)} = \mathbf{\vec{1}},$ (B10)

and then rescale the resulting w(i) such that the constraint in Equation (B2) is satisfied.

In general, the local covariance matrix C(i) may be nearly singular. In this case, the weights w(i) are not well defined. de Ridder & Duin (2002) note that the matrix can be regularized based on its eigenvalues λ(i). Because, in the end, we are doing a global projection to a dimension Dout, we can follow probabilistic PCA and define

$r = \frac{1}{K - D_{\rm out}} \sum_{j = D_{\rm out}+1}^{K} \lambda^{(i)}_j,$ (B11)

i.e., the mean of the eigenvalues of the unused eigenvectors, and regularize our matrix as C(i) = C(i) + rI. In practice, this would require an explicit eigenvalue decomposition for each local covariance matrix, which is computationally very expensive. We can obtain a similar regularization parameter by noting that ∑λi = tr(C(i)). If our Dout-dimensional projection contains the majority of the variance (which, under the assumptions of the algorithm, it should), we see that r will be very small compared with the trace of C(i). For this paper, the value r = (10−3)tr(C(i)) was used.

APPENDIX C: DETERMINING THE OPTIMAL PROJECTION

The projection error is given by Equation (4):

$\Phi(\mathbf{Y}) = \sum_{i=1}^{N} \bigg| \mathbf{y}_i - \sum_j W_{ji}\, \mathbf{y}_j \bigg|^2.$ (C1)

To simplify this, we define a sparse weight matrix W with columns populated by the weight vectors w(i), such that

$W_{ji} = \begin{cases} w^{(i)}_j & \text{if } \mathbf{x}_j \text{ is a neighbor of } \mathbf{x}_i, \\ 0 & \text{otherwise.} \end{cases}$ (C2)

The projection error can then be rewritten in a more compact form:

$\Phi(\mathbf{Y}) = \big| \mathbf{Y} (\mathbf{I} - \mathbf{W}) \big|^2.$ (C3)

It can be shown that this is equivalent to the quantity

$\Phi(\mathbf{Y}) = \mathrm{tr}\big( \mathbf{Y} \mathbf{M} \mathbf{Y}^T \big),$ (C4)

where we have defined

$\mathbf{M} = (\mathbf{I} - \mathbf{W}) (\mathbf{I} - \mathbf{W})^T.$ (C5)

It is clear that Equation (C4) is trivially minimized by Y = 0. In order to find solutions of interest, we will impose the constraint

$\mathbf{Y} \mathbf{Y}^T = \mathbf{I}.$ (C6)

That is, the rows of Y are orthonormal. We can then use Lagrangian multipliers λi to constrain Equation (C4):

$\Phi_1(\mathbf{Y}) = \mathrm{tr}\big( \mathbf{Y} \mathbf{M} \mathbf{Y}^T \big) - \mathrm{tr}\big[ \boldsymbol{\Lambda} \big( \mathbf{Y} \mathbf{Y}^T - \mathbf{I} \big) \big],$ (C7)

where $\boldsymbol{\Lambda}$ is the diagonal matrix of Lagrangian multipliers, with Λii = λi. Minimizing this with respect to Y gives the condition

$\mathbf{M} \mathbf{Y}^T = \mathbf{Y}^T \boldsymbol{\Lambda}.$ (C8)

Denoting the rows of Y with the vectors y(i), this can be written in a more familiar form:

$\mathbf{M}\, \mathbf{y}^{(i)} = \lambda_i\, \mathbf{y}^{(i)}.$ (C9)

We see that λi and y(i) are simply the eigenvalues and eigenvectors of the symmetric real matrix M. Equation (C8) can be rewritten $\mathbf{Y}\mathbf{M}\mathbf{Y}^T = \boldsymbol{\Lambda}$, and our reconstruction error (C4) is simply

$\Phi = \mathrm{tr}(\boldsymbol{\Lambda}) = \sum_i \lambda_i.$ (C10)

Evidently, the desired projection is contained in the eigenvectors of M corresponding to the smallest eigenvalues. Because of the constraint in Equation (B2), it is clear that y(0) ∝ [1, 1, 1, ..., 1]T is a solution with eigenvalue λ0 = 0. This solution amounts to a translation in space and can be neglected.

Thus, we see that the Dout-dimensional projection Y of a data matrix X is given simply by the eigenvectors of the matrix M = (I − W)(I − W)T corresponding to the Dout + 1 smallest eigenvalues, with the lowest (constant) eigenvector neglected.

Fortunately, a full matrix diagonalization is not necessary to find only a few extreme eigenvectors. Iterative methods such as Arnoldi decomposition or Lanczos factorization can determine these eigenvectors rather efficiently. What is more, the matrix I − W is a sparse [N × N] matrix, with only K + 1 nonzero entries per column. Thus, the LLE algorithm has large potential for optimization in both computational and storage requirements.

APPENDIX D: LLE ALGORITHM

Here is a description of the basic LLE algorithm, taking into account the above discussion. Given N points in Din dimensions, to be projected to Dout dimensions, using the K nearest neighbors of each point:

  • 1.  
    For i = 1 → N
    • (a)  
      find the K nearest neighbors of point i;
    • (b)  
      construct the local covariance matrix, C(i) (Equation (B4));
    • (c)  
      regularize the local covariance matrix by adding a fraction of its trace to the diagonal: C(i) = C(i) + δ · tr(C(i))I. Choosing δ = 10−3 works well in practice; see Appendix B;
    • (d)  
      determine w(i) by solving Equation (B10); and
    • (e)  
      use w(i) to populate column i of the weight matrix W, with Wji = 0 if point j is not a neighbor of point i.
  • 2.  
    Construct the matrix M = (I − W)(I − W)T.
  • 3.  
    Determine the Dout + 1 eigenvectors of the matrix M corresponding to the smallest eigenvalues. The LLE projection is given by these, neglecting the lowest (which is merely a translation).
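
The steps above can be sketched compactly in numpy/scipy. This is an illustrative reimplementation, not the DimReduce code itself: it uses a brute-force neighbor search and scipy's Lanczos-based eigsh in shift-invert mode for step 3.

```python
import numpy as np
from scipy.sparse import lil_matrix, identity
from scipy.sparse.linalg import eigsh

def lle(X, n_neighbors=10, n_components=2, delta=1e-3):
    """Minimal LLE following the Appendix D algorithm."""
    N = X.shape[0]
    # step 1(a): brute-force neighbor search (excluding self)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]

    W = lil_matrix((N, N))
    for i in range(N):
        G = X[nbrs[i]] - X[i]
        C = G @ G.T                                      # step 1(b), Eq. (B4)
        C += delta * np.trace(C) * np.eye(n_neighbors)   # step 1(c)
        w = np.linalg.solve(C, np.ones(n_neighbors))     # step 1(d), Eq. (B10)
        w /= w.sum()                                     # rescale, Eq. (B2)
        for j, wj in zip(nbrs[i], w):                    # step 1(e)
            W[j, i] = wj
    W = W.tocsr()

    I = identity(N, format='csr')
    M = (I - W) @ (I - W).T                              # step 2
    # step 3: bottom n_components + 1 eigenvectors via shift-invert Lanczos
    evals, evecs = eigsh(M, k=n_components + 1, sigma=0.0)
    return evecs[:, 1:n_components + 1]                  # drop the constant mode
```

Since eigsh returns orthonormal eigenvectors, the embedding coordinates come out orthonormal; as noted in Section 5.2, the result is meaningful only up to such global transformations.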

APPENDIX E: DESCRIPTION OF DIMREDUCE CODE

DimReduce is a C++ code that performs LLE and some of its variants on large data sets. To handle the computationally intensive pieces of the algorithm, it includes a basic interface to the optimized FORTRAN routines in BLAS, LAPACK, and ARPACK.

The DimReduce interface works with data saved in the FITS format, a binary file format commonly used for astronomical images and data. Four variants of LLE are available:

LLE: This is the basic LLE algorithm. The user must provide input data, the number of nearest neighbors, and either the output dimension or a desired variance to preserve within each neighborhood.

HLLE: This is a variant of LLE that uses a Hessian estimator within each neighborhood that better preserves angles in each neighborhood. The algorithm is based on Donoho & Grimes (2003). Inputs are similar to LLE, with the added restriction that K must be larger than Din, the input dimension.

RLLE1: This is a robust variant of LLE, based on an algorithm outlined by Chang & Yeung (2006). To detect outliers, an iteratively reweighted PCA is performed on each neighborhood. A reliability score is assigned to each point based on the number of neighborhoods in which it appears, as well as its iteratively determined contribution within each neighborhood. The user provides inputs similar to those of LLE, as well as a reliability cutoff: a number between zero and one that represents the fraction of points believed to be outliers.

RLLE2: This is another robust variant of LLE outlined by Chang & Yeung (2006), where outliers are handled on a neighborhood by neighborhood basis. Reliability scores are computed as described above; within each neighborhood, the K + R nearest neighbors are found, and the R neighbors with the lowest reliability scores are discarded. The user provides inputs similar to those of LLE, as well as the number R of excess neighbors to find within each neighborhood.

LLEsig: This is a utility routine that can be used in the data selection procedure described in Section 5.3. Given an input matrix, a number of nearest neighbors, and an output dimension, it computes the local linear reconstruction error for each point.

In addition, a python utility is provided to plot and examine the results, using the open source packages scipy and matplotlib. All of the source code is available on the Web, along with test data and example scripts.

There are a few possible optimizations to DimReduce that have not yet been implemented. First of all, the sparse weight matrix is presently stored in a dense format. Storage requirements and execution time could be greatly reduced by taking advantage of its sparsity. Secondly, the K-nearest-neighbor search is presently performed using a nonoptimized brute-force algorithm. The use of a ball-tree, cover-tree, or similar algorithm would greatly speed up the performance, especially when projecting new test data onto a training surface. Thirdly, the entire package could be fairly easily restructured to take advantage of parallel computing capabilities. Parallelized versions of BLAS, LAPACK, and ARPACK are all publicly available, and the computation of nearest neighbors and the construction of the weight matrix lend themselves easily to parallelization. The implementation of these improvements, along with the use of the data sampling strategies outlined in Section 5.3, gives LLE the possibility of being used for automated classification of objects in a large variety of contexts.

Footnotes

  • Nearest neighbors can be defined based on any general distance metric. In this work, Euclidean distance will be used unless otherwise noted.

  • Note: broad-line QSOs in SDSS DR7 have SPEC_CLN=3.
