The following article is Open access

High-dimensional Statistical Analysis and Its Application to an ALMA Map of NGC 253

, , , , , , , , and

Published 2024 March 26 © 2024. The Author(s). Published by the American Astronomical Society.
, , Citation Tsutomu T. Takeuchi et al 2024 ApJS 271 44 DOI 10.3847/1538-4365/ad2517

Download Article PDF
DownloadArticle ePub

You need an eReader or compatible software to experience the benefits of the ePub3 file format.

0067-0049/271/2/44

Abstract

In astronomy, if we denote the dimension of data as d and the number of samples as n, we often find a case with nd. Traditionally, such a situation is regarded as ill-posed, and there was no choice but to discard most of the information in data dimensions to let d < n. The data with nd is referred to as the high-dimensional low sample size (HDLSS). To deal with HDLSS problems, a method called high-dimensional statistics has rapidly developed in the last decade. In this work, we first introduce high-dimensional statistical analysis to the astronomical community. We apply two representative methods in the high-dimensional statistical analysis methods, noise-reduction principal component analysis (NRPCA) and automatic sparse principal component analysis (A-SPCA), to a spectroscopic map of a nearby archetype starburst galaxy NGC 253 taken by the Atacama Large Millimeter/submillimeter Array (ALMA). The ALMA map is an example of a typical HDLSS data set. First, we analyzed the original data, including the Doppler shift due to the systemic rotation. High-dimensional PCA can precisely describe the spatial structure of the rotation. We then applied to the Doppler-shift corrected data to analyze more subtle spectral features. NRPCA and R-SPCA were able to quantify the very complicated characteristics of the ALMA spectra. Particularly, we were able to extract information on the global outflow from the center of NGC 253. This method can also be applied not only to spectroscopic survey data, but also to any type of data with a small sample size and large dimension.

Export citation and abstract BibTeX RIS

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

In the past decade, the term big data has been widely recognized and observed in almost all fields of scientific research, and of course also in astronomy. Current astronomical instruments provide us with overwhelmingly large data, providing quantitative information on their physical conditions. The most typical big data in astronomy is the product of large surveys, such as the Sloan Digital Sky Survey (SDSS), 10 the Dark Energy Spectroscopic Instrument Survey, 11 Gaia, 12 the Panoramic Survey Telescope And Rapid Response System surveys, 13 among many others.

We should note, however, that there is another type of big data in astronomy. Detailed observations are often very time-consuming, and it is not easy to map objects to obtain a large number of independently sampled measurements. Such a situation is frequently found in integral field spectroscopy (e.g., UVES at the Very Large Telescope, 14 SAURON on the William Herschel Telescope, 15 SDSS MaNGA data, 16 among others. Another typical example of this type is spectroscopic mapping surveys by radio interferometers (e.g., the Atacama Large Millimeter/submillimeter Array (ALMA)) or by satellite instruments at X-ray, ultraviolet, or mid/far-infrared wavelengths. In such a case, if we denote the dimension in the wavelength (or frequency) with d and the number of samples with n, we often find that nd.

Classical statistical analysis implicitly assumes a sample size n is much larger than data dimension d (nd). As we mentioned, this is not true for many cases in scientific research. Traditionally, in astrophysics, such a situation is regarded as an ill-posed problem, and there was no choice but to discard most of the information in the wavelength direction to let d < n. This was more or less the same for other research fields before 2010. Many researchers have simply given up further analysis, or even did not find a need to deal with such problems before the advent of the big data era. For example, Hand et al. (1994) published a well-known compilation of statistical data, containing more than 500 samples, but most of them satisfied the classical condition of the assumption d < n. However, in the late 1990s, data with dn suddenly started to appear along with the development of information science. One of the epoch-making examples of such kind of research was that of Golub et al. (1999). They used the gene expression monitoring by DNA microarrays applied to human acute leukemias as a test case. The dimension of the DNA microarray was d = 7129 and the sample size n = 72. Statistics in the 1990s could not guarantee the accuracy of the estimation in such a case since it was based on d < n.

Various fundamental problems were known for data with the condition of nd:

  • 1.  
    The sample covariance matrix is usually numerically unstable or does not exist, and classical multivariate analysis cannot work.
  • 2.  
    We meet various types of the curse of dimensionality.
  • 3.  
    Calculation costs are often excessively large.

A method to handle such a problem has been desired since then. In order to study such type of data with a very high dimension and small sample data, we had to push back the boundaries of the traditional technique of multivariate statistical analysis. This is because traditional methods are not able to capture the characteristics of the high-dimensional data space, and hence, not able to make use of the rich information that the data intrinsically contained.

The situation changed drastically in the 2000s. A turning point was introduced in the field of probability theory and theoretical physics based on the random matrix theory. In these studies, the asymptotic behavior of eigenvalues of the sample covariance matrix in the limit as d was explored under Gaussian or Gaussian-type assumptions on the population distribution when d and n increase at the same rate, i.e., n/dc > 0 (c: const.) (e.g., Johnstone 2001; Baik et al. 2005; Baik & Silverstein 2006; Paul 2007). In our current interest, however, the assumptions n/dc > 0 and Gaussian(-type) are actually unrealistically restrictive. A seminal paper by Hall et al. (2005) introduced a new concept, high-dimensional low sample size (HDLSS) asymptotics 17 The limits they studied were taken as the dimension of the data d, while the sample size n was fixed. At that time, this was a revolutionary departure from the traditional asymptotics based on n with d fixed.

Hall et al. (2005) and Ahn et al. (2007) discussed an important representation of the HDLSS data, but still under the assumption of Gaussian(-type). In contrast, Yata & Aoshima (2012) developed a new theory by eliminating this assumption and discovered a completely different type of representation when the Gaussian assumption is not satisfied (see Section 2 for the details). These works were the foundation of research in mathematical statistics on very high-dimensional data analysis, and evolved into an important part of the current trend in the area of research, big data science. This new insight shed light on various areas of study such as genomics, medical research, neuroscience, and image and shape analysis. Then, in their sequence of papers, M. Aoshima, K. Yata, A. Ishii and collaborators further developed a new framework of statistical methodology to tackle this type of problem (e.g., Aoshima & Yata 2011; Yata & Aoshima 2012, 2013; Aoshima & Yata 2014, 2015; Ishii et al. 2016; Aoshima & Yata 2019, and references therein). Readers who are interested in the mathematical background of high-dimensional statistical analysis itself are guided to some reviews, (e.g., Aoshima & Yata 2017; Aoshima et al. 2018).

Despite the great developments in other research fields, this new methodology of statistics seemed to be completely unknown in astronomy. High-dimensional data such as ALMA data are regarded as a sum of informative signals and noise, but we cannot separate them quantitatively without prior information. Supposing that we know the frequencies of emission lines is a kind of such prior information that has been traditionally used in traditional astronomy, but this is not what we prefer. While some frequency regions are considered to contain no information, we proceeded to extract information not by the traditional procedure, but by an objective statistical method without prior information. High-dimensional statistical analysis is one of such objective methods.

Thus, in this work, we introduce the method of high-dimensional statistical analysis for HDLSS data in astronomy. In order to examine the performance of the high-dimensional methods to astronomical data, we apply it to a spectroscopic map of the central region of NGC 253. NGC 253 is a nearby prototypical starburst galaxy (Rieke et al. 1980), and has been observed in various wavelengths. We apply high-dimensional principal component analysis (PCA) to the ALMA spectral map (Ando et al. 2017, hereafter A17). A17 presented an 8 × 5 pc resolution view of the central ∼200 pc region of NGC 253, based on ALMA Band 7 observations. As we introduce in detail, the ALMA maps can be regarded as being typically HDLSS. The spectrum of NGC 253 is very rich in molecular lines (e.g., A17; Martín et al. 2021). Another focus of this work is to classify the spectra full of spectral line features by PCA. The evolution of galaxies is mainly driven by star formation, a transition from interstellar medium (ISM) to stars. Various phases of the ISM are related, and the evolution of the ISM is key to completing the understanding of the galaxy's evolution. Spectroscopic observations are of vital importance to extract and interpret the information of matter in galaxies (e.g., Martín et al. 2021). Then, this classification is potentially a powerful tool for the study of galaxy evolution if it turns out to be useful.

This paper is organized as follows. In Section 2, we introduce the concept of HDLSS data and high-dimensional statistical analysis, with an emphasis on high-dimensional PCA. There, we present some important theorems that characterize the surprising features of HDLSS data, without detailed mathematical proof. Then, in Section 3, we move on to the actual analysis of astronomical data. As already mentioned, we used an ALMA map of NGC 253. We summarize the data properties in this section. Our data analysis is twofold: first, we examine if the high-dimensional statistical analysis is really useful for the analysis of spectroscopic mapping data in Section 4. After confirming its excellent performance in this type of analysis, we further go to a deeper analysis of the NGC 253 molecular lines in Section 5. Section 6 is devoted to the conclusions of this work. Some technical details related to both theoretical and practical methods in this work are presented in the appendices. Appendices A and B describe the properties of the dual sample covariance and its geometric representation. In Appendix C, we present the methodology for non-Gaussian data that was not discussed in the main text. A comparison between the results of traditional and high-dimensional PCAs is shown in Appendix D. The method to correct the Doppler shift is presented in Appendix E.

2. Method

2.1. High-dimensional Statistical Analysis

The HDLSS data have a sample size n and data dimension d with nd. In this work, we mean the newly developed statistical methodology to deal with HDLSS data by high-dimensional statistical analysis.

2.1.1. Unusual Behavior of High-dimensional Statistics

As a pedagogical example, we demonstrate that typical unusual properties only appear in high-dimensional statistical analysis, known as strong inconsistency. We consider a sample mean $\overline{{\boldsymbol{x}}}$,

Equation (1)

Suppose that x i s are drawn from a d-dimensional distribution with a mean μ . Under the usual condition often supposed in traditional statistics, d/n → 0, we have

Equation (2)

where the superscript P stands for the convergence in probability. 18 Equation (2) is referred to as the consistency of the sample average. However, in high-dimensional statistics, d/n, and shockingly we have

Equation (3)

This is the so-called strong inconsistency in the HDLSS framework. What Equation (3) means is that the noise in the observed data drastically increases with the increase of the dimension d. Traditional statistical techniques suppose that the noise is negligible compared to the data information. Hence, traditional statistics can never handle HDLSS data.

In order to have a more concrete image of such counterintuitive phenomena, we examine the behavior of this huge noise in high dimensions. For this, however, some mathematical background is necessary. In the following, we introduce the background mathematical concepts. We do not try to introduce them strictly to try to give rigorous mathematical proof, even if the following sections seem to be rather mathematically inclined.

2.1.2. Dual Sample Covariance

Consider a parent distribution of dimension d, with a mean vector μ = 0 and a covariance matrix ${\tilde{{\rm{\Sigma }}}}_{d}\gt \tilde{O}$ (i.e., all the eigenvalues are positive). We can adopt this constraint for the sake of simplicity. Let λi (i = 1,...,d) be the eigenvalues of $\tilde{{\rm{\Sigma }}}$, and sort them out as

Equation (4)

Then, we can make an eigenvalue decomposition of

Equation (5)

Equation (6)

Here, $\tilde{H}=\left[{{\boldsymbol{h}}}_{1},...,{{\boldsymbol{h}}}_{d}\right]$ is an orthogonal matrix of the corresponding eigenvectors. Note that these are all outrageously huge matrices.

Now we draw a set of n samples from this parent population (d > n), x 1, x 2, ..., x n . Suppose that we describe this set of data by a (d × n) data matrix

Equation (7)

and x i (i = 1, ..., n) are independently and identically distributed (i.i.d.). We should note that the i.i.d. condition is not always satisfied in practice. However, this condition is often required for the rigorous construction of the mathematical methodology. The sample covariance matrix (d × d) is expressed as

Equation (8)

Here, we define the dual sample covariance (n × n),

Equation (9)

A schematic description of a sample covariance matrix $\tilde{S}$ and its dual matrix ${\tilde{S}}_{{\rm{D}}}$ is shown in Figure 1. If we consider the sample eigenvalues of ${\tilde{S}}_{{\rm{D}}}$, its n eigenvalues

Equation (10)

are common with the first n eigenvalues of $\tilde{S}$. See Appendix A for the merit of ${\tilde{S}}_{{\rm{D}}}$.

Figure 1.

Figure 1. Schematic description of the relation between a sample covariance matrix $\tilde{S}$ and its dual matrix ${\tilde{S}}_{{\rm{D}}}$. The top panel describes the sample covariance matrix, and the bottom panel describes the dual matrix, respectively.

Standard image High-resolution image

An important piece of the initial exploration of HDLSS asymptotics in Hall et al. (2005) was the revelation of a surprisingly rigid geometric structure that naturally emerges from the asymptotics in an increasingly random setting, referred to as the geometric representation by them. Recall the eigenvalue decomposition of the covariance matrix $\tilde{{\rm{\Sigma }}}=\tilde{H}\tilde{{\rm{\Lambda }}}{\tilde{H}}^{\top }$. Let

Equation (11)

Then, $\tilde{Z}$ is a (d × n) sphered data matrix from a distribution with the identity covariance matrix. Here, we write $\tilde{Z}={\left[{{\boldsymbol{z}}}_{1},...,{{\boldsymbol{z}}}_{d}\right]}^{\top }$ and ${{\boldsymbol{z}}}_{i}={\left({z}_{i1},...,{z}_{{in}}\right)}^{\top },\ (i=1,...,d)$. Note that ${\mathbb{E}}({z}_{{ij}}{z}_{i^{\prime} j})\,=0\ (i\ne i^{\prime} )$ and ${\mathbb{V}}({{\boldsymbol{z}}}_{i})={\tilde{I}}_{n}$, where ${\mathbb{E}}$ and ${\mathbb{V}}$ denote the expectation and variance of a random variable, and ${\tilde{I}}_{n}$ is the n-dimensional identity matrix. We assume that the fourth moments of each variable in $\tilde{Z}$ are uniformly bounded. Note that if $\tilde{X}$ is Gaussian, the zij s are i.i.d. standard normal random variables. We also define the eigenvalue decomposition of ${\tilde{S}}_{{\rm{D}}}$ by

Equation (12)

Here, u i denotes a unit eigenvector corresponding to ${\hat{\lambda }}_{i}$. See Appendix B for the details of the geometric representations of ${\tilde{S}}_{{\rm{D}}}$. Let ${\hat{{\boldsymbol{h}}}}_{i}$ be the ith eigenvector of $\tilde{S}$. Note that ${\hat{{\boldsymbol{h}}}}_{i}$ can be calculated by

Equation (13)

2.2. High-dimensional PCA

We adopt the same setting as the previous sections for the data distribution: we suppose a random variable with a mean μ = 0 and a covariance matrix ${\tilde{{\rm{\Sigma }}}}_{d}\gt 0$. We assume the following model, referred to as the generalized spiked model, for the eigenvalue distribution of ${\tilde{{\rm{\Sigma }}}}_{d}$,

Equation (14)

where ai (> 0), ci (> 0), and αi (α1 ≥ ... ≥ αm > 0) are unknown constants that preserve the order λ1 ≥ ... ≥ λd , and m is an unknown positive fixed integer. 19 m stands for the number of spiked eigenvalues in the sense of quadratic contribution ratio. The details of the spiked model including the generalized spiked model are considered and given in Yata & Aoshima (2013). The eigenvalues λi (im) describe the intrinsic information of the data, and λi (im + 1) represents the noise. Note that

Equation (15)

i.e., the noise satisfies the sphericity condition (Equation (B1)). We set the sample covariance matrix $\tilde{S}={n}^{-1}\tilde{X}{\tilde{X}}^{\top }$ as above. Here, we define n(d) as the size of a sample depending on d.

2.2.1. Limitation of Traditional PCA

Yata & Aoshima (2009) investigated the properties of traditional PCA when it is applied to HDLSS data, and demonstrated the condition so that the estimation by traditional PCA would have the consistency in the form of n(d) = dγ 20 for a positive constant γ and the limitation of traditional PCA. They have given the following two theorems on the eigenvalues, depending on whether the condition would satisfy

Equation (16)

Yata & Aoshima (2009) I.

Theorem 2.1 If Z satisfies Condition (16), for $i\leqslant m$,

Equation (17)

under the conditions

  • 1.  
    $d\to \infty $ and $n\to \infty $ for i that satisfies ${\alpha }_{i}\gt 1$ .
  • 2.  
    $d\to \infty $ and ${d}^{1-{\alpha }_{i}}/n(d)\to 0$ for i that satisfies ${\alpha }_{i}\in (0,1]$.

See Appendix C for details of the consistency when Z does not satisfy Condition (16).

Theorem 2.1 gives the condition that the estimation by traditional PCA would have consistency. The condition described by d and n in (1) of these theorems is a mild condition in the sense that one can choose n free from d or that n may be much smaller than d such as $n=\mathrm{log}d$. However, in (2), n is heavily dependent on d. Hence, we should be very careful to apply conventional PCA to the HDLSS data. In order to overcome this fundamental problem of conventional PCA, Yata & Aoshima (2010a) and Yata & Aoshima (2012, hereafter YA12) proposed two new PCAs for HDLSS data. In this work, we focus on one of them, the noise-reduction (NR) methodology (YA12).

2.2.2. NRPCA: A New Method for HDLSS Data

Here, we introduce high-dimensional PCA. First, we informally present the concept of high-dimensional PCA, which is schematically described in Figure 2. Because of the enormous noise that concentrates on the sphere in the data space, the intrinsically important characteristics of the data are completely buried under the noise sphere and obscured. However, it is not hopeless to estimate the characteristics even under such a situation, thanks to the fact that the noise behavior around the sphere can be well analyzed and essentially subtracted. The method to concretely realize this idea is the NR methodology proposed by YA12.

Figure 2.

Figure 2. A schematic description of high-dimensional PCA. Even if we have significant important features because of the existence of a huge noise sphere, they are completely embedded in the observed data. High-dimensional PCA can subtract the noise sphere and enable the extraction of the desired features.

Standard image High-resolution image

We try to explain it in a more rigorous manner.YA12 proposed the NR methodology derived from the geometric representation (Equation (B2)). First, we decompose ${\tilde{S}}_{{\rm{D}}}$ as

Equation (18)

The second term represents the noise. Then, from Theorem B.1, under Condition (16), we obtain the following geometric representation for the noise:

Equation (19)

i.e., the noise part becomes deterministic under the HDLSS condition. This result enables us to remove the noise part and obtain the formula

Equation (20)

An estimation of m is given by Aoshima & Yata (2018). We have the following theorem.

YA12 II.

Theorem 2.2 When the components of $\tilde{Z}$ satisfy Condition (16), it holds that for $i\leqslant m$,

Equation (21)

under the conditions that

  • 1.  
    $d\to \infty $ and $n\to \infty $ for i such that ${\alpha }_{i}\gt 1/2$,
  • 2.  
    $d\to \infty $ and ${d}^{1-2{\alpha }_{i}}/n(d)\to 0$ for i such that ${\alpha }_{i}\,\in (0,1/2]$.

If we compare Theorem 2.2 with Theorem 2.1, we see that the consistency holds for ${\check{\lambda }}_{i}{\rm{s}}$ under milder conditions than $\hat{\lambda }$ (under Condition (16)). 21

Next, we apply the NR methodology to the principal component (PC) direction vectors. Define ${\check{{\boldsymbol{h}}}}_{i}\equiv {\left(n{\check{\lambda }}_{i}\right)}^{-1/2}\tilde{X}{{\boldsymbol{u}}}_{i}$, and examine its performance as the estimator of the PC direction vector h i . Note that ${\check{{\boldsymbol{h}}}}_{i}={\left({\hat{\lambda }}_{i}/{\check{\lambda }}_{i}\right)}^{-1/2}{\hat{{\boldsymbol{h}}}}_{i}$. YA12 showed the following theorem.

YA12 III.

Theorem 2.3 Assume ${\lambda }_{i}\ (i\leqslant m)$ has multiplicity one. Under conditions in Theorem 2.2, and the components of $\tilde{Z}$ satisfy Condition (16), it holds that

Equation (22)

We also consider the PC scores. The ith PC score of xi is given by ${{\boldsymbol{h}}}_{i}^{\top }{{\boldsymbol{x}}}_{j}={{\boldsymbol{z}}}_{{ij}}{\left({\lambda }_{i}\right)}^{\tfrac{1}{2}}\equiv {{\boldsymbol{s}}}_{{ij}}$. Let ${{\boldsymbol{u}}}_{i}\equiv {\left({\hat{u}}_{i1},...,{\hat{u}}_{{in}}\right)}^{\top }\ (i=1,...,d)$. Consider we estimate the PC score by ${\hat{{\boldsymbol{u}}}}_{{ij}}{\left(n{\check{\lambda }}_{i}\right)}^{\tfrac{1}{2}}\equiv {\check{{\boldsymbol{s}}}}_{{ij}}$. We define the mean squared error (MSE) of the ith PC score by

Equation (23)

Then, we have yet another important theorem.

YA12 V.

Theorem 2.4 Assume that ${\lambda }_{i}\ (i\leqslant m)$ has multiplicity one. Then, under the same conditions as Theorem 2.3,

Equation (24)

holds.

All these theorems hold for the cases with nonzero mean, by replacing the setting with zero-centered variables. Thus, now we have demonstrated that the NR methodology yields high performance for the estimation of the PC eigenvalues (contributions), PC vectors, and PC scores for HDLSS data. In the following discussion, the presented eigenvalues are obtained by NRPCA.

2.2.3. Automatic Sparse Principal Component Analysis

In this section, we introduce (thresholded) sparse principal component analysis (SPCA) methods for high-dimensional data. SPCA can be summarized as follows: let ${\hat{{\boldsymbol{h}}}}_{i}\,={\left({\hat{h}}_{i(1)},\ldots ,{\hat{h}}_{i(d)}\right)}^{\top }$ for all i. Given a sequence of threshold values ζ > 0, define the thresholded entries as

Equation (25)

Let ${\hat{{\boldsymbol{h}}}}_{i* }={\left({\hat{h}}_{i* (1)},\ldots ,{\hat{h}}_{i* (d)}\right)}^{\top }$. The thresholded estimator of h i is defined by

Equation (26)

However, the estimator heavily depends on a choice of ζ and does not hold consistency properties for HDLSS data. Recently, Yata & Aoshima (2024) proposed an SPCA method by using the NR method as follows: for the PC direction by the NR method, ${\check{{\boldsymbol{h}}}}_{i}={\left({\check{h}}_{i(1)},\ldots ,{\tilde{h}}_{i(d)}\right)}^{\top }$, we assume $| {\check{h}}_{i(1)}| \geqslant \cdots \geqslant | {\check{h}}_{i(d)}| $ for the sake of simplicity. For a given constant ω ∈ (0, 1], we consider PC directions whose cumulative contribution ratio is greater than or equal to ω. There exists a unique integer ${\check{k}}_{i\omega }\in [1,d]$ such that

Equation (27)

We define that

Let ${\check{{\boldsymbol{h}}}}_{i\omega }={\left({\check{h}}_{i\omega (1)},\ldots ,{\check{h}}_{i\omega (d)}\right)}^{\top }$. Yata & Aoshima (2024) called the new PCA method that uses ${\check{h}}_{i\omega (s)}{\rm{s}}$ automatic sparse principal component analysis (A-SPCA). They showed some consistency properties of ${\check{{\boldsymbol{h}}}}_{i\omega }$ in the HDLSS data and investigated the performance in actual data analyses. Here, A-SPCA uses the same eigenvalues estimated by NRPCA. In this paper, we set ω = 1/2. 22 Intuitively, A-SPCA can extract the most important components of the eigenvectors obtained by NRPCA. Namely, these components practically control the PCs.

3. Data

Our target galaxy, NGC 253, is a nearby barred spiral galaxy at a distance of 3.5 Mpc, i.e., 1'' corresponds to 17 pc (Rekola et al. 2005). Morphological type Sc is assigned. It has a systemic heliocentric velocity of 243 ± 2 km s−1 (z = 0.000811: Kamphuis et al. 2015), mainly due to the cosmological expansion. This galaxy is one of the brightest sources of far-infrared emission except for the Magellanic Clouds and has been extensively studied at many wavelengths. It is known as a pure starburst, whose star formation rate (SFR) in the central molecular zone is estimated to be SFR ∼ 2 M [yr−1] (Rieke et al. 1980; Keto et al. 1999; Bendo et al. 2015). Massive, dense, and warm ISM (e.g., Sakamoto et al. 2011) was identified in the central region. Super star clusters are identified at the optical and near-infrared bands (Fernández-Ontiveros et al. 2009). An intense outflow is also observed (Matsubayashi et al. 2009; Bolatto et al. 2013). The super star clusters and outflow are also associated with the starburst activity.

3.1. ALMA Map of NGC 253

The data used in this work is a spectroscopic map of a nearby archetype starburst galaxy NGC 253, obtained by A17, at Band 7. They obtained the 340–365 GHz (λ ≃ 0.85 mm) spectra covering a total frequency range of 11 GHz, compiling the results of two projects in the ALMA Cycle 2 observations: 2013.1.00735.S (PI: Nakanishi) and 2013.1.00099.S (PI: Mangum). The field of view of ALMA at these wavelengths is 16''–17''. The observations cover the frequency range of 340.2–343.4 GHz and 352.5–355.7 GHz for 2013.1.00735.S, and 350.6–352.4 GHz and 362.2–365.2 GHz are covered by 2013.1.00099.S. The uncertainty of the flux calibration in ALMA Band 7 observations is ∼10% according to the ALMA Cycle 2 Proposers Guide. Data reduction is conducted with Common Astronomy Software Applications (CASA; McMullin et al. 2007), versions 4.2.1 and 4.2.2. The image of the central region of NGC 253 was created by using the task clean. The synthesized beam size of the final images is 0''45 × 0''3, which corresponds to 8 × 5 pc at the distance to NGC 253. With a velocity resolution of 5.0 km s−1, the rms noise levels are 1.0–2.4 mJy beam−1. Various emission lines are detected in the spectra (see Figure 2 and Table 1 of A17). The number of spectroscopic resolution unit (sampling number) is 2248.

The A17 original map consists of 864 × 864 points in the spatial domain (R.A and decl.), and 2248 components along the frequency domain (in units of hertz) in total. However, since the pixel scale was smaller than the synthesized beam size of the map 0''45 × 0''3 (oversampled), each spatial grid in the map is not independent of each other. To overcome this issue, we regridded the map in the spatial dimension. The equivalent circular beam size corresponding to the elliptical beam of 0''45 × 0''3 is 0''37. We used this effective beam size to regrid the image map. The resulting map with independent grids consists of 230 × 230 pixels. We then should cut out low-signal regions. For this, instead of estimating the signal-to-noise ratio for some lines, we integrated the intensity over the whole range of frequency and calculated the total intensity I [Jy beam−1] and chose the regions with I > 1.0, based on the rms level. We show this mask region, together with the significantly bright regions in Figure 3. The number of remaining pixels n is 231.

Figure 3.

Figure 3. The bright regions of NGC 253 map cut out by the mask. Left: the mask region map. White regions have significant intensity signals. Center: the cutout region with significantly bright emission. Right: the bird's view of the signal.

Standard image High-resolution image

3.2. Doppler Shifts due to the Systemic Hubble Velocity and Rotation

The central region of NGC 253 is strongly inclined (i ∼ 78°). This causes a fairly coherent rotation pattern on the spectroscopic map through the Doppler shift. We made a Gaussian fit to three prominent emission lines HCN(4–3), HNC(4–3), and CS(7–6) in the spectra to estimate the line center wavelengths. The map of the velocity field is presented in Figure 4. We determined the amount of the shift by averaging the shifts obtained from the three lines. Then, we shifted the frequencies with respect to the systemic velocity of NGC 253 caused by the Hubble velocity. After this correction, we have a map without the systematic effect of the coherent rotation and we can analyze more subtle properties of the submillimeter emission lines.

Figure 4.

Figure 4. Velocity map of NGC 253 estimated from HCN(4–3) (top), HNC(4–3) (middle), and CS(7–6) (bottom). The systemic velocity of 243 km s−1 is subtracted.

Standard image High-resolution image

We demonstrate an example of the emission lines in Figure 22 in Appendix E. We have frequency gaps in our ALMA data cube. If we try to shift the spectrum according to the velocity, the boundaries of the spectra are also shifted. We simply adopted frequency ranges common to all the pixels. The final size of the data is 231 along the spatial position, and 1971 in spectral dispersion after all corrections.

3.3. Spectral Map as HDLSS Data

The resulting map of NGC 253 has spatial dimension 231 × spectral dimension 1971, i.e., n = 231 and d = 1971, and clearly nd. Though it is not as extreme as DNA microarray data (typically n ∼ 10–100 and d ∼ 104–105), our ALMA map is typical HDLSS data. 23 Problems from the astrophysical side are that a very large amount of information is contained in the spectra, and a large variety of spectral lines are observed compared to n. This situation is not special in the high-dimensional analysis.

However, the application of high-dimensional methods to spectroscopic data is not completely straightforward. Spectral lines are broadened and/or shifted by a turbulent motion and other physical mechanisms in the ISM. Then, the information carried by each spectral resolution unit is not independent. 24 Such a complication has never been examined in other fields of science about HDLSS data. However, since it is commonly seen in spectroscopic data in any field of astronomy, before going to the detailed analysis, we first examine if the high-dimensional statistical analysis really works for the astronomical spectroscopic map, with line shifts and broadening.

4. Preparatory Analysis

As we have already mentioned above, there are some potential problems specific to astronomical spectroscopic data, which are different from typical examples of HDLSS data in genomics and other fields. The most fundamental issue is that when we regard spectroscopic data as vector data ${\boldsymbol{x}}={({x}_{1},...,{x}_{d})}^{\top }$, the information on the neighboring frequency (or wavelength) index i and i + 1 (i = 1, ..., d − 1) is not independent. Namely, we often have internal motion in the ISM of a galaxy that leads to a Doppler broadening along the frequency axis. We also find a systemic rotation of the system over the whole observed region of an object (Section 3). In order to examine if high-dimensional statistical analysis would work for such type of astronomical spectroscopy data, we made a preparatory analysis of the original spectral mapping data of NGC 253 with high-dimensional PCA. By the term original, we mean that we started this analysis without correcting the Doppler shift on the map due to the internal systemic rotation. Namely, we blindly applied high-dimensional PCA methods to the data and examined what feature would be detected. For this, we sort the two-dimensional coordinates along the R.A.- direction and then the decl. direction, and renumbered them so that we can regard the data as a one-dimensional vector. Though we can apply PCA directly to a two-dimensional image, we adopted this procedure throughout this paper, as a first simplistic analysis.

4.1. Eigenspectra

In the following part of this work, we mainly focus on the A-SPCA results. However, since NRPCA deals with all the spectral features without extracting only the most important ones, we also present the eigenspectra (eigenvectors constructed from the observed spectra by PCA) from the NRPCA. The upper panel of Figure 5 shows the eigenspectra constructed by NRPCA from the ALMA spectral map of NGC 253. 25 We observe that specific features are commonly found in similar frequency ranges of the spectra for all the PCs. This suggests that some spectral features contain richer information on the whole spectra than other spectral regions.

Figure 5.

Figure 5. The eigenspectra corresponding to the first through the fifth PCs constructed from the original ALMA spectral map of NGC 253. Upper five rows: eigenspectra obtained by NRPCA. Lower five rows: same as upper rows, but obtained by A-SPCA. Only the controlling features have nonzero values.

Standard image High-resolution image

A-SPCA specifies and localizes the most important components that correspond to these particular features. The lower panel of Figure 5 shows the eigenspectra obtained by A-SPCA. As seen in Figure 5, A-SPCA only leaves the values if the features are important to determine the spectra, and forces the other components to be zero. Hence, the A-SPCA eigenspectra are not used for the reconstruction of the spectra with all features, and NRPCA is used for this purpose. We make use of this outstanding advantage of A-SPCA to specify the controlling spectral features of PCA in the following analysis.

4.2. Distribution of the Contribution of PCs

First, we show the eigenvalue distribution of the PCs sorted with their contribution obtained by NRPCA in Figure 6. First, several eigenvalues are prominently larger than the rest. This behavior is referred to as spiked in the context of high-dimensional statistical analysis. Recall that the eigenvalues of the PCA represent the contribution of each PC, namely, the PCA decomposes the whole covariance matrix into contributions from each PC. Since the obtained eigenvalue distribution shows this spiked structure, we see that high-dimensional PCA detected some characteristics of the ALMA spectral map of NGC 253. This guarantees that the current data can be safely regarded as typical HDLSS, and that the assumptions of high-dimensional PCA apply.

Figure 6.

Figure 6. Normalized eigenvalues (contributions) obtained from the analysis of the ALMA spectroscopic map of NGC 253 by NRPCA.

Standard image High-resolution image

Astrophysically, Figure 6 demonstrates a very promising possibility of the high-dimensional statistical analysis of the spectral mapping data. A17 identified more than 30 molecular/radical lines on the map (see their Figure 1), as well as some broad or continuum features. Even if we restrict our interest to the lines, it is implausible that we classify them only by physical intuition. Traditionally, for example, a line ratio has been used to extract physical information from spectral data. However, brute-force trials may not work if we try to diagnose many lines simultaneously because the number of line ratios would increase prohibitively with a number of features, known as a combinatorial explosion. In clear contrast, Figure 6 implies that the whole information of the complicated spectral map can be represented by the first several PCs. This means that the complicated spectral features including more than 30 lines are controlled by several bases represented by the PCs. This advantage itself is a well-known benefit to using PCA in general as a dimensionality reduction method (e.g., Galaz & de Lapparent 1998; Ronen et al. 1999; Wang et al. 2011; Pace et al. 2019; Portillo et al. 2020, among many others), but all these previous applications were on integrated 1D spectra with a sufficiently large number of n, namely, not HDLSS data. We stress that the analysis in this work is substantially different from these works: the current method can be applied to a spectral map, which is a typical HDLSS data set and traditional PCA would not have worked due to the huge noise sphere, as mentioned in Section 2.

Since we confirmed that the methodology of high-dimensional statistical analysis worked well for the astrophysical spectral amp data, we safely proceeded to further analysis by high-dimensional PCA.

4.3. Physical Meaning of the First Two PCs

The next step is to examine what PC1 and PC2 represent. Figure 7 shows the scatter of PC1 and PC2. The most striking feature is the butterfly-like pattern symmetric with respect to the x-axis. It reminds us of the coherent rotation of the central region of NGC 253. We reconstructed the two-dimensional structure to map PC1 and PC2 in Figure 8.

Figure 7.

Figure 7. The distribution of PC1 and PC2 of the ALMA map of NGC 253 by A-SPCA.

Standard image High-resolution image
Figure 8.

Figure 8. The two-dimensional structure to map PC1 and PC2 obtained by A-SPCA. The left panel shows the map of PC1, and the right panel describes the map of PC2.

Standard image High-resolution image

It is clear that PC1 represents the intensity of lines, and PC2 represents the symmetric and coherent pattern of the center of NGC 253. We observe a nice agreement between the lower panel of Figure 8 with the directly estimated velocity field of the same region in Figure 4.

To have a deeper look at the relation, we present the correlation between PCs and observed properties in Figure 9. The upper panel of Figure 9 is a scatter plot between PC1 and the total intensity integrated over the whole frequency range for each pixel. We indeed observe a good correlation between them (Figure 9). The correlation is not linear because PCA only extracts the linear features of the data, while the relation to the physical quantities would not be necessarily linear. We also examined the relation between the peculiar velocity and PC2 in the right panel of Figure 9. As we already presented in a two-dimensional map, PC2 approximately delineates the velocity field in Figure 9. However, we see some outliers where the PC2 and the peculiar velocity are not coherent. These regions may be affected by some local phenomena, for example, a strong outflow. We will revisit this issue later with a detailed analysis.

Figure 9.

Figure 9. The scatter plot between PCs and observed physical properties. Left: the correlation between PC1 and the integrated intensity over the whole frequency range I [Jy beam−1] on the ALMA map of NGC 253. Right: the correlation between PC2 and the peculiar velocity on the ALMA map.

Standard image High-resolution image

4.4. Spectral Features Corresponding to PCs: Original Data

It is interesting to examine what features are concretely responsible for PCs. Actually, A-SPCA can specify the responsible features to characterize the PCs. We show the responsible spectral features in Figure 10. The spectra are integrated over the whole analyzed regions for both figures. In Figure 10, stars represent the features responsible for PC1, and triangles are for PC2. We do not show higher-order contributions for clarity. PC1-related features are assigned to the HCN(4–3) and HNC(4–3) lines, particularly close to the central part of the lines. PC2-related features are related to HCN(4–3), but in contrast to PC1, they are connected to the wing part of the line. Since the spectra are integrated over the analyzed regions, the wing is actually made by the Doppler shifts. This is clearly consistent with the fact that the PC2 map (Figure 8) agrees well with the Doppler velocity field of the NGC 253 map (Figure 4).

Figure 10.

Figure 10. Responsible features to characterize PCs from A-SPCA for the ALMA map of NGC 253. Stars represent the features responsible for PC1, and triangles are for PC2.

Standard image High-resolution image

The performance of A-SPCA to the ALMA spectroscopic map as HDLSS data is certainly guaranteed by this preparatory analysis. We can safely proceed to the detailed analysis of the molecular lines, as in the next section.

5. Results and Discussion

After examining the performance of the high-dimensional statistical analysis in Section 4, now we can proceed to a further physical analysis.

5.1. Application of High-dimensional PCA to the Doppler Shift-corrected Map of NGC 253

We apply high-dimensional PCA to the Doppler-corrected map of NGC 253, exactly in the same manner as we did in Section 4. Note that, though we use the same notation for PC2, now PC2 is the value obtained from the Doppler-corrected map and different from those in Section 4 from NRPCA. We should also note that since we applied NRPCA and A-SPCA to the Doppler-corrected map, all the contributions of PCs are affected by the correction.

The same as the preparatory analysis, we first show the eigenspectra of the Doppler-corrected spectral map in Figure 11. Compared with Figure 5, the meaning of the shape of each eigenspectrum becomes clearer. We stress that since we have already subtracted the global shift of the whole spectra, it does not represent the systemic velocity of the position anymore. Thus, the velocity structure is always the first-order deviation from the systemic rotation pattern, and appears as a broadening or asymmetric distortion of the line profiles. Another important difference is that an important spectral line is identified in PC5 (see PC5 in the bottom panel in Figure 11), which was not specified in the original data. This is because the dominant effect of the systemic rotation was successfully removed from the map, and more subtle features appear more prominent in the Doppler-corrected map. Now very clearly PC2 represents the wing of the HCN line and some other lines, indicating that it reflects a very local expansion of the gas. PC3 and PC4 describe the blueward distortion of the lines, caused by a gas motion toward us faster than the expansion velocity. Though PC5 is less prominent, but similarly to PC3 and PC4, it represents the redward distortion of the lines. To examine the effect of the choice of ω, we also present the eigenspectra with ω = 0.25 and ω = 0.75 in Figure 12.

Figure 11.

Figure 11. The eigenspectra corresponding to the first through the fifth PCs constructed from the Doppler-corrected ALMA spectral map of NGC 253. The upper five rows present the eigenspectra obtained by NRPCA, and the lower five rows show that obtained by A-SPCA.

Standard image High-resolution image
Figure 12.

Figure 12. The eigenspectra corresponding to the first through the fifth PCs constructed from the Doppler-corrected ALMA spectral map of NGC 253. The upper five rows present the eigenspectra obtained by A-SPCA with 0.25, and the lower five rows show that obtained by A-SPCA with 0.75.

Standard image High-resolution image

We present the distribution of the contribution of PCs in Figure 13. Now the contribution of new PC2 is much smaller than old PC2 in Figure 6. This is also because of the successful removal of the systemic rotation. Instead, we have a smaller but nonnegligible contribution of a new PC2. Now PC2 would describe much more subtle and local features of the spectral line map of NGC 253, which is consistent with the implications from the eigenspectra.

Figure 13.

Figure 13. Normalized eigenvalues (contributions) obtained from the analysis of the ALMA spectroscopic map of NGC 253 by NRPCA, after correcting the effect of the Doppler shifts of the systemic rotation.

Standard image High-resolution image

This is more drastically represented on the scatter plot of PCs. We demonstrate the relations between PCs in Figures 1416. Figure 14 shows the two-dimensional distribution of PC1 and PC2. The butterfly-like pattern seen in Figure 7 completely disappeared. Figures 15 and 16 are the relations between PC1 and PC3, and PC2 and PC3, respectively. Again we do not find any symmetric pattern in Figures 15 and 16. From these figures, we do not have an obvious symmetric pattern in the intrinsic spectral features in the map. Both in 2D and 3D, we observe a continuous distribution among PCs.

Figure 14.

Figure 14. The two-dimensional structures of the distribution of PC1 and PC2 of the NGC 253 obtained by A-SPCA, after correcting the effect of the Doppler shifts of the systemic Bottom: same as the top panel, but with PC1 and PC3. Note that the PC2 in this figure is different from that in Figure 8.

Standard image High-resolution image
Figure 15.

Figure 15. Same as Figure 14, but for PC1 and PC3.

Standard image High-resolution image
Figure 16.

Figure 16. Same as Figure 14, but for PC2 and PC3.

Standard image High-resolution image

5.2. Map of PCs of the NGC 253 Spectra

We reconstructed the two-dimensional structure to map PC1, PC2, and PC3 in Figure 17. Again, PC1 represents the intensity of lines. In contrast to Figure 8, new PC2 shows a more complicated, localized pattern in NGC 253. The map of PC3 also shows smaller-scale structures. Recalling that PC2 suggests the existence of small-scale outflow. According to A17, the spatial scale (diameter) of the most prominent region with large PC2 is ∼10 pc, namely, it corresponds to a small-scale phenomenon, indeed.

Figure 17.

Figure 17. The two-dimensional structure to map PC1, PC2, and PC3 obtained by NRPCA.

Standard image High-resolution image

Particularly interesting is the region with negative PC3 in the bottom panel of Figure 17. The eigenspectra show that PC3 and partially PC4 describe a local but larger-scale flow toward us, which appears in the blueshift of the lines. Bolatto et al. (2013) reported a global molecular outflow from the center of the radio continuum. More recently, Walter et al. (2017) and Krieger et al. (2019) performed detailed ALMA observations of this region. They made a detailed analysis and named this outflow the SW streamer. This outflow is going out toward us, and the position of the root of the SW streamer precisely agrees with the region with negative PC3 in Figure 17 (Krieger et al. 2019, see their Figure 1). The anomalous region specified by the peculiarity in velocity in Figure 18 also agrees well with the root of the SW streamer. In addition to the molecular observations, an outflow is also observed in Hα and other optical emission lines by Fabry–Perot spectroscopy (Matsubayashi et al. 2009). Though this delineates the flow of ionized gas, the position, shape, and direction of the outflow agree fairly well with the molecular outflow. All these clues indicate that high-dimensional PCA efficiently extracts an outflow phenomenon from an ALMA spectroscopic map, in a purely objective way.

Figure 18.

Figure 18. The region with velocity peculiarity in Figure 9.

Standard image High-resolution image

5.3. Spectral Features of the Doppler-corrected Data Corresponding to the PCs

In Figure 19, stars and triangles represent PC1- and PC2-related responsible spectral features similarly to Figure 10, and filled circles are PC3-related. First, we observe a global shift of lines compared to Figure 10. This is, of course, due to the recession velocity. Since the boundary between ALMA bands is fixed at the observer's frame, some parts of the spectra go out of the boundary after the Doppler shift correction. As a result, we lose significant parts of the spectra in Figure 19. After the Doppler correction, PC1 describes the intensity of HCN(4–3) and HNC(4–3) lines more specifically, which is reflected in the fact that responsible features concentrate on the peak of these lines. PC2 is assigned to the part of the profile slightly far from the line center for HCN(4–3). However, it is related to the lower-frequency side of the HNC(4–3) line, though it is difficult to see clearly because this side is cut out by the correction. PC3 further delineates the side of the higher-frequency side of the HCN(4–3) line, while it has some overlap with PC2-related features on the lower-frequency side. These properties were already suggested by eigenspectra (Figure 11), and we confirm them by the detailed information from A-SPCA. Thus, even if the spectra look formidably complicated, high-dimensional PCA methods like NRPCA and A-SPCA can disentangle the information and pick up the controlling spectral features from the spectroscopic map. Rather surprisingly, the detailed profile of HCN and HNC lines govern almost all spectral characteristics of the ALMA map of NGC 253, including the outflow phenomenon. Further investigation on detailed astrophysical processes from the ALMA map of NGC 253 is left to our future work.

Figure 19.

Figure 19. Responsible features to characterize PCs from the A-SPCA for the ALMA map of NGC 253, after the Doppler shift correction due to the systemic rotation. Stars and triangles represent PC1- and PC2-related responsible spectral features similar to those in Figure 10, while filled circles are PC3 related.

Standard image High-resolution image

6. Conclusion

The evolution of galaxies is mainly driven by star formation, a transition from ISM to stars. Various phases of the ISM are related, and the evolution of the ISM is key to completing the understanding of the galaxy's evolution. Spectroscopic observations are of vital importance to extract and interpret the information of matter in galaxies.

Though spectroscopic mapping and integral field spectroscopy are fundamentally important to reveal ISM physics, detailed observations are often very time-consuming, and it is not easy to map objects to obtain many independently sampled measurements. Consequently, we often find a situation where the dimension of data d is much larger than the sample size n, i.e., nd. This is referred to as HDLSS. The HDLSS data are different from typical big data in astronomy (usually n is huge), but d is extremely large. In traditional studies in astronomy, the HDLSS problem has been regarded as being ill-posed and simply given up for analysis. However, at the end of the last millennium, a strong need for an analysis method for such data arose (e.g., Golub et al. 1999). Then, during the subsequent decade, a new statistical method to tackle HDLSS problems was significantly developed, and is known as the high-dimensional statistical analysis (e.g., Aoshima & Yata 2017; Aoshima et al. 2018). This paper is organized twofold. First, we introduced the general framework of high-dimensional statistical methods, and then we applied the methodology to the ALMA spectroscopic map data.

For the HDLSS data, the full information on the sample covariance is expressed by a huge d × d covariance matrix, which is implausible to handle in various aspects. This specific difficulty of HDLSS data is often referred to as the curse of dimensionality. High-dimensional statistical analysis overcomes this problem by introducing the dual representation of the sample covariance matrix, which is a much smaller, n × n dimension square matrix. The remarkable property of the dual matrix is that it contains practically the same information on the first n eigenvalues of the original sample covariance matrix. Nevertheless, the dual sample covariance matrix is small enough to handle without the curse of dimensionality. In addition to this most difficult issue, many unusual behaviors are known for HDLSS data, such as strong inconsistency, spherical concentration for Gaussian variables, and axis concentration for non-Gaussian variables. High-dimensional statistics is the framework to make use of these specific aspects of HDLSS data and construct a high-dimensional counterpart of traditional methods.

After introducing the concept of HDLSS data and the corresponding methodology, we applied the high-dimensional version of PCA on the NGC253 ALMA spectral map (A17). ALMA mapping data are typically HDLSS in general, and as for the data in this work, n = 231 and d = 1971, clearly nd even though it is less extreme than the case in genomics, for example. The ALMA spectra are very rich in various molecular lines, and there is a very large variety in the molecular line intensities (A17).

We first applied NRPCA and A-SPCA, one of the latest methodologies in high-dimensional statistical analysis, to the original ALMA map of NGC 253. We found that high-dimensional PCA worked very well, and it could describe the global trend of the spectra only by the first two PCs (the contribution of the first and second PCs is >50%). PC1 approximately represents the total intensity of the spectra, and PC2 describes the Doppler shifts due to the systemic rotation of the observed region. Each PC consists of ∼5–20 elements, much fewer than d. We note that practically the number of the controlling factors reduces further since these contributing elements are part of the same spectral features. The controlling features were the HCN(4–3) and HNC(4–3) rotational lines. This result strongly guarantees that the high-dimensional statistical analysis methodology works appropriately for the spectral map as HDLSS data.

Then, we applied high-dimensional PCA to the Doppler-corrected ALMA map. The resulting eigenspectra of the NRPCA correspond to more specific spectral features, especially to the core or wings of the emission lines. While PC1 again represents the total intensity of the spectral map, PC2 appears to delineate a 10 pc scale outflow phenomenon of molecular gas. More impressive is that PC3 and PC4 represent a larger-scale coherent flow causing a blueshift of emission lines, and PC5 the redshift. The region of the ALMA map where PC3 is negative concentrates around the position of the radio continuum of NGC 253. This map then implies that the blueshift reflects the global larger-scale outflow from the radio center of the galaxy moving toward us. To examine this, we compared the region with the observed molecular outflow reported by Krieger et al. (2019). We find that the negative PC3 region agrees well with the root of the molecular outflow, named the SW streamer. This was also confirmed by the anomalous region of the velocity field in the Doppler-uncorrected map. Further, this region also gives a close agreement with the outflow of ionized gas (Matsubayashi et al. 2009). All these suggest that the PCA indeed extracted the dynamical features from the ALMA spectroscopic map, typical HDLSS data.

We stress that this result is not obtained by choosing a handful of features by hand, but by making use of the full information on the high-dimensional data. Spectroscopic maps are typical HDLSS data, and we can apply high-dimensional statistical analysis to various unsolved problems in astrophysics. As mentioned in Section 1, the condition also applies to spectroscopic mapping with satellite instruments, mainly due to the strong limitation of observation time and mission lifetime. The high-dimensional methodology will shed light on the spectral map of future space telescopes. For example, such surveys of nearby galaxies with SPICA have been discussed in detail. Though SPICA has been canceled, this methodology will remain useful for a future and/or possible replacement project. High-dimensional statistical analysis can also be applied to any type of data with a small sample size and large dimension, e.g., spectroscopy of rare objects.

Acknowledgments

We deeply thank the anonymous referee for the careful reading of the manuscript and constructive comments that greatly improved the clarity of the paper. This paper makes use of the following ALMA data: ADS/JAO.ALMA#2013.1.00099.S, ADS/JAO.ALMA#2013.1.00735.S. ALMA is a partnership of ESO (representing its member states), NSF (USA) and NINS (Japan), together with NRC (Canada), MOST and ASIAA (Taiwan), and KASI (Republic of Korea), in cooperation with the Republic of Chile. The Joint ALMA Observatory is operated by ESO, AUI/NRAO and NAOJ. This work has been supported by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (19H05076, 21H01128, 19K03937, and JP17H06130). This work has also been supported in part by the Sumitomo Foundation Fiscal 2018 Grant for Basic Science Research Projects (180923), the Collaboration Funding of the Institute of Statistical Mathematics "New Perspective of the Cosmology Pioneered by the Fusion of Data Science and Physics," and the NAOJ ALMA Scientific Research grant No. 2017-06B.

S.C. has been financially supported by the JSPS as a research fellow (DC1). We deeply thank the SPICA Science Team, especially members of the SPICA and the Galaxy and Nearby Galaxies Science Working Group (Fumi Egusa, Hanae Inami, Hiroyuki Kaneko, Itsuki Sakon, Yoichi Tamura, Junichi Baba, Kentaro Motohara, and Yoshimasa Watanabe), to which T.T.T. and K.N. belong. We are also grateful to Yugo Nakayama, Shiro Ikeda, and Nanase Harada for fruitful discussions.

Facility: ALMA - Atacama Large Millimeter Array

Software: astropy (Astropy Collaboration et al. 2013, 2018), casa (McMullin et al. 2007), topcat (Taylor 2005)

Appendix A: Merits of the Dual Sample Covariance

We note that the sample covariance matrix $\tilde{S}$ and the dual sample covariance ${\tilde{S}}_{{\rm{D}}}$ share their nonzero eigenvalues. We also note that the eigenvectors of $\tilde{S}$ can be calculated by the eigenvectors of ${\tilde{S}}_{{\rm{D}}}$. See Equation (13). Thus, the merits of the dual sample covariance (n × n) are rather obvious:

  • 1.  
    We can reduce the computational cost significantly.
  • 2.  
    We can visualize the behavior of data more easily than the original data matrix (d × d).
  • 3.  
    Asymptotic theory can be developed on the dual space.

Appendix B: Geometric Representations of ${\tilde{S}}_{{\rm{D}}}$

In this section, we give geometric representations of ${\tilde{S}}_{{\rm{D}}}$ when d while n is fixed (Yata & Aoshima 2010b; YA12). The geometric representations are keys to dealing with HDLSS data by high-dimensional statistical analysis.

We consider the following sphericity condition:

Equation (B1)

When $\tilde{X}$ is Gaussian, Ahn et al. (2007), Jung & Marron (2009), and YA12 showed a geometric representation as

Equation (B2)

Here, the λi s are eigenvalues of the covariance matrix $\tilde{{\rm{\Sigma }}}$, i.e., not the sample covariance $\tilde{S}$. Let

Equation (B3)

and ${{ \mathcal S }}^{n-1}\equiv \left\{{\boldsymbol{x}}\in {{\mathbb{R}}}^{n}\,| \,\parallel {\boldsymbol{x}}\parallel =1\right\}$: a unit (n − 1)-sphere. 26 YA12 showed that

Equation (B4)

that is, the surface concentration of the unit n-sphere. From the geometric representation by Equation (B4), we observe that the sample eigenvalues become deterministic and converge to the same value. This means that it becomes impossible to estimate the eigenvalues and the eigenvectors by using ${\tilde{S}}_{{\rm{D}}}$ (or $\tilde{S}$) as in the conventional method.

When $\tilde{X}$ is non-Gaussian, Yata & Aoshima (2010b) and YA12 discovered a different type of geometric representation,

Equation (B5)

where ${\tilde{D}}_{n}$ is a diagonal matrix whose diagonal elements are of OP(1), and ${\tilde{O}}_{n}$ is a zero matrix of dimension n × n. Let

Equation (B6)

(${{ \mathcal A }}^{n}\subset {{\mathbb{R}}}^{n}$). When $\tilde{X}$ is non-Gaussian, YA12 showed that

Equation (B7)

that is, the axis concentration of the n-dimensional dual space. From the geometric representation by Equation (B7), we can see that the eigenvectors of the data matrix converge to the unit vector of one of the axes of ${{\mathbb{R}}}^{n}$ and the eigenvalues would be indefinite.

We consider $({z}_{1k}^{2}-1,...,{z}_{{dk}}^{2}-1)\ (k=1,...,n)$ and its covariance matrix $\tilde{{\rm{\Phi }}}=({\phi }_{{ij}})$. YA12 considered a boundary that separates the two cases described by Equations (B4) and (B7) as follows.:

Equation (B8)

Note that Equation (B8) is equivalent to Equation (B1) under Condition (16). Based on this, YA12 proposed the following important theorem.

YA12 I.

Theorem B.1 Assume the sphericity condition (Equation (B1)). If the elements of $\tilde{Z}$ satisfy Equation (B8), we have Equations (B2) and (B4) as $d\to \infty $. Otherwise, Equation (B5) holds as $d\to \infty $, and under some regularity conditions, we have Equation (B7) as $d\to \infty $.

Now we are ready to examine the geometric behavior of the noise in the dual space. Consider a sample with a d-dimensional Gaussian distribution with a mean of μ = 0 and covariance matrix ${\tilde{{\rm{\Sigma }}}}_{d}={\tilde{I}}_{d}$. If we generate a sample n = 3 from this distribution, we obtain a sample distribution as displayed in Figure 20. For a low d, we have a sample distribution not very far from our intuition, i.e., the sample data concentrate on the origin and have a certain dispersion around it. However, for a higher d, the sample data concentrate on a sphere with a radius of ${(d/n)}^{\tfrac{1}{2}}$ (left panel of Figure 20). This clearly shows the situation of high-dimensional data, namely, the intrinsic feature of the data is completely overwhelmed by a tremendous disturbance from the huge noise sphere.

Figure 20.

Figure 20. A sample distribution drawn from two typical probability density distributions. Left: a sample generated from a d-dimensional Gaussian distribution with a mean of μ = 0 and a covariance matrix ${\tilde{{\rm{\Sigma }}}}_{d}={\tilde{I}}_{d}$ (${\tilde{I}}_{d}$: a d-dimensional identity matrix). Right: a sample generated from a d-dimensional t-distribution with td (0, Id , ν), with a mean of μ = 0, a covariance matrix Id , and a degrees of freedom ν = 5.

Standard image High-resolution image

The behavior of the sample is even more counterintuitive for a non-Gaussian case. If we generate a sample from a d-dimensional t-distribution td (0, Id , ν), with a mean of μ = 0, a covariance matrix Id, and degrees of freedom ν = 5, we observe a very particular behavior (right panel of Figure 20). All the data concentrate on one of the axes for the high-d cases. This is referred to as the axis concentration, typically seen for a non-Gaussian density distribution in high dimensions.

Thus, we understood the mathematical reason why we have the counterintuitive behaviors of the HDLSS data in Figure 20. This also gives the important basis of the central method we apply in this work.

Appendix C: Consistency for a Non-Gaussian Case

In this section, we give the consistency Equation (17) when Z does not satisfy Condition (16).

Yata & Aoshima (2009) II.

Theorem C.1 If Z does not satisfy Condition (16), Equation (17) holds for $i\leqslant m$ under the conditions

  • 1.  
    $d\to \infty $ and $n\to \infty $ for i that satisfies ${\alpha }_{i}\gt 1$.
  • 2.  
    $d\to \infty $ and ${d}^{2-2{\alpha }_{i}}/n(d)\to 0$ for i that satisfies ${\alpha }_{i}\in (0,1]$.

Theorems 2.1 and C.1 give the condition that the estimation by traditional PCA would have consistency. In (2), n is heavily dependent on d when the components of $\tilde{Z}$ do not satisfy Condition (16). For the non-Gaussian case, Yata & Aoshima (2010a) proposed a new robust PCA called the cross-data-matrix methodology.

Appendix D: Comparison between Traditional and High-dimensional PCA

As we mentioned in the main text, high-dimensional data such as the ALMA data are regarded as a sum of informative signals and noise. Particularly for the case of HDLSS data, the traditional statistical estimation is seriously hampered by the nuisance from the noise. In order to demonstrate the contribution of the noise sphere to the statistical estimation of HDLSS data, we present ratios of PCA eigenvalues estimated with traditional and high-dimensional PCA. For the latter, we show the estimates obtained by NRPCA. As we mentioned in the main text, the eigenvalues are significantly overestimated by traditional PCA when the data are HDLSS type (Equation (20)). This effect can be visualized by taking a ratio of eigenvalues obtained by traditional PCA (${\hat{\lambda }}_{i}$) and that obtained by NRPCA (${\check{\lambda }}_{i}$).

The result is presented in Figure 21. The left panel of Figure 21 is the result of the raw data (without Doppler velocity subtraction), while the right panel shows that of the Doppler-corrected data. In both cases, it is clearly shown that traditionally estimated eigenvalues are systematically larger than the NRPCA ones. We should note that the contribution of noise almost monotonically increases for higher-order eigenvalues, both for the raw and Doppler-corrected data sets. This is because the eigenvalues are smaller for higher-order indices with respect to the noise contribution (see Equation (20)).

Figure 21.

Figure 21. Ratio of eigenvalues obtained by traditional PCA (${\hat{\lambda }}_{i}$) and that obtained by high-dimensional (NR)PCA (${\check{\lambda }}_{i}$). Horizontal axes represent the index of eigenvalues, and vertical axes show the ratio ${\hat{\lambda }}_{i}/{\check{\lambda }}_{i}$ (i = 1, 2,...). The left panel of Figure 21 shows the result of the raw data (without Doppler velocity subtraction), while the right panel shows that of the Doppler-corrected data.

Standard image High-resolution image

We also note that since the data dimension d is not extremely higher than data size n, the overestimation of the eigenvalues is relatively small. We stress, however, that all the values are overestimated due to the additional contribution from noise. Namely, for forthcoming data that have a much stronger HDLSS property than that of ALCHEMI, this bias will become much larger. Thus, this examination shows the potential power of high-dimensional PCA and high-dimensional statistical analysis in general.

Appendix E: Example of a Doppler Shift Correction of a Line

We show an example of the Doppler shift correction, HCN(4–3) line in Figure 22. Red, green, and blue represent the corrected spectra from different spatial positions on the map of NGC 253, and we see that the line center agrees with its rest-frame frequency. Since each line has a complicated profile, the line center was determined by the Gaussian fitting. The blue vertical solid line represents the rest-frame frequency of the HCH(4–3) line, 354.5 GHz.

Figure 22.

Figure 22. Example of the Doppler shift-corrected molecular emission line, HCN(4–3). Red, green, and blue lines represent the corrected spectra so that the line center agrees with its rest-frame frequency.

Standard image High-resolution image

Footnotes

  • 10  
  • 11  
  • 12  
  • 13  
  • 14  
  • 15  
  • 16  
  • 17  

    The first consideration on the HDLSS asymptotics, however, dates back to the 1980s in the work of Casella & Hwang (1982).

  • 18  

    More precisely, $\mathop{\mathrm{lim}}\limits_{d/n\to 0}{\mathbb{P}}\left(\parallel {{\boldsymbol{x}}}_{n}-{\boldsymbol{\mu }}\parallel \gt \epsilon \right)=0$ for any epsilon > 0. Here, ∥ · ∥ stands for the Euclid norm, and ${\mathbb{P}}(A)$ is the probability of an event A. Similarly, $\mathop{\mathrm{lim}}\limits_{d/n\to \infty }{\mathbb{P}}\left(\parallel {{\boldsymbol{x}}}_{n}-{\boldsymbol{\mu }}\parallel \lt \epsilon \right)=0$ for any epsilon > 0 for Equation (3).

  • 19  

    One may be skeptical as to whether such a restrictive model can really describe the general properties of the eigenvalue distribution of real data. However, as we see in Sections 4 and 5, the generalized spiked model well represents the actual eigenvalues.

  • 20  

    In this context, consistency stands for the situation as follows: consider an estimator t(n) (n: sample size) is a consistent estimator for a parameter θ if and only if, for all epsilon > 0, we have

  • 21  

    If αi > 1/2, the NR methodology has consistency even when n is much smaller than d such as $n=\mathrm{log}d$.

  • 22  

    The value of ω = 1/2 was obtained by simulations.

  • 23  

    Note that $n\gt \mathrm{log}d=7.59$ when d = 1971 and n = 231, so that the NR methodology works well for the HDLSS data. See footnote 21.

  • 24  

    The existence of continuum radiation makes the situation even more complicated, but we do not try to examine this problem in the current work.

  • 25  

    We evaluated the advantage of NRPCA compared to traditional PCA through the eigenvalue estimation. This is presented in Appendix D.

  • 26  

    YA12 denoted ${{ \mathcal S }}^{n-1}$ as Rn We used a more familiar notation in geometry in this work.

Please wait… references are loading.
10.3847/1538-4365/ad2517