Cooling Improves Cosmic Microwave Background Map-making when Low-frequency Noise is Large

Bai-Chiang Chiang; Kevin M. Huffenberger

doi:10.3847/1538-4357/ac31ab

1. Introduction

In observations of the cosmic microwave background (CMB), map-making is an intermediate step between the collection of raw scanning data and the scientific analyses, such as the estimation of power spectra and cosmological parameters. Next generation CMB observations will generate much more data than those today, and so it is worth exploring efficient ways to process the data even though, on paper, the map-making problem has long been solved.

The time-ordered scanning data is summarized by

$\begin{eqnarray}&&{\boldsymbol{d}}=P{\boldsymbol{m}}+{\boldsymbol{n}},\end{eqnarray} \tag{ 1 }$

where d , m , and n are the vectors of time-ordered data (TOD), the CMB sky-map signal, and measurement noise. P is a sparse matrix in the time-by-pixel domain that encodes the telescope's pointing. Of several map-making methods (Tegmark 1997), one of the most common is the method introduced for the Cosmic Background Explorer (COBE; Janssen & Gulkis 1992). This optimal, linear solution is

$\begin{eqnarray}&&({P}^{\dagger }{N}^{-1}P)\hat{{\boldsymbol{m}}}={P}^{\dagger }{N}^{-1}{\boldsymbol{d}},\end{eqnarray} \tag{ 2 }$

where $\hat{{\boldsymbol{m}}}$ provides the standard generalized least squares minimization of the χ² statistic,

$\begin{eqnarray}&&{\chi }^{2}({\boldsymbol{m}})\equiv {\left({\boldsymbol{d}}-P{\boldsymbol{m}}\right)}^{\dagger }{N}^{-1}({\boldsymbol{d}}-P{\boldsymbol{m}}).\end{eqnarray} \tag{ 3 }$

Here we assume that the noise has zero mean 〈 n 〉 = 0, and the noise covariance matrix N = 〈 n n ^†〉 is diagonal in frequency space. In the case where the noise is Gaussian, the COBE solution is also the maximum likelihood solution.

With current computational power, we cannot solve for $\hat{{\boldsymbol{m}}}$ by calculating ${({P}^{\dagger }{N}^{-1}P)}^{-1}{P}^{\dagger }{N}^{-1}{\boldsymbol{d}}$ directly. The noise covariance matrix N is often sparse in the frequency domain and the pointing matrix P is sparse in the time-by-pixel domain. In experiments currently under design, there may be ∼10¹⁶ time samples and ∼10⁹ pixels, so these matrix inversions are intractable unless the covariance is uniform (proportional to the identity matrix I). We can use iterative methods, such as conjugate gradient descent, to avoid the matrix inversions, and execute each matrix multiplication in a basis where the matrix is sparse, using a fast Fourier transform to go between the frequency and time domain.

As an alternative to conjugate gradient descent, Huffenberger & Næss (2018) showed that the "messenger" iterative method could be adapted to solve the linear map-making system, based on the approach from Elsner & Wandelt (2013) to solve the linear Wiener filter. This technique splits the noise covariance into a uniform part and the remainder, and introduces an additional vector that represents the signal plus uniform noise. This messenger field acts as an intermediary between the signal and the data and has a covariance that is conveniently sparse in every basis. Elsner & Wandelt (2013) also introduced a cooling scheme that takes advantage of the split covariance: over the course of the iterative solution, we adjust the relative weight of those two parts. Starting with the uniform covariance, the modified linear system gradually transforms to the final system, under the control of a cooling parameter. In numerical experiments, Huffenberger & Næss (2018) found that a map produced by the cooled messenger method converged significantly faster than for standard conjugate gradient methods, and to higher fidelity, especially on large scales.

Papež et al. (2018) showed that the messenger field approach is equivalent to a fixed-point iteration scheme, and studied its convergence properties in detail. Furthermore, they showed that the split covariance and the modified system that incorporates the cooling can be solved by other means, including a conjugate gradient technique, which should generally show better convergence properties than the fixed-point scheme. With simulations, we have confirmed this conclusion. In their numerical tests, Papež et al. (2018) did not find benefits to the cooling modification of the map-making system, in contrast to the findings of Huffenberger & Næss (2018).

In this paper, we show that the difference arose because the numerical tests in Papež et al. (2018) used much less low-frequency (or 1/f) noise than Huffenberger & Næss (2018), and show that the cooling technique improves map-making performance especially when the low-frequency noise is large. This performance boost depends on a proper choice for the pace of cooling. Kodi Ramanah et al. (2017) showed that for the Wiener filter the cooling parameter should be chosen as a geometric series. In this work, we give an alternative interpretation of the parameterizing process and show that for map-making the optimal choice (unsurprisingly) is also a geometric series.

In Section 2 we describe our methods for treating the map-making equation and our numerical experiments. In Section 3 we present our results. In Section 4, we list our conclusions. In the Appendix we derive the prescription for our cooling schedule.

2. Methods

2.1. Parameterized Conjugate Gradient Method

The messenger field approach introduced an extra cooling parameter λ to the map-making equation, and solved the linear system with an alternative parameterized covariance $N(\lambda )=\lambda \tau I+\bar{N}$ . The parameter $\tau =\min (\mathrm{diag}(N))$ represents the uniform level of (white) noise in the original covariance. The remainder $\bar{N}\equiv N-\tau I$ is the nonuniform part of the original noise covariance. (Here N without any arguments denotes the original noise covariance matrix N = 〈 n n ^†〉.) In this work we find it more convenient to work with the reciprocal of cooling parameter η = λ⁻¹, which represents the degree of heteroscedasticity (nonuniformity) in the parameterized covariance

$\begin{eqnarray}&&N(\eta )=\tau I+\eta \bar{N}.\end{eqnarray} \tag{ 4 }$

When η = 1 this parameterized covariance N(η) equals N.

Papež et al. (2018) showed that the conjugate gradient method can be easily applied to the cooled map-making problem. In our notation, this is equivalent to iterating on the parameterized map-making equation

$\begin{eqnarray}&&\left({P}^{\dagger }N{({\eta }_{i})}^{-1}P\right)\hat{{\boldsymbol{m}}}({\eta }_{i})={P}^{\dagger }N{\left({\eta }_{i}\right)}^{-1}{\boldsymbol{d}},\end{eqnarray} \tag{ 5 }$

as we adjust the parameter through a set of levels {η_i}. (We use $\hat{{\boldsymbol{m}}}$ with no η argument to mean the estimated $\hat{{\boldsymbol{m}}}$ in Equation (2), independent of η.) This equation leads to the same system as the parameterized equation in the messenger field method, because N(η) = λ⁻¹ N(λ) and the condition number do not change upon scalar multiplication to both sides of the equation. For concreteness we fix the preconditioner to M = P^† P for all calculations.

When η = 0, the noise covariance matrix N(0) is homoscedastic (uniform), and the solution is given by the simple binned map $\hat{{\boldsymbol{m}}}(0)={\left({P}^{\dagger }P\right)}^{-1}{P}^{\dagger }{\boldsymbol{d}}$ , which can be solved directly.

Since the non-white part $\bar{N}$ is the troublesome portion of the covariance, we can think of the η parameter as increasing the heteroscedasticity of the system, adding a perturbation to the solution achieved at a particular stage, building ultimately upon the initial uniform covariance model. Therefore, this quasi-static process requires η increase as 0 = η₀ ≤ η₁ ≤ ⋯ ≤ η_final = 1, at which point we arrive at the desired map-making equation, and the solution $\hat{{\boldsymbol{m}}}(1)=\hat{{\boldsymbol{m}}}$ .

We may iterate more than once at each intermediate η_i: we solve Equation (5) with conjugate gradient iterations using the result from the previous calculation $\hat{{\boldsymbol{m}}}({\eta }_{i-1})$ as the initial value. We move to the next parameter η_i+1 when the norm of the residual vector

$\begin{eqnarray}&&\parallel {\boldsymbol{r}}({\boldsymbol{m}},{\eta }_{i})\parallel \equiv \parallel {P}^{\dagger }N{\left({\eta }_{i}\right)}^{-1}P\,{\boldsymbol{m}}-{P}^{\dagger }N{\left({\eta }_{i}\right)}^{-1}{\boldsymbol{d}}\parallel \end{eqnarray} \tag{ 6 }$

is an order of magnitude smaller than the norm of the right-hand side of Equation (5).

$\begin{eqnarray}&&\parallel {\boldsymbol{r}}({\boldsymbol{m}},{\eta }_{i})\parallel \lt 0.1\ \parallel {P}^{\dagger }N{\left({\eta }_{i}\right)}^{-1}{\boldsymbol{d}}\parallel .\end{eqnarray} \tag{ 7 }$

This is not stringent enough to completely converge at this η-level, but we find that it causes the system to converge sufficiently to allow us to move on to the next η.

2.2. Analytical Expression for {η_i} Series

The next question is how to appropriately choose these monotonically increasing parameters η. We also want to determine η₁, ⋯ ,η_n−1 before starting conjugate gradient iterations, because the time-ordered data d is very large, and we do not want to keep it in the system memory during calculation or repeatedly read it in from the disk. If we determine η₁, ⋯ ,η_n−1 before the iterations, then we can precompute the right-hand side of Equation (5) for each η_i and keep these map-sized objects, instead of the entire time-ordered data.

In the Appendix, we show that a generic good choice for the η parameters is given by this geometric series

$\begin{eqnarray}&&{\eta }_{i}=\min \left\{({2}^{i}-1)\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})},\,1\right\},\end{eqnarray} \tag{ 8 }$

where ${\bar{N}}_{f}$ are the eigenvalues of $\bar{N}$ under frequency representation. This is one of our main results. It not only tells us how to choose parameters η_i, but also when we should stop the perturbation, and set η = 1. For example, if the noise covariance matrix N is almost uniform, then $\bar{N}=N-\tau I\approx 0$ , and we would have $\tau /\max ({\bar{N}}_{f})\gt 1$ . This tell us that we do not need to use the parameterized method at all, because η₀ = 0 and η₁ = η₂ = ⋯ = 1. This corresponds to the standard conjugate gradient method with a simple binned map as the initial guess (as recommended by Papež et al. 2018).

2.3. Intuitive Interpretation of η

This section includes a way to interpret the role of η that is less technical than in the Appendix. Our ultimate goal is to find $\hat{{\boldsymbol{m}}}(1)$ , which minimizes χ²( m ) in Equation (3). Since N is diagonal in frequency space, χ² could be written as a sum of all frequency modes ∣( d − P m )_f∣² with weight ${N}_{f}^{-1}$ , such as ${\chi }^{2}{({\boldsymbol{m}})={\sum }_{f}| ({\boldsymbol{d}}-P{\boldsymbol{m}})}_{f}{| }^{2}{N}_{f}^{-1}$ . The weight is large for low-noise frequency modes (small N_f), and small for high-noise modes, which means χ²( m ) would favor the low-noise modes, and therefore the conjugate gradient map-making focuses on minimizing the error *ε ≡ d − P m in the low-noise part.

After introducing η, we minimize χ²( m , η) in Equation (A1) instead. For η = 0, N⁻¹(0) ∝ I the system is homoscedastic and the estimated map $\hat{{\boldsymbol{m}}}(0)$ does not prioritize any frequency modes. As we slowly increase η, we decrease the weight for the high-noise modes, and focus on minimizing error for the low-noise part. If we start with η₁ = 1 directly, which corresponds to the vanilla conjugate gradient method, then the algorithm will focus mostly on minimizing the low-noise part, such that χ² would converge very quickly on the low-noise modes (typically high temporal frequencies and small spatial scales), but slowly on the high-noise modes (low frequencies and large scales). However, by introducing the η parameter, we let the solver first treat every frequency equally. Then as η slowly increases, it gradually gives more focus on the lowest noise part.

2.4. Computational Cost

To properly compare the performance cost of this method with respect to the vanilla conjugate gradient method with the simple preconditioner, we need to compare their computational cost at each iteration. We could define $A(\eta )\equiv {P}^{\dagger }N{\left(\eta \right)}^{-1}P$ and ${\boldsymbol{b}}(\eta )\equiv {P}^{\dagger }N{\left(\eta \right)}^{-1}{\boldsymbol{d}}$ , and Equation (5) could be written as $A({\eta }_{i})\hat{{\boldsymbol{m}}}({\eta }_{i})={\boldsymbol{b}}({\eta }_{i})$ . The right-hand side b (η_i) could be computed before iterating, since we have determined η_i in advance, so it will not introduce extra computational cost. The most demanding part of the conjugate gradient method is calculating its left-hand side A(η_i) m , because it contains a Fourier transform of P m from the time domain to frequency domain and an inverse Fourier transform of $N{\left({\eta }_{i}\right)}^{-1}P{\boldsymbol{m}}$ from the frequency domain back to time domain, which is order ${ \mathcal O }(n\mathrm{log}n)$ with n being the length of time-ordered data. Compared to the traditional conjugate gradient method, we swap N⁻¹ with $N{\left(\eta \right)}^{-1}$ , and the cost is the same for one step, since both methods need a fast Fourier transform and inverse fast Fourier transform at one iteration.

At each η_i level, we use the residual to determine whether to switch to the next level (η_i+1), as in Equation (7). Calculation of the residual vector r ( m , η_i) is part of the conjugate gradient algorithm, so this will not add extra cost either. Therefore, overall introducing the η will not have extra computational cost within the conjugate gradient iterations.

However, we start a new conjugate gradient algorithm whenever η_i updates to η_i+1. Thus we must reinitialize the conjugate gradient algorithm, recalculating the residual r ( m , η_i+1) based on new η_i+1. This residual calculation contains an extra A(η_i) m operation. Therefore, if we have a series ${\eta }_{1},{\eta }_{2},{\eta }_{3},\cdots ,{\eta }_{{n}_{\eta }}$ , there will have n_η − 1 extra A(η) m operations compare to the traditional conjugate gradient method. If the total number of iterations is much larger than n_η, then this extra cost is negligible. For our simulation, this extra step would have a rather significant impact on the final result. To have a fair comparison between the parameterized and traditional conjugate gradient method, we will present our results with the number of ${P}^{\dagger }N{\left(\eta \right)}^{-1}P{\boldsymbol{m}}$ operations as the horizontal axis.

2.5. Numerical Simulations

To compare these algorithms, we need to do some simple simulations of scanning processes, and generate the time-ordered data from a random sky signal.¹ Our sky is a small rectangular area, with two orthogonal directions x and y, both with ranges from –1°to +1°. The signal has Stokes parameters (I, Q, U) for intensity and linear polarization.

For the scanning process, our mock telescope contains nine detectors, each with different sensitivity to polarization Q and U. It scans the sky with a raster scanning pattern. Its scanning frequency is f_scan = 0.1 Hz and sampling frequency is f_sample = 100 Hz. The telescope scans the sky horizontally then vertically. This gives the noiseless signal s . The sky signal in the timestream has an rms of 56 μK. The signal is continuous, so that it has structure on subpixel scales, but we find that our main conclusions remain the same when the input signal is pixelized. In map-making, we digitize the position (x, y) into 512 × 512 pixels.

We model the noise power spectrum with

$\begin{eqnarray}&&P(f)={\sigma }^{2}\left(1+\displaystyle \frac{{f}_{\mathrm{knee}}^{\alpha }+{f}_{\mathrm{apo}}^{\alpha }}{{f}^{\alpha }+{f}_{\mathrm{apo}}^{\alpha }}\right),\end{eqnarray} \tag{ 9 }$

which is white at high frequencies, a power law below the knee frequency, and gives us the option to flatten the low-frequency noise below an apodization frequency (like in Papež et al. 2018). Note that as f_apo → 0, $P(f)\to {\sigma }^{2}\left(1+{\left(f/{f}_{\mathrm{knee}}\right)}^{-\alpha }\right)$ , and it becomes a 1/f-type noise model.

Dünner et al. (2013) measured the slopes of the atmospheric noise in the Atacama under different water vapor conditions, finding α = 2.7 to 2.9. Here we use σ² = 10 μK², α = 3, and compare the performance under different noise models. In our calculations, we choose different combinations of f_knee and f_apo as in Figure 1. The noise spectra with the most low-frequency noise have high f_knee or low cutoff f_apo.

**Figure 1.** Noise power spectra that we use in our map-making simulations. These show a variety of low-frequency behavior, parameterized by Equation (9), with white noise at high frequency and a low-frequency power-law slope α = 3. Here we show two knee frequencies, f_knee = 10 Hz (solid lines) and f_knee = 0.1 Hz (dashed lines). For each knee frequency, we have shown an unflattened spectrum (f_apo = 0 Hz), and two flattened ones (f_apo = 0.1f_knee and 0.01f_knee). The vertical line shows our scanning frequency.
Download figure:
Standard image High-resolution image

The noise covariance matrix

$\begin{eqnarray}&&{N}_{{ff}^{\prime} }=P(f)\displaystyle \frac{{\delta }_{{ff}^{\prime} }}{{{\rm{\Delta }}}_{f}}\end{eqnarray} \tag{ 10 }$

is a diagonal matrix in frequency space, where Δ_f is equal to the reciprocal of total scanning time T ≈ 1.05 × 10⁴ s.

Finally, we get the simulated time-ordered data d = s + n by adding up the signal and noise.

3. Results

We compare the standard conjugate gradient method versus the conjugate gradient with our perturbed linear system. Both methods use the simple preconditioner P^† P. Figure 2 shows the χ² results for 1/f noise models (f_apo = 0) with different knee frequencies. Note that the χ² values in all figures are calculated based on the standard χ²( m ) in Equation (3), not the η-dependent χ²( m , η) of the modified system (Equation (A1)). The minimum ${\chi }_{\min }^{2}$ that we use for comparison is calculated from a deliberately slowed and well-converged parameterized conjugate gradient method: one with 100 η values and that halts when the final norm of the residual $\parallel {\boldsymbol{r}}({\boldsymbol{m}},1)\parallel$ is smaller than ${10}^{-5}\times \parallel {P}^{\dagger }{N}^{-1}{\boldsymbol{d}}\parallel$ , or 100 iterations after η = 1. From Figure 2, we can see for the 1/f noise model, when f_knee ≳ 10f_scan the parameterized method starts showing an advantage over the vanilla conjugate gradient method.

**Figure 2.** Convergence properties depend on the amount of low-frequency noise, which increases from the left panel to the right panel with increasing knee frequency. The map-making Equation (2) minimizes the χ²( m ), so the curve, which falls fastest versus the number of operations is the preferred method. We compare the traditional conjugate gradient method ("CG," blue line) with the parameterized conjugate gradient method ("CG with η," orange line) under different 1/f noise models (fixed f_apo = 0 Hz but different f_knee in Equation (9)). When f_knee ≳ 10 f_scan = 1 Hz, the significant amount of low-frequency noise causes the parameterized conjugate gradient method to start showing its advantage. The vertical axis is rescaled such that all curves start from 1.
Download figure:
Standard image High-resolution image

In Figure 3, we fixed f_knee = 10 Hz, and change f_apo. As we decrease f_apo relative to f_knee, increasing the amount of low-frequency noise, the parameterized conjugate gradient method performs better.

Looking at the power spectrum in Figure 1, when f_knee is small, or f_apo is large, there is not much low-frequency noise. These situations correspond to the left-side plots in Figure 2 and Figure 3. The right-side graphs have a significant amount of low-frequency noise. We conclude that the introduction of the slowly varying η parameter improves performance most when there are large low-frequency noise contributions.

We also tried different 1/f noise slopes α. For α = 2, the conclusion is the same as α = 3. When α = 1, the low-frequency noise is reduced compared to the cases with steeper slopes, and the vanilla conjugate gradient method is preferred, except some cases with very large knee frequency like f_knee = 100 Hz and f_apo = 0, which favors the parameterized method. In Papež et al. (2018), the slope α = 1 and the noise power spectrum is flattened at f_apo ≈ 0.1f_knee. Their knee frequency is the same as their scanning frequency, so is most like our case when f_knee = f_scan = 0.1 Hz. Their case had little low-frequency noise, and we confirm their specific result that the standard conjugate gradient method converges faster in that case. In general, however, we find that cases with significantly more low-frequency noise benefit from the cooling/parameterized approach.

4. Conclusions

We analyzed the parameterized conjugate gradient map-making method that is inspired by the messenger field idea of separating the white noise out of the noise covariance matrix. Then we gave an analytical expression for the series of η parameters that govern how quickly the modified covariance adjusts to the correct covariance, and showed that this method adds only the extra computational cost of reinitializing the conjugate gradient process based on a new η parameter.

We tested this method for different noise power spectra, both flattened and nonflattened at low frequency. The results showed that the parameterized method is faster than the traditional conjugate gradient method when there is a significant amount of low-frequency noise. It could be further improved if we could get a more accurate estimation for the change in χ² as a function of the η parameter, either before iteration or without using time-ordered data during iteration.

Also note that we fixed the preconditioner as M = P^† P during our calculation, this parameterizing process could be applied to any preconditioner and possibly improve performance when there is a significant amount of low-frequency noise.

This type of analysis for the cooling parameter may also be apply to other areas, like the Wiener filter. Papež et al. (2018) showed that the messenger field method of Elsner & Wandelt (2013) for solving Wiener filter problem could also be written as a parameterized conjugate gradient algorithm. It stands to reason that such a system may also benefit from splitting and parameterizing its noise covariance, depending on the noise properties. (In the Wiener filter, Kodi Ramanah et al. 2017 additionally suggest that splitting the signal covariance and combining the uniform parts of the signal and noise.)

The benefits to map-making from a cooled messenger method seem to come from the cooling and not actually from the messenger field that inspired it. However, the messenger field approach may still have a role in the production and analysis of CMB maps. In particular, the close connection between the messenger method and Gibbs sampling may allow us to cheaply generate noise realizations of a converged map by generating samples from the map posterior distribution, something that we will continue to explore in future work.

For this work, B.C. and K.M.H. are supported by NSF award 1815887.

Appendix: The Derivation of η Parameter Series

We know that the initial degree of heteroscedasticity η₀ = 0, which means the system is homoscedastic (uniform noise) to start. What would be a good value for the next parameter η₁? To simplify notation, we use N_η to denote the parameterized covariance matrix $N(\eta )=\tau I+\eta \bar{N}$ . For some specific η value, the estimated map $\hat{{\boldsymbol{m}}}{(\eta )=({P}^{\dagger }{N}_{\eta }^{-1}P)}^{-1}{P}^{\dagger }{N}_{\eta }^{-1}{\boldsymbol{d}}$ minimizes

$\begin{eqnarray}&&{\chi }^{2}({\boldsymbol{m}},\eta )={\left({\boldsymbol{d}}-P{\boldsymbol{m}}\right)}^{\dagger }{N}_{\eta }^{-1}\left({\boldsymbol{d}}-P{\boldsymbol{m}}\right).\end{eqnarray} \tag{ A1 }$

with η being fixed. We restrict to the case that the noise covariance matrix N is diagonal in the frequency domain, and represent the frequency-domain eigenvalues as N_f.

The perturbative scheme works like this. We start with ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})$ with $\hat{{\boldsymbol{m}}}{({\eta }_{0})=({P}^{\dagger }P)}^{-1}{P}^{\dagger }{\boldsymbol{d}}$ , which could be solved directly. Then we use the conjugate gradient method to find $\hat{{\boldsymbol{m}}}({\eta }_{1})$ and its corresponding ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{1}),{\eta }_{1})$ . So let us consider η₁ = η₀ + δ η = δ η such that η₁ = δ η is a very small quantity, δ η ≪ 1. (Remember η₀ = 0.) Since $\hat{{\boldsymbol{m}}}(\eta )$ minimizes χ²( m , η) with η being fixed, we have $\tfrac{\partial }{\partial \hat{{\boldsymbol{m}}}}{\chi }^{2}(\hat{{\boldsymbol{m}}}(\eta ),\eta )=0$ , and using the chain rule

$\begin{eqnarray}\begin{array}{rcl}\displaystyle \frac{{\rm{d}}}{{\rm{d}}\eta }{\chi }^{2}(\hat{{\boldsymbol{m}}}(\eta ),\eta ) & = & \displaystyle \frac{\partial }{\partial \eta }{\chi }^{2}(\hat{{\boldsymbol{m}}}(\eta ),\eta )\\ & = & -{\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}(\eta )\right)}^{\dagger }{N}_{\eta }^{-1}\bar{N}{N}_{\eta }^{-1}({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}(\eta )).\end{array}\end{eqnarray} \tag{ A2 }$

Then the fractional decrease of ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})$ from η₀ to η₁ = δ η is

$\begin{eqnarray}\begin{array}{rcl}-\displaystyle \frac{\delta {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})} & = & -\delta \eta \displaystyle \frac{\tfrac{{\rm{d}}}{{\rm{d}}\eta }{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}\\ & = & \delta \eta \displaystyle \frac{1}{\tau }\displaystyle \frac{{\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{0})\right)}^{\dagger }\bar{N}({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{0}))}{{\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{0})\right)}^{\dagger }\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{0})\right)}.\end{array}\end{eqnarray} \tag{ A3 }$

Here we put a minus sign in front of this expression such that it is nonnegative, and use N_η=0 = τ I at the second equality. We want $| \delta {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})| ={\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0}))-{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{1}),{\eta }_{1})$ to be large to encourage fast convergence. Therefore, ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{1}),{\eta }_{1})$ is much smaller than ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})$ , or ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{1}),{\eta }_{1})\ll {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})$ . Then we would expect

$\begin{eqnarray}&&-\displaystyle \frac{\delta {\chi }^{2}(\hat{{\boldsymbol{m}}}(0),0)}{{\chi }^{2}(\hat{{\boldsymbol{m}}}(0),0)}=1-\displaystyle \frac{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{1}),{\eta }_{1})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}(0),0)}\approx {1}^{-}.\end{eqnarray} \tag{ A4 }$

Here we use the notation 1⁻, whichmeans the upper bound is close to but strictly smaller than 1. Now we can use Equation (A3) and let it equal 1; thus $\delta \eta =-{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})/\tfrac{{\rm{d}}}{{\rm{d}}\eta }{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})$ .

Applying this same idea to η_m+1 = η_m + δ η_m with m ≥ 1, we would get

$\begin{eqnarray}&&\delta {\eta }_{m}=-{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m})/\displaystyle \frac{{\rm{d}}}{{\rm{d}}\eta }{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m}).\end{eqnarray} \tag{ A5 }$

As mentioned in the main text, we need to determine the entire series η_i before conjugate gradient iterations. We do not have the $\hat{{\boldsymbol{m}}}({\eta }_{m})$ with which to calculate them and need to find another approach.

Let us go back to Equation (A3). Since we cannot calculate ${\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{m})$ before making the map, we treat it as an arbitrary vector, then the least upper bound of Equation (A3) is given by

$\begin{eqnarray}&&-\displaystyle \frac{\delta {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}\leqslant \displaystyle \frac{\delta \eta }{\tau }\max ({\bar{N}}_{f}),\end{eqnarray} \tag{ A6 }$

where $\max ({\bar{N}}_{f})$ is the maximum eigenvalue of $\bar{N}$ . We want $-\tfrac{\delta {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{0}),{\eta }_{0})}$ to be as large as possible, but it will not exceed 1. We combine Equations (A4) and (A6), and choose δ η such that the least upper bound is equal to 1, to make sure the process is not going too fast. Thus we have

$\begin{eqnarray}&&{\eta }_{1}=\delta \eta =\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}=\displaystyle \frac{\min ({N}_{f})}{\max ({N}_{f})-\min ({N}_{f})}.\end{eqnarray} \tag{ A7 }$

Here N_f and ${\bar{N}}_{f}$ are the eigenvalues of N and $\bar{N}$ in the frequency domain. If the condition number of noise covariance matrix $\kappa (N)=\max ({N}_{f})/\min ({N}_{f})\gg 1$ , then η₁ ≈ κ⁻¹(N).

What about the other parameters η_m with m > 1? We use a similar analysis, letting η_m+1 = η_m + δ η_m with a small δ η_m ≪ 1. First, let us find the least upper bound

$\begin{eqnarray}&&\begin{array}{l}-\displaystyle \frac{\delta {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m})}{{\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m})}\\ =\ \delta {\eta }_{m}\displaystyle \frac{{\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{m})\right)}^{\dagger }{N}_{{\eta }_{m}}^{-1}\bar{N}{N}_{{\eta }_{m}}^{-1}({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{m}))}{{\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{m})\right)}^{\dagger }{N}_{{\eta }_{m}}^{-1}\left({\boldsymbol{d}}-P\hat{{\boldsymbol{m}}}({\eta }_{m})\right)}\end{array}\end{eqnarray} \tag{ A8 }$

$\begin{eqnarray}&&\leqslant \ \delta {\eta }_{m}\,\max \left(\displaystyle \frac{{\bar{N}}_{f}}{\tau +{\eta }_{m}{\bar{N}}_{f}}\right).\end{eqnarray} \tag{ A9 }$

The upper bound in the second line is a little bit tricky. Both matrix $\bar{N}$ and ${N}_{{\eta }_{m}}^{-1}$ can be simultaneously diagonalized in frequency space. For each eigenvector e _f, the corresponding eigenvalue of the matrix on the numerator ${N}_{{\eta }_{m}}^{-1}\bar{N}{N}_{{\eta }_{m}}^{-1}$ is ${\lambda }_{f}={\bar{N}}_{f}{\left(\tau +{\eta }_{m}{\bar{N}}_{f}\right)}^{-2}$ , and the eigenvalue for the matrix in the denominator ${N}_{{\eta }_{m}}^{-1}$ is ${\gamma }_{f}={\left(\tau +{\eta }_{m}{\bar{N}}_{f}\right)}^{-1}$ . Their eigenvalues are related by ${\lambda }_{f}=[{\bar{N}}_{f}/(\tau +{\eta }_{m}{\bar{N}}_{f})]{\gamma }_{f}$ . For any vector v = ∑_f α_f e _f, we have

$\begin{eqnarray}\begin{array}{rcl}\displaystyle \frac{{{\boldsymbol{v}}}^{\dagger }{N}_{{\eta }_{m}}^{-1}\bar{N}{N}_{{\eta }_{m}}^{-1}{\boldsymbol{v}}}{{{\boldsymbol{v}}}^{\dagger }{N}_{{\eta }_{m}}^{-1}{\boldsymbol{v}}} & = & \displaystyle \frac{{\sum }_{f}{\alpha }_{f}^{2}{\lambda }_{f}}{{\sum }_{f}{\alpha }_{f}^{2}{\gamma }_{f}}\\ & = & \displaystyle \frac{{\sum }_{f}{\alpha }_{f}^{2}{\gamma }_{f}{\bar{N}}_{f}/(\tau +{\eta }_{m}{\bar{N}}_{f})}{{\sum }_{f}{\alpha }_{f}^{2}{\gamma }_{f}}\\ & \leqslant & \max \left(\displaystyle \frac{{\bar{N}}_{f}}{\tau +{\eta }_{m}{\bar{N}}_{f}}\right).\end{array}\end{eqnarray} \tag{ A10 }$

Again assuming ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m+1}),{\eta }_{m+1})\ll {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m})$ , which we expect to be satisfied for η_m ≪ 1. That is because if η ≲ 1, ${\chi }^{2}(\hat{{\boldsymbol{m}}}(\eta ),\eta )$ would be close to the minimum χ², which means ${\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m+1}),{\eta }_{m+1})\lesssim {\chi }^{2}(\hat{{\boldsymbol{m}}}({\eta }_{m}),{\eta }_{m})$ , which would violate our assumption. Luckily, the final result (Equation (A14)) is a geometric series, only the last few η_m values fail to satisfy this condition. Similarly, we could set the least upper bound equal to 1. Then we get

$\begin{eqnarray}&&\delta {\eta }_{m}=\min \left(\displaystyle \frac{\tau +{\eta }_{m}{\bar{N}}_{f}}{{\bar{N}}_{f}}\right)={\eta }_{m}+\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}.\end{eqnarray} \tag{ A11 }$

Therefore

$\begin{eqnarray}&&{\eta }_{m+1}={\eta }_{m}+\delta {\eta }_{m}=2{\eta }_{m}+\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}.\end{eqnarray} \tag{ A12 }$

If written in the form ${\eta }_{m+1}+\tau /\max ({\bar{N}}_{f})=2\left({\eta }_{m}+\tau /\max ({\bar{N}}_{f})\right)$ it is easy to see that for m ≥ 1, ${\eta }_{m}+\tau /\max ({\bar{N}}_{f})$ forms a geometric series

$\begin{eqnarray}\begin{array}{rcl}{\eta }_{m}+\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})} & = & \left({\eta }_{1}+\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}\right){2}^{m-1}\\ & = & \displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}{2}^{m},\end{array}\end{eqnarray} \tag{ A13 }$

where we used ${\eta }_{1}=\tau /\max ({\bar{N}}_{f})$ . Note that m = 0 and η₀ = 0 also satisfy this expression and we have the final expression for all η_m

$\begin{eqnarray}&&{\eta }_{m}=\min \left\{1,\,\displaystyle \frac{\tau }{\max ({\bar{N}}_{f})}({2}^{m}-1)\right\}.\end{eqnarray} \tag{ A14 }$

Here we need to truncate the series when η_m > 1.

In numerical simulations, we find that if we update the η parameter based on the more precise Equation (A5), there is only marginal improvement over the η series given in Equation (A14) for the 1/f noise model, and slight improvement when there is not much low-frequency noise. If we use Equation (A5), it ends up with fewer η parameters in the series, but the interval between η_i and η_i+1 gets larger. In our simulation this sometimes causes one more iteration at a certain η value, so in the end there are only slight improvements. For a large data set that needs a lot of iterations to converge from η_i to η_i+1, where several extra iterations may not be significant, using Equation (A5) may provide a larger performance boost.

Cooling Improves Cosmic Microwave Background Map-making when Low-frequency Noise is Large

Article metrics

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction

2. Methods

2.1. Parameterized Conjugate Gradient Method