AESTRA: Deep Learning for Precise Radial Velocity Estimation in the Presence of Stellar Activity

Yan Liang; Joshua N. Winn; Peter Melchior

doi:10.3847/1538-3881/ad0e01

1. Introduction

The search for Earth-like planets in the habitable zones of Sun-like stars is one of the premier challenges of modern astronomy. Despite significant advancements in instrumentation and data analysis techniques, such as the achievement of ∼0.3 m s⁻¹ precision in recent ESPRESSO observations of Proxima Centauri (Mascareño et al. 2020), the detection of Earth-like planets will require further advances. The next generation of Extremely Precise Radial Velocity (EPRV) instruments is designed to achieve an instrumental precision of ∼0.1 m s⁻¹ (Hall et al. 2018; Blackman et al. 2020). However, the emergent spectrum of a star is inherently variable, leading to apparent variations in the star's radial velocity that can mimic or mask planetary signals. This "radial-velocity jitter" arises from magnetic activity, granulation, and acoustic oscillations, which have different characteristic amplitudes and timescales (Haywood et al. 2014; Milbourne et al. 2019; Mascareño et al. 2020). Given that even the quietest stars exhibit RV jitter of approximately 2 m s⁻¹ (Brems et al. 2019), it is essential to mitigate the influence of stellar activity signals on EPRV measurements if we are to have any chance of detecting Earth-like exoplanets.

Numerous data analysis techniques have been proposed to determine true RVs from stellar spectra in the presence of activity. A common approach is to extract RVs using traditional techniques, such as cross-correlation with a spectral template, and then decorrelate the apparent RV time series with various activity indices that are derived independently from the spectra (Giguere et al. 2016; Milbourne et al. 2019). Another approach, sometimes combined with the previous approach, is to model the RV time series using Gaussian Process (GP) regression to represent the effects of activity (Haywood et al. 2014; Mascareño et al. 2020). There are also data-driven techniques, such as performing shift-invariant operations on the cross-correlation function (CCF) to isolate activity-induced signals (Collier Cameron et al. 2021), and performing full spectral modeling to address contamination due to variability in telluric absorption lines (Bedell et al. 2019). Supervised machine-learning techniques have been applied to recognize spurious RV signals associated with changes in shape of the CCF (de Beurs et al. 2022).

On the observational side, there has been a growing trend toward more intensive and longer-term monitoring of stars. It is no longer unusual to see hundreds of RVs gathered for a single object, and we will probably begin seeing thousands of RVs in the coming decade (Crass et al. 2021). For example, the Terra Hunting Experiment will use a high-resolution echelle spectrograph and an automated 2.5 m telescope to measure RVs of a few dozen stars on every clear night for at least 10 yr (Thompson et al. 2016; Hall et al. 2018). Such intensive monitoring may open up new possibilities for stellar activity mitigation by providing larger samples of activity-related perturbations.

In this paper, we present AESTRA (Auto-Encoding STellar Radial-velocity and Activity), a deep-learning method for distinguishing between true Doppler shifts and activity-induced spectral perturbations, given a large sample of stellar spectra covering a variety of the star's activity states and Doppler shifts. The architecture of AESTRA combines a neural network for radial-velocity estimation and a spectrum autoencoder called spender (Liang et al. 2023; Melchior et al. 2023) for activity modeling. AESTRA is designed to disentangle pure Doppler shifts from apparent Doppler shifts due to spectral perturbations, without any prior knowledge of the star's spectrum or orbital motion.

The spectrum autoencoder architecture spender was originally built for analyzing and representing galaxy spectra in the presence of differing cosmological redshifts. It features an attentive autoencoder that scans over the wavelength range and compresses the information in each spectrum into a small number of parameters, which ideally should be independent of redshift (Liang et al. 2023), and a decoder capable of transforming the parameters back into an accurate reconstruction of the spectrum in the rest frame (i.e., without any redshift). Spender proved capable of compressing ∼500,000 galaxy spectra from the Sloan Digital Sky Survey (Ahumada et al. 2020) into a low-dimensional latent space while maintaining high fidelity in the reconstructed spectra (Liang et al. 2023; Melchior et al. 2023). Cosmological redshifts are many orders of magnitude larger than the Doppler shifts associated with exoplanets; for a Earth-like planet around a Sun-like star, z ∼ 10⁻¹⁰. Nevertheless, we wondered whether the ability of spender to identify and isolate intrinsic spectral features while disregarding Doppler shifts and other artifacts would be applicable to exoplanet detection.

The remainder of the paper is structured as follows. Section 2 describes the architecture of AESTRA and the loss functions that are used for training. Section 3 presents the application of AESTRA to simulated data sets. Finally, Section 4 summarizes our findings, addresses limitations and potential extensions of AESTRA, and discusses potential applications to solar spectroscopy and exoplanet surveys.

2. Methods

Figure 1 illustrates the architecture of AESTRA. The input consists of a collection of hundreds or more spectra of a single star, obtained over a range of different activity states and different phases of the orbital motion of any planets. The input is processed in parallel by a neutral network RV estimator and a spectrum autoencoder. The RV estimator learns the effect of a Doppler shift: a pure stretching of the wavelength scale, regardless of any other details of the spectrum. Training of the RV estimator network does not require a spectral template or line list, or indeed any prior knowledge about the star. Instead, we introduce artificial Doppler shifts into the input spectra and optimize the RV estimator network to recover these shifts. Within the RV estimator are filters (convolutional kernels) that sweep across the residual spectrum. Ideally, during training, the parameters of these filters should be tuned to recognize the patterns that result from Doppler shifts, while also learning to ignore photon noise. Thus, this part of AESTRA offers an alternative approach to traditional RV extraction methods involving template matching and cross-correlation functions.

**Figure 1.** `AESTRA` combines a convolutional RV estimator, whose output is denoted v_encode, and an attentive autoencoder that represents activity-induced spectral features with a latent vector s . The decoder network expands s into a rest-frame spectrum y _act, which is shifted into the observed frame to produce the reconstructed spectrum ${{\boldsymbol{y}}}_{\mathrm{obs}}^{{\prime} }$ . The architecture is trained with a fidelity loss function that ensures ${{\boldsymbol{y}}}_{\mathrm{obs}}\approx {{\boldsymbol{y}}}_{\mathrm{obs}}^{{\prime} }$ and an RV-consistency loss function that seeks to recover v_offset from an artificially Doppler-shifted "augment" spectrum y _aug. The combined loss function is designed to disentangle activity-induced spectral features from those induced by Doppler shift.
Download figure:
Standard image High-resolution image

This RV estimator, trained to recognize pure wavelength stretching, can accurately determine velocity differences between stellar spectra exhibiting the same activity state. However, when measuring RV differences between spectra of different activity states, the RV estimator is subject to biases due to spectral distortions that mimic Doppler shifts. In the absence of any "ground truth" about the star and its motion, we rely on a key assumption: the star's orbital motion is uncorrelated with its intrinsic variability during the observation period. Assuming this condition holds, we can mitigate the activity-induced RV bias by decorrelating the RV estimator output against activity indicators. These indicators should capture the star's activity state while remaining unaffected by Doppler shifts.

In this paper, we introduce an autoencoder network to generate these activity indicators through spectrum modeling. The encoder compresses the information in each stellar spectrum into a small number of "latent parameters," and the decoder transforms a set of latent parameters into an accurate reconstruction of a rest-frame stellar spectrum. The architecture is designed to assign latent parameters based on spectral perturbations that are independent of the Doppler shift of the spectrum. In this respect, the latent parameters are generalizations of traditional activity indicators such as the bisector span of the CCF or the equivalent width of certain emission lines. This detrending technique is explored in more depth in Section 2.3.

2.1. Architecture

We begin with a collection of normalized one-dimensional input spectra. Each spectrum is represented by a vector ${{\boldsymbol{y}}}_{\mathrm{obs}}\in {{\mathbb{R}}}^{L}$ , giving the intensity at each of L wavelengths λ_obs. We do not model the Earth's barycentric motion, thereby implicitly assuming that the barycentric corrections have been applied without error. To prepare the input for AESTRA, we first establish a template spectrum, ${{\boldsymbol{b}}}_{\mathrm{obs}}\in {{\mathbb{R}}}^{L}$ , that captures the star's average spectral features. This template is subtracted from each observed spectrum to create a residual spectrum r _obs = ( y _obs − b _obs). The purpose of this step is to isolate the spectral time variations and reduce the dynamic range of the input. In our experiments, we somewhat arbitrarily chose one of the simulated spectra with the smallest applied perturbations to be the template. The exact choice of the template does not matter as long as it is representative of the typical spectrum. The only effect of small changes to the template is to apply small time-independent offsets to all of the residual spectra, which should not interfere with training.

After template subtraction, the residual spectra are processed through two pipelines: the encoder–decoder network and the RV estimator network.

The goal of the encoder–decoder network is to summarize spectral perturbations due to stellar activity, using a small number of parameters. The residual spectrum is compressed by a parameterized encoder f_θ( · ) into a latent vector ${\boldsymbol{s}}\in {{\mathbb{R}}}^{S}$ . Here, θ denotes the adjustable parameters in the encoder:

$\begin{eqnarray}&&{\boldsymbol{s}}={f}_{\theta }({{\boldsymbol{r}}}_{\mathrm{obs}}).\end{eqnarray} \tag{ 1 }$

Following the approach of Melchior et al. (2023), the encoder f_θ( · ) consists of three convolutional layers, an attention layer, and a three-layer Multi-Layer Perceptron (MLP). A more in-depth description of these machine-learning components can be found in Smith & Geach (2023). The attention layer is integrated to help the encoder focus on significant spectral features amidst varying positions and noise. In practice, the attention layer works by dividing the extracted feature matrix into two submatrices. The first represents the identified features across different wavelengths, while the second indicates the relative significance of these features (commonly called "attention weights"). By multiplying these matrices element-wise and then summing over wavelengths, we derive "attended" features that emphasize the most relevant spectral information.

Conceptually, we may regard the components of the latent vector as generalizations of traditional activity indicators, such as line widths or bisector spans. We chose a dimension S = 3 for the latent space as a balance between model complexity, which favors large S, and interpretability, which favors small S. This relatively low-dimensional latent space simplifies the interpretation of the latent vector as activity indicators. As an illustration of the meaning of the latent vector, Figure 2 shows the effect of perturbing each component, one at a time, using the model to be described in Section 3.3 In this case, s₁ appears to be related mainly to line widths, and s₂ and s₃ to line asymmetries.

**Figure 2.** Visualization of the effect of perturbing the latent vector based on a representative spectrum to be described in Section 3. The black curve shows the activity spectrum y _act generated from a particular choice of the latent vector s . The colored lines show the resulting activity spectrum after perturbing each component of the latent vector: perturbing s₁ (purple) appears to introduce line width changes, while perturbations in s₂ (blue) and s₃ (orange) produce features related to line asymmetries.
Download figure:
Standard image High-resolution image

The decoder function g_ϕ( · ) with adjustable parameters ϕ is another three-layer MLP that uses s to generate an activity spectrum y _act in the stellar intrinsic rest frame, represented as a vector of length $L^{\prime}$ with a slightly extended wavelength coverage ( $L^{\prime} \gt L$ ):

$\begin{eqnarray}&&{{\boldsymbol{y}}}_{\mathrm{act}}={g}_{\phi }({\boldsymbol{s}}).\end{eqnarray} \tag{ 2 }$

To encourage the decoder to focus on the activity-induced perturbations, we add a trainable rest-frame template spectrum ${{\boldsymbol{b}}}_{\mathrm{rest}}\in {{\mathbb{R}}}^{L^{\prime} }$ to the decoded activity spectrum to produce the complete rest-frame model:

$\begin{eqnarray}&&\begin{array}{l}{{\boldsymbol{y}}}_{\mathrm{rest}}={{\boldsymbol{y}}}_{\mathrm{act}}+{{\boldsymbol{b}}}_{\mathrm{rest}}.\end{array}\end{eqnarray} \tag{ 3 }$

The rest-frame template, b _rest, represents a hypothetical spectrum with zero activity ( y _act = 0). The rest-frame template is distinct from the observed-frame template described earlier. One may consider b _obs to be an initial guess and b _rest as an optimized model for the baseline stellar spectrum.

The RV estimation network h_ψ( · ), visualized in Figure 1, is a modified version of the encoder network. A key difference between the encoder and the RV estimation network is that the encoder includes an attention layer (Vaswani et al. 2017) that is designed to help the model identify the most important patterns of variability irrespective of wavelength shifts, while the RV estimator lacks such a layer.

The RV estimator begins with two convolution layers that apply convolutional filters to the input data that are trained to respond maximally to Doppler shifts. After each convolution layer, we apply a Parametric Rectified Linear Unit (PReLU) operation (He et al. 2015), which introduces nonlinearity to the network and helps to enhance the recognized features. Next, max pooling is applied, downsampling the features by retaining only the dominant features from each segment of the residual spectrum. This leaves us with a 64 × 20 matrix representing 64 output channels and 20 wavelength segments. A softmax operation is applied in the wavelength direction. This operation helps to quantify the relative importance to different segments. Then, the resulting matrix is flattened into a vector with 1280 elements. These elements are passed to an MLP with (128, 64, 32) nodes. The MLP learns to estimate Doppler shifts based on the feature vector summarized by the convolutional layers. The estimated RV is denoted v_encode:

$\begin{eqnarray}&&{v}_{\mathrm{encode}}={h}_{\psi }({{\boldsymbol{r}}}_{\mathrm{obs}}).\end{eqnarray} \tag{ 4 }$

Last, the rest-frame model spectrum y _rest is Doppler shifted using the value of v_encode determined by the RV estimator and interpolated to match the observed wavelengths ${\lambda }_{\mathrm{obs}}\in {{\mathbb{R}}}^{L}$ using cubic spline interpolation. This step accounts for the motion of the star relative to the observer and enables a direct end-to-end comparison between the observed and model spectra.

2.2. Loss Function

To ensure high-quality spectral reconstruction, we employ the end-to-end fidelity loss function of Melchior et al. (2023). This function quantifies the agreement between the input and reconstructed spectra, averaged over a batch of N spectra each of dimension L:

$\begin{eqnarray}&&\begin{array}{l}{L}_{\mathrm{fid}}=\displaystyle \frac{1}{{NL}}\displaystyle \sum _{i}^{N}{{\boldsymbol{w}}}_{i}\odot {\left({{\boldsymbol{y}}}_{\mathrm{obs},i}-{{\boldsymbol{y}}}_{\mathrm{obs},i}^{{\prime} }\right)}^{2}.\end{array}\end{eqnarray} \tag{ 5 }$

Here, y _i,obs denotes the ith input spectrum, ${{\boldsymbol{y}}}_{i,\mathrm{obs}}^{{\prime} }$ is the reconstructed spectrum, and w _i is the corresponding weight vector. The operation ⊙ represents element-wise multiplication, and the squaring of the difference is also performed element-wise.

To achieve consistent RV estimations, we introduce a novel loss term, L_RV. To encourage the RV estimator to focus exclusively on the signatures caused by Doppler shifting, we exploit data augmentation, a technique used in machine learning to increase the amount and diversity of training data by applying various transformations or manipulations to the original data set. In this case, we artificially apply a Doppler shift while preserving the spectral shape. Data augmentation allows us to generate very large samples for training the RV estimator:

$\begin{eqnarray}&&\begin{array}{l}{L}_{\mathrm{RV}}=\displaystyle \frac{1}{N}\displaystyle \sum _{i}^{N}\left[\displaystyle \frac{1}{{\sigma }_{v}^{2}}{\left({v}_{\mathrm{aug},i}-{v}_{\mathrm{obs},i}-{v}_{\mathrm{offset},i}\right)}^{2}\right]\\ {v}_{\mathrm{obs},i}={h}_{\psi }({{\boldsymbol{r}}}_{\mathrm{obs},i})\quad {v}_{\mathrm{aug},i}={h}_{\psi }({{\boldsymbol{r}}}_{\mathrm{aug},i}).\end{array}\end{eqnarray} \tag{ 6 }$

As shown in Equation (6), we use a weighted mean squared error (MSE) loss to compare the predicted RV offset between the original and augmented spectra (v_aug,i − v_obs,i) to the true injected offset (v_offset,i). During training, v_offset,i is randomly drawn from a uniform distribution ${v}_{\mathrm{offset}}\sim { \mathcal U }(-3,3)$ m s⁻¹, and σ_v represents the Cramer–Rao theoretical limiting uncertainty of a single RV measurement based only on photon noise (see, e.g., Bouchy et al. 2001; Beatty & Gaudi 2015). Because the RV estimator is trained on augmented spectra with artificial Doppler shifts, there is no need to split the data into training, validation, and test sets. During each iteration of training (commonly referred to as an "epoch" in the machine-learning literature), a new batch of training data is generated on the fly, ensuring that the model is consistently exposed to new Doppler shifts.

Last, we introduce a regularization term to constrain the decoder's flexibility and prevent overfitting. As both the activity spectrum and rest-frame baseline are trainable, adjusting b _rest by adding a constant can be compensated for by subtracting the same constant from y _act. To resolve this degeneracy between y _act and b _rest, we define a Ridge regularization loss as follows:

$\begin{eqnarray}&&\begin{array}{l}{L}_{\mathrm{reg}}=\displaystyle \frac{{k}_{\mathrm{reg}}}{{NL}}\displaystyle \sum _{i}^{N}\displaystyle \frac{{{\boldsymbol{y}}}_{\mathrm{act},i}^{2}}{{\sigma }_{y}^{2}}.\end{array}\end{eqnarray} \tag{ 7 }$

The adjustable hyperparameters k_reg and σ_y determine the relative weight of the regularization loss. In practice, we set σ_y = 0.1 to achieve a good balance between model flexibility and trainability. During optimization, we cyclically adjust the weight of regularization loss k_reg, increasing from 0 to 1 over a cycle of 1000 iterations, and then setting k_reg to 0 and restarting the cycle.

We adopted a two-phase strategy to train AESTRA models. In the first phase, the RV estimator was trained independently until L_RV approached 1, while leaving the encoder–decoder network and rest-frame baseline unchanged. In this stage, the RV estimator can accurately measure the velocity offset between spectra of the same activity state, but the RV estimates are still subject to activity-induced bias. In the second phase, we jointly optimized all components of AESTRA—the encoder–decoder network, the rest-frame baseline, and the RV estimator—using the combined loss function L_total:

$\begin{eqnarray}&&{L}_{\mathrm{total}}={L}_{\mathrm{fid}}+{L}_{\mathrm{RV}}+{L}_{\mathrm{reg}}.\end{eqnarray} \tag{ 8 }$

In summary, L_total is designed to optimize three key aspects during training: high reconstruction quality, consistent RV estimation, and shift-invariant encoding. The fidelity loss, L_fid, ensures that the reconstructed spectra closely match the input spectra. The RV loss, L_RV, promotes consistency in RV estimations by training the estimator to focus on signatures induced by stretching, using augmented data. Last, the regularization loss, L_reg, constrains the decoder's flexibility and resolves the degeneracy between the activity spectrum and rest-frame baseline, improving training efficiency.

As an aside, we initially thought it would be necessary to add a consistency loss term to encourage shift-invariant encoding, following Liang et al. (2023). This term aims to assign the same latent vector to spectra with identical activity-induced perturbations but varying Doppler shifts by minimizing the sigmoid-modulated difference between the latent positions of the original and corresponding augmented spectra:

$\begin{eqnarray}&&\begin{array}{l}{L}_{{\rm{c}}}=\displaystyle \frac{1}{N}\displaystyle \sum _{i}^{N}\mathrm{sigmoid}\left[\displaystyle \frac{1}{{\sigma }_{s}^{2}S}| {{\boldsymbol{s}}}_{i}-{{\boldsymbol{s}}}_{\mathrm{aug},i}{| }^{2}\right]-0.5.\end{array}\end{eqnarray} \tag{ 9 }$

However, we found that this additional term was not necessary for the numerical experiments we have performed. To demonstrate, we used AESTRA trained without a consistency loss term to encode augmented spectra with artificial Doppler shifts. As shown in Figure 3, the latent distance between the original and corresponding augmented spectra (red) exhibit a much lower dispersion than the latent distances between randomly selected data pairs (blue), indicating that the latent parameters that AESTRA assigns to Doppler-shifted spectra depend only weakly on the Doppler shift. We note this fact here because, although the consistency loss term was unnecessary, it does not harm model performance and might be useful when training on real data, for which spectral perturbations are undoubtedly more complex.

**Figure 3.** Latent distance distribution for randomly selected data pairs (Δ s _rand, blue), and for data-augment pairs (Δ s _aug, red), in a data set of 1000 simulated spectra influenced by random activity. Augmented spectra are obtained by artificially Doppler-shifting the corresponding original spectra. For a Doppler-shift-invariant encoder, the expected value of the augmented latent distances is zero: 〈Δ s _aug〉 = 0.
Download figure:
Standard image High-resolution image

2.3. Detrending RVs with Latent Vectors

The star's RV estimates ideally should be independent of the spectral perturbations. However, Figure 4 shows that the encoded RVs are mildly correlated with latent positions. Causal connections between activity and orbital motion are unlikely; although hot Jupiters and other close-orbiting planets may influence a star's rotation and photospheric activity through tidal or magnetic interactions, such effects are negligible for habitable-zone exoplanets, the focus of this study. Nevertheless, even if causal correlations are absent, Doppler shifts and stellar perturbations may be correlated because of similarities in timescales—if, for example, the planet's orbital period is comparable to the star's rotation period or the period of a long-term magnetic activity cycle. To disentangle these effects, data should be collected in sufficient quantities and over all relevant timescales.

But we suspect another effect is responsible for the mild trend observed in Figure 4. The RV estimator is trained to determine accurate RV offsets from pure spectral stretching, leaving the RV zero points unspecified. The encoded RVs may be affected by stellar activity and thus contain activity-dependent RV zero points v₀ that can bias our RV estimates. However, we have a trained activity autoencoder, capable of producing high-quality reconstructions with varying stellar activity. We thus consider the latents s as generalized activity indices and now seek to determine the spurious velocity zero points. We make the ansatz:

$\begin{eqnarray}&&{v}_{\mathrm{encode},i}={v}_{\mathrm{true},i}+{v}_{0}({{\boldsymbol{s}}}_{i})+\mathrm{noise}.\end{eqnarray} \tag{ 10 }$

Assuming that we have enough spectra to observe different true RVs at approximately fixed activity s , we can isolate a constant baseline:

$\begin{eqnarray}&&\begin{array}{l}{v}_{0}({{\boldsymbol{s}}}_{i})\approx \langle v{\rangle }_{i}=\displaystyle \frac{{\sum }_{j\ne i}{w}_{{ij}}{v}_{\mathrm{encode},j}}{{\sum }_{j\ne i}{w}_{{ij}}},\\ \mathrm{where}\,{w}_{{ij}}=\exp \left[-\displaystyle \frac{| {{\boldsymbol{s}}}_{i}-{{\boldsymbol{s}}}_{j}{| }^{2}}{2{\sigma }_{R}^{2}}\right].\end{array}\end{eqnarray} \tag{ 11 }$

Here, w_ij denotes the weight of the encoded RV v_encode,j for neighbor j, and σ_R denotes the characteristic radius of the latent distribution. We calculate σ_R as the average distance among the 10 nearest neighbors in the distribution. This calculation is mathematically equivalent to Gaussian smoothing in latent space. In theory, each point in the distribution should have a nonzero weight and be included in the calculation, but samples farther than 3σ_R exert negligible effects. In practice, the summation is carried out over the 5% nearest neighbors in latent space, with a minimum of 10 nearest samples.

To obtain our final estimate of the true Doppler RVs, we subtract the activity-dependent RV zero points from the encoded RVs:

$\begin{eqnarray}&&{v}_{\mathrm{correct},i}={v}_{\mathrm{encode},i}-\langle v{\rangle }_{i}.\end{eqnarray} \tag{ 12 }$

The resulting corrected RVs, v_correct,i, should be closer to the true Doppler RVs, v_true,i. We expect the efficacy of this method to increase with larger sample sizes, because we can observe the star with at different Doppler shifts in nearly the same activity state.

3. Applying `AESTRA` to Simulated Data

3.1. Generating Synthetic Spectra from SOAP CCFs

To assess the performance of AESTRA, we simulated a large number of synthetic spectra spectra affected by magnetic activity.

First, we generated a "quiet" spectrum, y _quiet, without any activity perturbations or Doppler shift. We used an evenly sampled wavelength grid ranging from 5000 to 5050 Å with a wavelength spacing of 0.025 Å,³ resulting in an observed wavelength vector λ_obs of length L = 2000.

To produce realistic absorption lines, we randomly selected 70 line locations within the observed wavelength range. For each line, we determined a random width from a Gaussian distribution r ∼ N(μ = 0.2Å, σ = 0.02Å) and a random amplitude from a uniform distribution a ∼ U(0.01, 0.70). We then calculated the quiet spectrum as the product of all absorption lines:

$\begin{eqnarray}&&\begin{array}{l}{{\boldsymbol{y}}}_{\mathrm{quiet}}=\displaystyle \prod _{k}^{K}\left(1-{a}_{k}\exp \left[-\displaystyle \frac{1}{2}\displaystyle \frac{{\left({\lambda }_{\mathrm{obs}}-{\lambda }_{k}\right)}^{2}}{{r}_{k}^{2}}\right]\right).\end{array}\end{eqnarray} \tag{ 13 }$

The spectral deformations are based on the SOAP 2.0 code (Spot Oscillation And Planet; Dumusque et al. 2014). Since SOAP 2.0 only calculates the impact of spots and plages on the CCF, rather than the entire spectrum, we needed to devise a method to perturb the synthetic spectrum using the CCFs.

We define CCF_quiet as the quiet CCF of a Sun-like star evaluated at velocity offsets Δ v _CCF, and CCF_active as the CCF affected by activity. We introduced spectral perturbations to the simulated spectrum by perturbing all the absorption lines using the same CCF_active. At each epoch, CCF_active was calculated by randomly placing four active regions on the star's surface. The activity type (spot/plage), size, and phase of each active region were randomly drawn from the priors provided in Table 1. For each realization of CCF_active, we defined an activity function f_act to measure the activity-induced perturbation as a function of velocity offset:

$\begin{eqnarray}&&\begin{array}{l}{f}_{\mathrm{act}}({\rm{\Delta }}{{\boldsymbol{v}}}_{\mathrm{CCF}}/c)=\mathrm{CubicSpline}\left(\displaystyle \frac{{\rm{\Delta }}{{\boldsymbol{v}}}_{\mathrm{CCF}}}{c},\displaystyle \frac{{\mathrm{CCF}}_{\mathrm{active}}}{{\mathrm{CCF}}_{{\rm{quiet}}}}\right).\end{array}\end{eqnarray} \tag{ 14 }$

Table 1. Priors of the SOAP Active Region Parameters

Parameter	Distribution	Prior
Activity type	Binary	P(Spot) = P(Plage) = 0.5
Size [R_⋆]	Gaussian	μ = 0.10, σ = 0.05
Longitude [°]	Uniform	0° ≤ ϕ < 360°
Latitude [°]	Uniform	−30° ≤ θ ≤ 30°

Download table as: ASCII Typeset image

Figure 5 shows a collection of CCF_active/CCF_quiet due to a single active region. To perturb a specific line, we aligned the activity function f_act with the line center λ_k and interpreted the wavelength offset as a velocity offset, given by Δλ/λ_k = Δ v _CCF/c. By incorporating the individual line perturbations, we constructed the entire intrinsic spectrum:

$\begin{eqnarray}&&\begin{array}{l}{{\boldsymbol{y}}}_{\mathrm{intrinsic}}={{\boldsymbol{y}}}_{\mathrm{quiet}}\odot \displaystyle \prod _{k}^{K}{f}_{\mathrm{act}}\left(\displaystyle \frac{{\lambda }_{\mathrm{obs}}-{\lambda }_{k}}{{\lambda }_{k}}\right).\end{array}\end{eqnarray} \tag{ 15 }$

**Figure 5.** CCF ratios (CCF_active/CCF_quiet) vs. velocity offset (Δ v _CCF) for a solar-type star with a single active region, calculated by the SOAP 2.0 code. The color of each curve indicates the brightness variations due to the darkening effect of the active region at different phases of stellar rotation. The left panels depict the effect of spots with varying sizes and latitudes, and the right panels show the impact of plages. The gray shaded area represents the approximate photon noise level corresponding to a single-spectrum RV precision of 0.3 m s⁻¹.
Download figure:
Standard image High-resolution image

For the recovery test, we Doppler-shifted the perturbed intrinsic spectrum according to the true RV and interpolated it onto the grid of observed wavelengths using a cubic spline function. To simulate photon noise, we added independent Gaussian random numbers to the spectrum:

$\begin{eqnarray}&&\begin{array}{l}{{\boldsymbol{y}}}_{\mathrm{obs}}=\mathrm{DopplerShift}({{\boldsymbol{y}}}_{\mathrm{intrinsic}},{v}_{\mathrm{true}})+\mathrm{noise}.\end{array}\end{eqnarray} \tag{ 16 }$

The true planetary RV signal at time t was taken to be ${v}_{\mathrm{true}}(t)=K\sin \left(2\pi t/P\right)$ , where K is the velocity semi-amplitude and P is the orbital period. For the following tests, we adjusted the photon-noise level of the simulated spectra to correspond to a Cramer–Rao theoretical limit of 0.3 m s⁻¹ on the precision of a single RV measurement. The value of 0.3 m s⁻¹ was chosen to be comparable to the anticipated capabilities of frontier EPRV instruments.

3.2. Defining RV Estimates for Model Assessment

To evaluate the performance of the AESTRA model, we track four radial velocity estimates.

Apparent RV (v_apparent): The apparent RV is computed by finding the Doppler shift that brings the quiet baseline spectrum and the observed spectrum into the best possible agreement (minimum χ²). Affected by stellar activity, this metric provides insight into the RV noise level due to line distortions.

Detrended RV using traditional activity indicators (v_traditional): This quantity represents the traditional technique for activity mitigation whereby the apparent RVs are corrected to control for the observed variations in activity indicators that are derived from the spectra. We use a quiet spectrum ( y _quiet) as a template to extract the CCF of the observed, activity-perturbed spectrum ( y _obs). Three activity indicators are measured: the bisector span, the full width at half maximum (FWHM) of the CCF, and the depth of the CCF. Following Dumusque et al. (2014), the bisector span is defined as the difference between the top (10%–40% depth) and bottom (60%–90% depth) bisectors. Multilinear regression is performed between the time series of v_apparent and the time series of the three activity indicators, and the results are used to subtract the estimated contributions to the RV variations due to changes in the activity indicators.

Corrected RV (v_correct): This quantity is derived from the AESTRA model by detrending the output of the RV estimator (v_encode) against the components of the latent vector, i.e., the generalized activity indicators (see Section 2.3 for details). These values can be used to assess the model's performance in isolating true Doppler RVs from activity-induced RV offsets.

Reference RV (v_ref): Serving as a benchmark, the reference RV is calculated by fitting the noise-free underlying rest-frame spectrum ( y _intrinsic) to the observed spectrum ( y _obs). The scatter in the reference RVs around the true RVs is slightly higher than the Cramer–Rao bound of 0.30 m s⁻¹, probably due to interpolation imperfections. This serves as a reference for the best possible RV precision in the absence of stellar activity.

For the ith input spectrum, let v_i denote the RV measurement and v_true,i denote the true planetary RV. We define RV scatter as the standard deviation of the residual RV (Δv_i = v_i − v_true,i):

$\begin{eqnarray}&&{({\rm{\Delta }}v)}_{\mathrm{std}}\equiv \sqrt{\tfrac{1}{{N}_{\mathrm{spec}}}\displaystyle \sum _{i}{\left({\rm{\Delta }}{v}_{i}-{({\rm{\Delta }}v)}_{\mathrm{avg}}\right)}^{2}}.\end{eqnarray} \tag{ 17 }$

To assess the model's performance, we compare v_correct to both v_traditional and v_ref. An effective model should yield values for v_correct that approach the benchmark values of v_ref, and an advantageous model should reduce the RV scatter below the scatter in v_traditional.

3.3. Case I: Time-independent Activity

Our overall goal was to see if AESTRA could recover 0.10 m s⁻¹ Earth-like signals in the presence of RV jitter that is approximately 30 times larger in amplitude. First, we considered spectral perturbations that are randomly and independently assigned to each spectrum. We generated 1000 synthetic spectra using SOAP CCFs, with each spectrum affected by four active regions randomly drawn from the priors in Table 1, without any temporal structure imposed. The apparent RVs, derived by fitting a quiet baseline to the simulated spectra, are predominantly influenced by random stellar activity.

We trained an AESTRA model using a two-phase strategy as described in Section 2.2.⁴ Table 2 summarizes the training results, and Figure 6 illustrates an example spectrum and its reconstruction. The AESTRA model achieved high reconstruction quality and RV consistency, with L_fid ≈ 1 and L_RV < 1. The fact that the self-reported RV precision was better than the photon noise limit suggests that the RV estimator performed well and was not the limiting factor in the RV determinations. Instead, the achievable RV precision was mainly limited by the decorrelation process.

**Figure 6.** Simulated spectrum of a star affected by multiple active regions and photon noise ( y _obs, gray), its reconstruction ( ${\boldsymbol{y}}{{\prime} }_{\mathrm{obs}}$ , red), and the noise-free underlying spectrum (black). In the top panel, y _obs and ${\boldsymbol{y}}{{\prime} }_{\mathrm{obs}}$ are difficult to distinguish because they overlap nearly perfectly. The bottom panel displays the corresponding residual spectra, with the gray spectrum predominantly showing photon noise.
Download figure:
Standard image High-resolution image

**Figure 7.** Top: Time series of apparent RV (black, left y-axis) and the fraction of active area on the stellar surface (orange, right y-axis), dominated by random stellar activity. Middle: The corrected model RVs (black) are compared to the 0.1 m s⁻¹ true planetary signal (red). Bottom: The phase-folded and rebinned model RVs (black) are in good agreement with the true planetary signals (red).
Download figure:
Standard image High-resolution image

Table 2. Model Performance on Simulated Data Sets with Random Stellar Activity (Case I) and Starspot Evolution (Case II)

Model	N _spec	K [m s⁻¹]	L _fid	L_RV	L_reg
Case I	1000	0.10	1.007	0.587	0.025
Case I	100	0.30	0.973	0.559	0.034
Case II	200	0.30	0.982	0.604	0.025

Note. L_reg is evaluated assuming k_reg = 1.

Download table as: ASCII Typeset image

Table 3 summarizes the performance of different methods for extracting RVs from the simulated data set of 1000 spectra. The method involving traditional activity indicators managed to reduce the apparent RV scatter from 3.38 to 0.98 m s⁻¹. The AESTRA model achieved a further reduction of the RV scatter to 0.46 m s⁻¹, closely approaching the RV benchmark for this data set, as illustrated in Figure 7. Figure 8 presents Lomb–Scargle periodograms of the RVs detrended by traditional activity indicators (v_traditional) and those corrected by AESTRA (v_correct). While the traditional analysis yields a peak at the true period, there are also spurious peaks with similar amplitudes. The periodogram of the AESTRA-corrected RVs shows a more prominent peak, allowing for a clearer detection of 0.10 m s⁻¹ planetary signal.

**Figure 8.** Top: Lomb–Scargle periodogram of the apparent RVs (v_apparent, without any correction) affected by four random active regions (spots and plages). Middle: Detrended RVs using traditional indicators (v_traditional) reveal a subtle peak at the true period (red dashed line). Bottom: `AESTRA-`corrected RVs (v_correct) show a prominent peak at the correct period.
Download figure:
Standard image High-resolution image

Table 3. RV Estimates and Parameter Constraints through MCMC Sampling

	Case I		Case II
RV [m s⁻¹ ]	N = 1000	N = 100	N = 200
${({\rm{\Delta }}{v}_{\mathrm{apparent}})}_{\mathrm{std}}$	3.38	3.27	3.51
${({\rm{\Delta }}{v}_{\mathrm{traditional}})}_{\mathrm{std}}$	0.98	0.96	0.98
${({\rm{\Delta }}{v}_{\mathrm{correct}})}_{\mathrm{std}}$	0.46	0.73	0.44

Reference RV

${({\rm{\Delta }}{v}_{\mathrm{ref}})}_{\mathrm{std}}$	0.39	0.38	0.41

Parameters

Period [day]	${100.0}_{-0.26}^{+0.27}$	${115.1}_{-7.8}^{+7.6}$	${101.9}_{-1.3}^{+1.3}$
K [m s⁻¹ ]	${0.12}_{-0.01}^{+0.01}$	${0.36}_{-0.07}^{+0.08}$	${0.29}_{-0.03}^{+0.03}$
Phase [°]	${5.0}_{-12.0}^{+12.0}$	${70.0}_{-27.0}^{+25.0}$	${37.0}_{-13.0}^{+13.0}$

Download table as: ASCII Typeset image

To further evaluate the model's performance in recovering planetary signals, we performed Markov Chain Monte Carlo (MCMC) sampling using the emcee package (Foreman-Mackey et al. 2013) to retrieve the orbital parameters from the corrected RV v_correct. Currently, we set the uncertainty for model RVs to be the same as the RV scatter ${({\rm{\Delta }}{v}_{\mathrm{correct}})}_{\mathrm{std}}$ , as obtaining principled uncertainties directly from AESTRA remains a task for the future. We intend to explore alternative approaches, such as noise propagation through the network or least-squares fit error, in the future development of AESTRA.

Assuming circular orbits for simplicity, we assigned uniform priors for the orbital period between 10 and 200 days, semi-amplitude between 0.0 and 10.0 m s⁻¹, and phase between −180 and 180°. We initialized 32 walkers near the maximum a posteriori (MAP) solution and ran the MCMC sampler for 10,000 steps with a 2000 step burn-in. As shown in Figure 9, the RVs measured by AESTRA constrained the orbital period to be 100.0 ± 0.27 days and the signal amplitude to be 0.12 ± 0.01 m s⁻¹. The full parameter constraints are summarized in Table 3.

**Figure 9.** Corner plot illustrating the posterior distribution of the orbital parameters for 1000 synthetic spectra affected by random activity, with truth values indicated by blue lines.
Download figure:
Standard image High-resolution image

To assess AESTRA's performance in recovering planetary signals with different amplitudes, we created new data sets using the same activity perturbations but varying injected signal amplitudes (K values). The model, trained on the original data set with 0.1 m s⁻¹ planetary signals, was applied to recover the signal amplitudes in the new data sets. Figure 10 shows that the model accurately recovered weaker signals and slightly underestimated amplitudes > 1 m s⁻¹ . This probably occurred because planetary RVs are treated as noise during detrending, causing stronger signals to introduce more uncertainty in activity-dependent RVs.

**Figure 10.** Recovered vs. true planetary signal amplitudes. The model accurately recovers small-amplitude planetary signals but tends to underestimate amplitudes > 1 m s⁻¹.
Download figure:
Standard image High-resolution image

Deep-learning methods generally require large data sets, making it important to understand how AESTRA performance scales with sample size. We generated a smaller data set of 100 spectra affected by random stellar activity and injected a planetary signal of 0.30 m s⁻¹, comparable to the RV scatter caused by pure photon noise. The apparent RV scatter in the N_spec = 100 data set was 3.27 m s⁻¹. We followed the same procedure to train the AESTRA model. The training performance and resulting RV estimates are detailed in Tables 2 and 3.

For the 100 spectrum simulated data set, the best-fit AESTRA model reduced the RV scatter from 3.27 to 0.73 m s⁻¹, lower than the scatter of 0.96 m s⁻¹ achieved by the method involving traditional activity indicators. While the improvement was not as good as it was in the 1000 spectrum data set, the AESTRA-corrected RVs still allowed for a 5σ detection of the injected planetary signal and gave a correct estimate of the RV semi-amplitude, $K={0.36}_{-0.07}^{+0.08}$ m s⁻¹.

We believe that, for small sample sizes, the primary source of RV uncertainty in v_correct arises from the decorrelation process, which relies on Gaussian smoothing in the latent space to estimate the activity-dependent RV zero points. If the latent space is underpopulated, i.e., if a given type of stellar perturbation is not observed at different Doppler shifts, it becomes challenging to calibrate the RV zero points. This highlights the importance of capturing potential stellar activity perturbations through repeated observations.

3.4. Case II: Time-dependent Activity

Next, we performed simulations in which stellar activity exhibited realistic temporal correlations. We generated a data set of 200 synthetic spectra affected by evolving starspots using the SOAP 2.0 code, incorporating the time-structured nature of starspot growth, decay, and rotation, as detailed below. A planetary signal with amplitude 0.30 m s⁻¹ was injected into the data set for recovery testing.

As before, we placed four spots on the stellar surface, with new spots generated at random locations as old spots decayed. Each spot's size evolved over time, following a Gaussian function:

$\begin{eqnarray}&&\begin{array}{l}S(t)={S}_{\max }\exp \left[-\displaystyle \frac{1}{2}{\left(\displaystyle \frac{t-{t}_{\mathrm{peak}}}{\tau }\right)}^{2}\right],\end{array}\end{eqnarray} \tag{ 18 }$

where S(t) represents the spot's size at time $t,{S}_{\max }$ is the maximum spot size, chosen randomly between 0.5% and 20% stellar radius, t_peak is the time when the spot reaches its maximum size, and τ is the spot's lifetime, determined by $\tau \propto {S}_{\max }^{2}$ , with ${\tau }_{\max }\sim {P}_{\mathrm{rot}}$ for the largest spots. The spots' longitudes and latitudes were randomly chosen based on the priors described in Table 1. The growth and decay of starspots results in distinct time structures in the apparent RVs, as shown in the top panel of Figure 11.

**Figure 11.** Top panel: Time series of apparent RVs (black, left y-axis) and the percentage of the stellar surface covered by active regions (orange, right y-axis). Middle panel: Corrected model RVs (black) compared to injected Doppler RVs (red). Bottom panel: Phase-folded, rebinned corrected model RVs (black error bars), showing good agreement with the true RVs (red).
Download figure:
Standard image High-resolution image

Training the AESTRA model on this data set led to a reduction in the apparent RV scatter from 3.51 to 0.44 m s⁻¹, as presented in Table 3. The AESTRA-based RV scatter of 0.44 m s⁻¹ is about one-half of the scatter obtained with the traditional method, and leads to a solid detection of the 0.3 m s⁻¹ planetary signals in the phase-folded RVs shown in Figure 11.

Both the AESTRA model and the traditional analysis identified significant peaks at the true planetary period in their Lomb–Scargle periodograms (Figure 12). However, the traditional analysis gave K = 0.52 ± 0.09 m s⁻¹, an overestimate of the true RV semi-amplitude. The AESTRA model gave a correct estimate, K = 0.29 ± 0.03 m s⁻¹.

**Figure 12.** Lomb–Scargle periodograms illustrating the impact of complex spot evolution on apparent RVs (top), RVs detrended using traditional activity indicators (middle), and the `AESTRA`-corrected RVs (bottom). The true planetary period (red) is detected by both methods.
Download figure:
Standard image High-resolution image

As summarized in Table 3, when the AESTRA RVs were fitted with a circular-orbit model using an MCMC method, the posteriors for the orbital parameters overlapped with the true values, despite the presence of complex and time-structured stellar activity. These results also underscore the importance of good temporal coverage for robust RV estimation.

4. Conclusions and Outlook

AESTRA is a deep-learning method designed to separate Doppler shifts from spectral perturbations due to stellar activity. It operates on a collection of spectra of the same star rather than a single spectrum or a cross-correlation function. The autoencoder network summarizes spectral perturbations using a small number of latent parameters, yielding a compact representation that can be utilized as generalized activity indices. The RV estimator network is jointly trained with the autoencoder to determine RV offsets. Training is performed using a combined loss function that balances the various objectives, resulting in a model that provides rest-frame spectral reconstruction and RV estimation.

In our simulations, AESTRA successfully detected planetary signals with amplitudes as low as 0.10 m s⁻¹, despite activity-induced perturbations with amplitudes of 3 m s⁻¹ and photon noise of 0.30 m s⁻¹. Additionally, AESTRA showed encouraging performance even on data sets consisting of as few as 100 spectra. The model recovered planetary signals in a series of simulated spectra dominated by time-dependent stellar activity, even without utilizing time-domain information.

While AESTRA has shown promising results on synthetic spectra, there is no guarantee that it will be effective on real data. We plan to apply AESTRA to analyze solar spectra obtained with instruments such as HARPS-N (Dumusque et al. 2015), NEID (Lin et al. 2022), and EXPRES (Petersburg et al. 2020). This validation step will assess the model's real-world applicability and will undoubtedly reveal limitations or biases that are not apparent in synthetic data. This may provide valuable insights into solar variability across different timescales and set the stage for applications to other stars, with and without known planets.

AESTRA is at an early stage of development. We intend to explore some potential extensions to improve the model's performance. Currently, AESTRA treats each spectrum independently, not considering the times of observation. In our tests, AESTRA performed well in the presence of stellar activity perturbations that were correlated in time (Case II), but we expect that it would be even better if AESTRA took time-domain information into account. The model could learn the temporal behavior of long-lived starspots, differential rotation, and acoustic oscillations, which exhibit distinct timescales. Future iterations of the AESTRA model could benefit from integrating a continuous latent space time-series model and applying physically motivated constraints on the timescales associated with different types of activity.

More broadly, the current version of AESTRA is a general model that does not incorporate any physical knowledge about stellar atmospheres or even any built-in concept of a spectral absorption line. It seems likely that integrating known physical processes and relationships that affect stellar spectra into AESTRA will help it cope with more challenging and realistic data sets. For example, future iterations could make use of models for telluric absorption lines and for variations in the point-spread function of the spectrograph.

If AESTRA can be successfully validated using solar data, perhaps after incorporating some of the improvements described above, the stage will be set for applications to exoplanet data. We will explore AESTRA's potential in improving exoplanet characterization, analyzing low signal-to-noise candidates, and searching for Earth-like planets.

As we explore AESTRA's potential in our future work, the code will be made publicly available on GitHub (https://github.com/yanliang-astro/aestra), including a frozen version tagged "paper-I" for reproducing the results of this work.

Acknowledgments

We thank Andrew Howard and David Hogg for constructive feedback and support.

Software: spender (Melchior et al. 2023), SOAP (Dumusque et al. 2014), emcee (Foreman-Mackey et al. 2013), astropy (Robitaille et al. 2013; Astropy Collaboration et al. 2018, 2022).

AESTRA: Deep Learning for Precise Radial Velocity Estimation in the Presence of Stellar Activity

Article metrics

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction