The CARMENES search for exoplanets around M dwarfs A deep transfer learning method to determine T eff and [M/H] of target stars ⋆

The large amounts of astrophysical data being provided by existing and future instrumentation require efficient and fast analysis tools. Transfer learning is a new technique promising higher accuracy in the derived data products, with information from one domain being transferred to improve the accuracy of a neural network model in another domain. In this work, we demonstrate the feasibility of applying the deep transfer learning (DTL) approach to high-resolution spectra in the framework of photospheric stellar parameter determination. To this end, we used 14 stars of the CARMENES survey sample with interferometric angular diameters to calculate the effective temperature, as well as six M dwarfs that are common proper motion companions to FGK-type primaries with known metallicity. After training a deep learning (DL) neural network model on synthetic PHOENIX-ACES spectra, we used the internal feature representations together with those 14+6 stars with independent parameter measurements as a new input for the transfer process. We compare the derived stellar parameters of a small sample of M dwarfs kept out of the training phase with results from other methods in the literature. Assuming that temperatures from bolometric luminosities and interferometric radii and metallicities from FGK+M binaries are sufficiently accurate, DTL provides a higher accuracy than our previous state-of-the-art DL method (mean absolute differences improve by 20 K for temperature and 0.2 dex for metallicity from DL to DTL when compared with reference values from interferometry and FGK+M binaries). Furthermore, the machine learning (internal) precision of DTL also improves as uncertainties are five times smaller on average. These results indicate that DTL is a robust tool for obtaining M-dwarf stellar parameters comparable to those obtained from independent estimations for well-known stars.


Introduction
The determination of photospheric stellar parameters in M dwarfs has always been challenging. M dwarfs are smaller, cooler, and fainter than Sun-like stars. Because of their faintness and their higher stellar activity, with sometimes stronger magnetic fields, stronger line blends, and the lack of a true continuum, well-established photometric and spectroscopic methods are brought to their limits. In the literature, there are several methods to estimate M-dwarf photospheric parameters, such as effective temperature (T eff ), surface gravity (log g), and metallicity ([M/H]); for example, spectroscopic indices (see Rojas-Ayala et al. 2012), photometric relations (see Dittmann et al. 2016; Houdebine et al. 2019), interferometry (see Boyajian et al. 2012; von Braun et al. 2014; Rabus et al. 2019), synthetic model fits (see Passegger et al. 2018; Marfil et al. 2021), and machine learning (ML; see Antoniadis-Karnavas et al. 2020; Passegger et al. 2020).
One method considered to be relatively precise is calibration with M dwarfs that have a late F, G, or early K common proper motion companion with known metallicity. Many of the relations mentioned below were calibrated using FGK+M multiple systems (e.g., Newton et al. 2014). As a representative example, Mann et al. (2013a) identified spectral features sensitive to metallicity in low-resolution optical and near-infrared (NIR) spectra of 112 late-K to mid-M dwarfs in multiple systems with earlier companions, from which they derived different metallicity calibrations. The same relations were used by Rodríguez Martínez et al. (2019) to determine metallicity from mid-resolution K-band spectra for 35 M dwarfs of the K2 mission. Other photometric calibrations using FGK+M binary systems were presented by Bonfils et al. (2005), Casagrande et al. (2008), Johnson & Apps (2009), Schlaufman & Laughlin (2010), and Neves et al. (2012), among others, while several spectroscopic calibrations were explored by Rojas-Ayala et al. (2010), Dhital et al. (2012), Terrien et al. (2012), Mann et al. (2014), Mann et al. (2015), and, more recently, Montes et al. (2018).
Fundamental stellar parameters can also be derived from interferometric measurements. However, only a limited number of late-type dwarfs are accessible for such observations because they must be bright and nearby. Boyajian et al. (2012) presented interferometric angular diameters for 26 K and M dwarfs measured with the CHARA array and for 7 K and M dwarfs from the literature. With parallaxes and bolometric fluxes, these authors computed the absolute luminosity (L), radii (R), and T eff . They also calculated empirical relations for K0 to M4 dwarfs to connect T eff , R, and L to a broadband color index and iron abundance [Fe/H]. On the other hand, Maldonado et al. (2015) estimated T eff from pseudo-equivalent widths (pEWs) of temperature-sensitive lines calibrated with interferometric T eff from Boyajian et al. (2012) and metallicities from pEWs calibrated with the relations of Neves et al. (2012). Maldonado et al. (2015) constructed a mass-radius relation using interferometric radii (von Braun et al. 2014) and masses from eclipsing binaries (Hartman et al. 2015). From this, they calculated log g. Other studies that derived M-dwarf T eff from angular diameters include, for example, those of Ségransan et al. (2003), Demory et al. (2009), von Braun et al. (2014), and Newton et al. (2015). Of these, Ségransan et al. (2003) also determined log g from their measured masses and radii.
Different approaches have been taken to estimate photospheric stellar parameters for M dwarfs in general, mainly within the paradigm of comparing measured line fluxes with theoretical ones calculated from different sets of synthetic spectra. Although different algorithms using χ²-minimization or principal component analysis have been employed, several artificial intelligence techniques have also been proposed. For these, the differences between observation and theory are not evaluated at the level of individual lines, but are based on whole spectral regions (Kielty et al. 2018; Bialek et al. 2020; Minglei et al. 2020; Passegger et al. 2020). Indeed, some comparisons of techniques regarding the estimation of stellar parameters have already been carried out (Passegger et al. 2022). However, there are still several open questions related to the uncertainty of parameter estimation due to the signal-to-noise ratio (S/N) of the flux signal, and due to the "synthetic gap" (Tabernero et al. 2022), which is the difference in feature distribution between theoretical and observed spectra. The impact of the synthetic gap can be appreciated in Figs. 1 and 2 as differences with respect to high-S/N, high-resolution CARMENES spectra, especially for faint lines for which parameters are not yet well constrained. In Figs. 1 and 2, the sequence of spectra is ordered according to the T eff and metallicity of Schweitzer et al. (2019), which are not identical to the parameters estimated from interferometry or binary companions, as shown in the figures. This mismatch is an effect caused by the synthetic gap. Another effect can be seen in the flux differences between observed and synthetic spectra in the bottom panels of Figs. 1 and 2 (zoomed-in spectra). O'Briain et al. (2020) presented an interesting demonstration of the spectra transfer process, although their focus was rather different from ours.
These authors showed that transferred spectra can reduce the synthetic gap from the pure physical models, which is further evidence of the value of transfer technologies.
The spatial dimension of the features (i.e., the number of flux points within the wavelength window) depends on the size of the flux range, but it is usually very high (e.g., 3500 dimensions in the case of Figs. 1 and 2). Therefore, specific techniques are needed to project a spectrum from such a high-dimensional space into a lower one, while preserving inter-distances that help to better understand the topology. To this end, Passegger et al. (2020) introduced a technique to visualize the relative positions of a set of spectra in 2D Euclidean space, the uniform manifold approximation and projection (UMAP; McInnes et al. 2018). The main purpose of these projections is to illustrate the difference in feature distribution between synthetic and observed spectra, that is, the so-called synthetic gap. In the left panel of Fig. 3, different theoretical PHOENIX spectra are projected along with high-S/N, high-resolution, telluric-subtracted spectra observed with CARMENES (Calar Alto high-Resolution search for M dwarfs with Exo-earths with Near-infrared and optical Échelle Spectrographs, https://carmenes.caha.es). To make the synthetic spectra comparable to the observed ones, before plotting we included continuum normalization and instrumental and rotational broadening. However, no noise was added, as it was shown in Fig. 4 of Passegger et al. (2020) that adding only noise has a negligible effect on the projection. In this representation, no stellar parameters are involved, and the UMAP only depends on the flux values of every spectrum. The theoretical feature map only partially covers the CARMENES range (green circles), with a significant part of the spectra projected far away from them. Some patterns emerge when additional information, such as a color code for T eff , is incorporated into the UMAP, as illustrated in both panels of Fig. 3, independent of the source of T eff : theoretical (left) or interferometric (right).
In this work, to reduce the uncertainties associated with the synthetic gap and therefore enable a more reliable estimation of stellar parameters, we propose a way to bridge the synthetic gap and transfer the knowledge from flux signals with parameters measured via interferometry and FGK+M systems to the features derived from the theoretical models used with deep learning (DL). Such an approach is known as deep transfer learning (DTL; Tan et al. 2018a; Awang Iskandar et al. 2020; Wei et al. 2020). For T eff , we transfer knowledge gained from interferometrically determined T eff for a few stars to the rest of the CARMENES spectra, while for metallicity we transfer knowledge gained from the spectral synthesis of FGK stars. However, this DTL technique requires a significant amount of data, which is problematic because of the limited number of high-resolution spectra for stars fulfilling those conditions. This is even worse when data-based modeling techniques are used, as they require a methodology to assess the quality of the created model when applied to stars not used during the training phase. Despite these limitations, we show that the proposed technique is valid and that its accuracy will increase as more stars with independent estimates of their parameters are incorporated.
In this paper, we use DTL to determine new T eff and [M/H] for 286 M dwarfs from the CARMENES survey (Quirrenbach et al. 2020), and compare our results with the literature. As our technique is based on our previous work on DL, we refer to Passegger et al. (2020) for further information. The basic workflow of the DTL can be summarized as follows: (1) train DL models on a large set of synthetic model spectra, (2) extract the internal feature representations, (3) train DTL models based on the external knowledge about stellar parameters that is transferred to the neural network, and (4) calculate stellar parameter estimations for the stars. In Sect. 2 we explain the DTL procedure and our artificial neural network (ANN) architecture. Section 3 describes the values obtained from the literature (interferometry and FGK+M systems) for each parameter that we used for training the ANN, the stellar sample, and the application of our ANN. The derived stellar parameters are presented in Sect. 4, together with a literature comparison and discussion. Finally, in Sect. 5 we provide a short summary.
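The four-step workflow can be sketched schematically as follows. This is a minimal, framework-free toy: all arrays are mock data, the random projection stands in for the trained convolutional feature extractor, and the linear head stands in for the DTL network; none of these names or shapes come from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Stand-in for a DL model trained on synthetic spectra: a fixed random
#     projection plays the role of the learned feature extractor.
W_feat = rng.normal(size=(3500, 64))          # 3500 flux points -> 64 features

def extract_features(spectra):
    """(2) Internal feature representation from the (frozen) DL model."""
    return np.tanh(spectra @ W_feat)

# (3) Fit a simple linear DTL head on stars with independent parameters
#     (e.g., interferometric T_eff), via least squares on the frozen features.
ref_spectra = rng.normal(size=(14, 3500))     # 14 mock reference "spectra"
ref_teff = rng.uniform(3000.0, 4000.0, size=14)  # mock reference temperatures
X = extract_features(ref_spectra)
head, *_ = np.linalg.lstsq(X, ref_teff, rcond=None)

# (4) Predict stellar parameters for the survey sample.
survey = rng.normal(size=(286, 3500))
teff_pred = extract_features(survey) @ head
print(teff_pred.shape)  # one T_eff estimate per star
```

With mock inputs the predicted values are of course meaningless; the sketch only shows how the frozen feature extractor decouples steps (1)-(2) from the transfer fit in steps (3)-(4).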

Methods
The aim of ML is to automatically discover rules that must be followed in order to efficiently map input data to a desired output. In this process, it is essential to create appropriate representations of the data. These representations are task-dependent and may vary according to the final task that the selected ML algorithm is going to perform. DL is a subfield of ML in which a hierarchical representation of the data is created, and it has received increasing attention in recent years in light of its successful application to numerous real-world problems (e.g., virtual assistants, visual recognition, fraud detection, machine translation, medical image analysis, photo descriptions; see Karpathy & Fei-Fei 2015 and many others). The higher levels of the hierarchy are formed by the composition of representations of the lower levels (Passegger et al. 2020). More importantly, this hierarchy of representations is automatically learned from the data by completely automating the most crucial step in ML, namely feature engineering. Automatically learning features at multiple levels of abstraction allows a system to learn complex representations mapping the input to the output directly from the data, without completely depending on human-crafted features. The word "deep" refers to the multiple hidden layers used to obtain those representations. In this sense, DL can also be called hierarchical feature engineering (Sarkar et al. 2018).

[Figure caption fragment: stars from Table 1 (green triangles, labeled); colored circles represent their closest interpolated best-fit PHOENIX model, using the Schweitzer et al. (2019) parameter estimations as a reference.]
Data dependence is one of the most serious issues in DL, which is extremely dependent on massive training data sets compared to traditional ML methods. Although the amount of data needed depends on the type of model, the required accuracy, and the complexity of the model, all these factors can lead to the requirement for large data sets. Therefore, an intrinsic and unavoidable problem has always been insufficient training data. Data collection is complex and expensive, making the generation of large-scale, high-quality annotated data sets extremely difficult. Therefore, techniques that can work with data sets of limited size are of great value, as is the case for the DTL technique.

Fig. 4: Representation of the domains and tasks that the DTL process can be applied to.

Deep transfer learning
It has become increasingly common in various domains, such as image recognition and natural language processing, to pre-train the entire model on a data-rich task (Kraus & Feuerriegel 2017; Gao & Mosalam 2018; Raffel et al. 2020; Han et al. 2021). Ideally, this pre-training process causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. Goodfellow et al. (2016) referred to transfer learning (TL) in the context of generalization. These latter authors defined TL as the situation where what has been learned in one setting is exploited to improve generalization in another setting. Therefore, TL provides a robust and practical solution to leverage information from one domain to improve the accuracy of a model built for a different domain (Vilalta 2018). Pan & Yang (2010) proposed a more precise definition of TL, starting by defining a domain and a task, respectively. A domain can be represented by D = {χ, P(X)}, which contains two parts: the feature space χ and the marginal probability distribution P(X), where X = {x_1, ..., x_n} ∈ χ. The task can be represented by T = {y, f(x)} and consists of two parts: a label space y and a target-prediction function f(x). This function f(x) can also be regarded as a conditional probability function P(y|x). Then, given a learning task T_t based on D_t (where the subscript t refers to "transferred"), TL is designed to improve the performance of a predictive function f_T(·) in learning the task T_t by discovering and transferring latent knowledge from another domain D_s and learning task T_s (where the subscript s refers to "source", which in our case is the PHOENIX-ACES synthetic models), where D_s ≠ D_t and/or T_s ≠ T_t. Usually, the size of the source domain D_s is much larger than the size of the transferred domain D_t (i.e., N_s ≫ N_t). Based on the previous definitions, a DTL task is defined as a TL task where the predictive function f_T(·) is a nonlinear function involving a deep ANN.
TL relaxes the hypothesis that the training data must be independent of and identically distributed with the test data, which motivates the use of TL for the problem of insufficient training data.
The popularity of DL has led to many different DTL methods, and several authors have proposed classifications of them (Tan et al. 2018b; Zhao et al. 2021). Common categories involve instance-, mapping-, network-, and adversarial-based TL. Each of these categories has its particular applicability, considering the specific context and characteristics of the domains and tasks ⟨D_s, T_s, D_t, T_t, f_T(·)⟩. In our case, due to the characteristics of the problem, we selected the network category to implement our DTL. The relationship between domains and tasks is illustrated in Fig. 4. Network-based DTL refers to reusing part of the network pre-trained in the source domain, including its network structure and connection parameters, and transferring it to be a part of the deep neural network used in the target domain. The main assumption is that the features identified in the source domain remain valid in the transfer domain, whereas f_T(·) requires adaptation. As a result, we kept the original distribution of fully connected layers that we applied in the DL analysis of Passegger et al. (2020). Moreover, Passegger et al. (2020) constructed several neural network models for different spectral regions, finding that the results for all regions are comparable, but that the region 8800-8835 Å gives the smallest validation error. Therefore, we also adopted this strategy and only use that region. In this sense, we rely on the DL models already trained in Passegger et al. (2020) and extract the internal feature representations for each star in order to use them here as a new input for the DTL process. The DL models were trained on a grid of 449 806 synthetic PHOENIX-ACES spectra, after removing unphysical stellar parameter combinations not corresponding to main-sequence stars using the PARSEC v1.2S evolutionary models (Bressan et al. 2012; Chen et al. 2014, 2015; Tang et al. 2014).
The feature representation was taken from the flattened layer of the DL model, as represented in Fig. 5.

DTL training and testing
To build the transfer domain D_t for T eff, we started with 14 stars of the CARMENES survey with interferometric angular diameters θ_LD measured by Boyajian et al. (2012), von Braun et al. (2014), and references therein. We did not use the derived T eff from these publications. Instead, we used updated bolometric fluxes S at Earth based on the most recent photometry (in particular, Gaia photometry for the optical passbands) collected by Cifuentes et al. (2020) to calculate our reference T eff with the distance-independent form of the Stefan-Boltzmann law:

T eff = [4 S / (σ θ_LD²)]^(1/4) . (1)

As Cifuentes et al. (2020) did not actually tabulate the bolometric flux at Earth S, we used the tabulated bolometric luminosities L and the distances d that were used in measuring L, and calculated S via S = L/(4πd²). All used and derived values (L, d, S, θ_LD, and T eff) are listed in Table 1.
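The conversion from tabulated luminosity and distance to a reference T eff takes only a few lines; the numerical inputs below are purely illustrative placeholders, not values from Table 1.

```python
import math

SIGMA_SB = 5.670374e-8           # Stefan-Boltzmann constant [W m^-2 K^-4]
L_SUN = 3.828e26                 # nominal solar luminosity [W]
PC = 3.0857e16                   # parsec [m]
MAS = math.radians(1.0 / 3.6e6)  # milliarcsecond [rad]

def teff_from_interferometry(L_lsun, d_pc, theta_mas):
    """T_eff from the distance-independent Stefan-Boltzmann law:
    S = L / (4 pi d^2),  T_eff = [4 S / (sigma theta_LD^2)]^(1/4)."""
    S = L_lsun * L_SUN / (4.0 * math.pi * (d_pc * PC) ** 2)
    theta = theta_mas * MAS
    return (4.0 * S / (SIGMA_SB * theta ** 2)) ** 0.25

# Illustrative (hypothetical) inputs: L = 0.005 L_sun, d = 5 pc,
# theta_LD = 0.8 mas
print(teff_from_interferometry(0.005, 5.0, 0.8))
```

Note that the distance cancels conceptually: S scales as d^-2 while θ_LD scales as d^-1, so T eff depends only on the surface flux.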
Because our T eff derived from Eq. 1 for Gl 15A was in severe disagreement with the T eff listed in Boyajian et al. (2012), we realized that for this bright star, as well as for the bright star Gl 411, the luminosities in Cifuentes et al. (2020) were missing reliable J-band photometry. Therefore, we revised the luminosity determinations of Cifuentes et al. (2020) with reliable J-band photometry as well as the Gaia Early Data Release 3 (EDR3) data. Gl 205 was not included in Cifuentes et al. (2020), but its L was derived in the same fashion. These three stars are marked in Table 1 with 'This work'.
As the 14 stars of the interferometric sample include only two M dwarfs with T eff < 3440 K, we supplemented them with five mid-to-late M dwarfs listed in Table 2, for which a good T eff estimation is available in the literature, as recommended by Passegger et al. (2022). This was done in order to obtain training and validation sets with regularly spaced parameters. To achieve such a regular distribution, we binned the to-be-transferred data set of 19 M dwarfs into a variable number of bins. The goal was to have as many as 75 % of the bins nonempty. For each of those bins, we selected a representative element. If two or more stars fell into one bin, we picked the one closest to the midpoint of the bin.
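The bin-and-pick selection described above can be sketched as follows; the T eff values here are made up for illustration (the actual star lists are in Tables 1 and 2).

```python
import numpy as np

def select_representatives(values, n_bins):
    """Bin a 1D parameter list and keep, per nonempty bin, the element
    closest to the bin midpoint, yielding a regularly spaced subset."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    # np.digitize assigns each value to a bin index in 1..n_bins
    idx = np.clip(np.digitize(values, edges), 1, n_bins) - 1
    picked = []
    for b in range(n_bins):
        members = np.where(idx == b)[0]
        if members.size:            # skip empty bins
            best = members[np.argmin(np.abs(values[members] - mids[b]))]
            picked.append(int(best))
    return picked

# Hypothetical T_eff values [K] for a to-be-transferred sample
teff = [3050, 3100, 3220, 3230, 3440, 3450, 3460, 3700, 3900, 3950]
reps = select_representatives(teff, n_bins=5)
print([teff[i] for i in reps])
```

Varying `n_bins` trades sample size against regularity, which is how the fraction of nonempty bins can be tuned toward the 75 % target.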
For the metallicity, [M/H], the adopted strategy followed the same structure as for T eff. We turned to metallicities measured for FGK stars with a common proper motion M-dwarf companion. Because both components formed from the same cloud, it is assumed that they share the same metallicity (Desidera et al. 2006; Andrews et al. 2018). Table 3 presents five M dwarfs with CARMENES spectra whose FGK primaries have known metallicities from Montes et al. (2018), and one from Tabernero et al. (2022), namely J14251+518 (θ Boo B). Due to the low number of multiple systems, our transfer strategy is to use the list of 18 stars from Passegger et al. (2022) in Table 4 combined with those listed in Table 3 as D_t. Although the values from Passegger et al. (2022) are not as accurate as those from binaries, they do not depend strongly on a specific model because they were calculated as medians of several literature values.
To avoid the potential lack of generalization linked to accepting models based solely on their performance on the validation set, we propose a more robust methodology, designed around two separate groups of samples: one set for training and validation, and the rest of the stars as a test group to measure the quality of the models. Indeed, because a single sample can have a relatively large influence on the model performance, owing to the low number of samples in the training and validation subsets, we used the cross-validation training approach. Cross-validation is a data-resampling method to assess the generalization ability of predictive models and to prevent overfitting (Refaeilzadeh et al. 2009). Briefly, the data are usually divided into two segments: one used for training a neural network model and one used for validating the trained model. The basic form of cross-validation is k-fold cross-validation, where the data are divided into k folds (in our case, k = 4) before k iterations of training and validation are conducted; in each iteration, a different fold is kept aside for validation, while the remaining k − 1 folds are used for training. DTL model creation requires a quality criterion to assess the learning progress of the ANN. The quality criterion often adopted is a threshold on the loss error during the validation process (the loss function is widely used in mathematical optimization and decision theory). To obtain a sufficient variety of models, given the randomness in the selection of samples and in the optimization starting point, several repetitions of the model-creation process were accomplished. As a fourfold cross-validation strategy was adopted, four potentially valid models were created per repetition of the model-creation process, and four corresponding validation loss errors were measured at the end of the training processes.
The trained models were deemed of sufficient quality when the validation error was lower than 0.01. In order to have a significant set of models in the case of metallicity, 80 repetitions of the model-creation process were adopted, which means 80 × 4 = 320 potential models. Only 121 of these reached convergence under the adopted criterion, and they were later used for predictions. In the case of T eff, 20 repetitions of the model-creation process were adopted, and all 20 × 4 = 80 potential models reached convergence. This behavior indirectly shows that, as expected, T eff has a stronger impact on the spectra than metallicity.
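The repetition-and-filtering scheme above can be sketched compactly. Here `train_fold` is a hypothetical stand-in that returns a mock random validation loss; in the real procedure it would be one DTL training run on one cross-validation fold.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_fold():
    """Stand-in for one DTL training run; returns a mock validation loss."""
    return rng.exponential(0.01)

def run_campaign(n_repetitions, k=4, threshold=0.01):
    """k-fold cross-validation repeated n times; keep only models whose
    validation loss falls below the convergence threshold."""
    losses = [train_fold() for _ in range(n_repetitions * k)]
    converged = [loss for loss in losses if loss < threshold]
    return len(losses), len(converged)

# Metallicity case from the text: 80 repetitions x 4 folds = 320 candidates
total, kept = run_campaign(n_repetitions=80)
print(total, kept)
```

With real training runs, the fraction of kept models directly measures how often the transfer converges for a given parameter, which is the behavior contrasted above for T eff versus [M/H].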
It is straightforward to build a model that is perfectly adapted to the data set at hand but unable to generalize to new and unseen data. Therefore, the value of measuring model quality on an independent data set becomes evident. In this way, the model performance can be assessed externally, complementing the estimate provided by the information gathered during the validation step (Vabalas et al. 2019).

Observational sample
To test our DTL method, we used the same template spectra as in Passegger et al. (2019) and applied our method to all 282 M dwarfs listed in their Table B.1, plus four more stars coming from the independent interferometric sample used for the learning process. We focus here on a small sample of the Galactic stellar population, namely M dwarfs of spectral types M0 to M6. To verify whether or not our method is able to generalize beyond this parameter range, it will be necessary to apply it to a much larger stellar sample, such as APOGEE or Gaia, which shall be part of a subsequent study. The stars were observed with CARMENES on the Zeiss 3.5 m telescope at the Observatorio de Calar Alto, Spain. CARMENES combines two highly stable fiber-fed spectrographs covering a spectral range from 520 nm to 960 nm in the optical (VIS) and from 960 nm to 1710 nm in the NIR, with spectral resolutions of R ≈ 94 600 and 80 400, respectively (Quirrenbach et al. 2018; Reiners et al. 2018). The primary goal of this instrument is to search for Earth-sized planets in the habitable zones of M dwarfs (e.g., Zechmeister et al. 2019).
For a detailed description of our data-reduction procedure, we refer to Zechmeister et al. (2014), Caballero et al. (2016), and Passegger et al. (2019). As in the latter, we used the high-S/N template (co-added) spectrum for each star. These templates are a byproduct of the CARMENES radial-velocity pipeline serval (SpEctrum Radial Velocity AnaLyser; Zechmeister et al. 2018). In the standard data flow, the code constructs a template for every target star from at least five individual spectra and derives the radial velocity of a single spectrum by least-squares fitting to the template. For our sample, the average S/N of the échelle order at whose beginning our investigated wavelength window of 8800-8835 Å lies amounts to 258 ± 158.
Before creating the templates, the NIR spectra were corrected for telluric lines. We did not apply the telluric correction to the VIS spectra because the telluric features are negligible in the investigated range. The telluric correction was explained in detail by Nagel et al. (2020). For the normalization of our spectra, we used the same method and routine as in Passegger et al. (2020), the Gaussian Inflection Spline Interpolation Continuum (GISIC, https://pypi.org/project/GISIC/, developed by D. D. Whitten and designed for spectra with strong molecular features). After the spectrum was smoothed with a Gaussian and continuum points were selected, the pseudo-continuum was normalized with a cubic spline interpolation. We applied the same procedure to both observed and synthetic spectra within the spectral window 8800-8835 Å, adding 5 Å on each side to avoid possible edge effects. The observed spectra were corrected for radial velocity to match the rest frame of the synthetic spectra using the cross-correlation (crosscorrRV from PyAstronomy; Czesla et al. 2019) between a PHOENIX model spectrum and the observed spectrum. To obtain a universal wavelength grid, which is necessary for applying the DL method, the observed spectra were linearly interpolated onto the wavelength grid of the synthetic spectra.
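Conceptually, the radial-velocity correction amounts to maximizing the cross-correlation between the observed spectrum and Doppler-shifted copies of a model template, then resampling the observation onto the model grid. The numpy-only sketch below mimics what PyAstronomy's crosscorrRV does (it is not that routine), using a toy Gaussian absorption line rather than a real PHOENIX model.

```python
import numpy as np

C_KMS = 299792.458  # speed of light [km/s]

def rv_by_crosscorr(wl_obs, fl_obs, wl_syn, fl_syn, rv_grid):
    """Return the trial RV [km/s] that maximizes the cross-correlation."""
    ccs = []
    for rv in rv_grid:
        # Doppler-shift the synthetic template and resample onto wl_obs
        shifted = np.interp(wl_obs, wl_syn * (1.0 + rv / C_KMS), fl_syn)
        ccs.append(np.sum(fl_obs * shifted))
    return rv_grid[int(np.argmax(ccs))]

# Toy spectra: one Gaussian line at 8815 A; "observation" shifted by +30 km/s
wl_syn = np.linspace(8795.0, 8840.0, 4000)
fl_syn = 1.0 - 0.5 * np.exp(-0.5 * ((wl_syn - 8815.0) / 0.15) ** 2)
wl_obs = wl_syn.copy()
fl_obs = np.interp(wl_obs, wl_syn * (1.0 + 30.0 / C_KMS), fl_syn)

rv_grid = np.arange(-100.0, 100.5, 0.5)
rv = rv_by_crosscorr(wl_obs, fl_obs, wl_syn, fl_syn, rv_grid)
# Shift the observation to the rest frame and regrid onto the synthetic grid
fl_rest = np.interp(wl_syn, wl_obs / (1.0 + rv / C_KMS), fl_obs)
print(rv)
```

The final `np.interp` call is the linear interpolation onto the universal wavelength grid mentioned above.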

Transferred knowledge
In our particular case, where the distance between domains is significant (see Fig. 3) and the sample density in D_t is limited (see Tables 1 and 3), the network-based approach was selected as the appropriate DTL method. In Fig. 6, the stars with interferometric T eff are regularly distributed in T eff along the whole CARMENES data set. This regular coverage is needed and must be checked before transferring any knowledge from another study, as DL and DTL techniques are not good at extrapolating beyond the limited stellar parameter range of our transferred and training sets.
In terms of the terminology introduced in Sect. 2.1, our D_s domain was built over the PHOENIX-ACES spectral library (Husser et al. 2013) with a flux window of 35 Å between 8800 Å and 8835 Å in the VIS channel, for consistency with previous work (Passegger et al. 2020, 2022). Furthermore, T_s is the DL model that minimizes the error on a test set of unused PHOENIX-ACES spectra. In other words, the DL model selected for transferring the feature space is the best one of those trained on synthetic PHOENIX-ACES spectra. The definition of D_t for the two stellar parameters is introduced in Sect. 2.1. From this, up to 80 different transfer models were created. If their training process reached convergence, they were selected to contribute to the prediction of the stellar parameters of the stars in the test set. Proposals from the different transferred models are collected and integrated using the kernel density estimate (KDE) technique. This technique allows us to establish not only the most frequent value of the stellar parameter but also its uncertainty, which depends on the star and the flux window. This KDE estimation can be seen as the predictive function f_T(·).

Implementation
As already explained in Sect. 2.1, transfer learning is an approach in DL (and ML) in which knowledge is transferred from one model to another. This means that a properly trained DL model is the first step, as presented in Fig. 5. For the network-oriented DTL approach, we kept the features selected by the DL model, which means freezing the convolutional layers and providing a new deep ANN configuration. This enabled us to train the weights of the connections for the stellar parameters according to the interferometric measurements, the binary companion estimations, or both. The architectures used for the convolutional layers and the deep neural network are presented in Tables 5 and 6, respectively.
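In Keras terms, this freezing amounts to loading the trained CNN, setting trainable=False on its convolutional layers, and attaching a fresh dense head. Since the actual architecture lives in Tables 5 and 6, here is a framework-free numpy sketch of the same idea: a frozen feature extractor (a fixed random projection standing in for the trained convolutions) whose output feeds a small trainable head fitted on the transfer domain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "convolutional" feature extractor: weights fixed after DL training
W_frozen = rng.normal(size=(200, 16)) / np.sqrt(200)

def features(x):
    """Frozen part of the network: never updated during DTL."""
    return np.tanh(x @ W_frozen)

# Transfer domain: a handful of stars with independent parameter estimates
X_raw = rng.normal(size=(20, 200))            # mock transfer-domain "spectra"
y = features(X_raw) @ rng.normal(size=16)     # mock target parameters

# Only the new head is trained (closed-form least squares here for brevity;
# the paper instead trains a multi-layer head over many epochs)
Phi = features(X_raw)
w_head, *_ = np.linalg.lstsq(Phi, y, rcond=None)

mse = float(np.mean((Phi @ w_head - y) ** 2))
print(mse < 1e-8)
```

Because `W_frozen` never changes, only the head weights `w_head` carry the transferred knowledge, which is exactly the division of labor described above.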
Once the configuration has been defined, the training process takes place. Because the convolutional layers are frozen, the evolution of the models is due to the update of the variable weights alone, which requires a larger number of iterations than training the whole convolutional neural network (CNN) at once. Indeed, specific attention must be paid to the cross-validation strategy used to avoid overfitting, which means that as many models as folds are required. Different numbers of folds were tested in the cross-validation process (from three to five), as well as different feature bases derived from different DL models, which required a significant number of repetitions of the training procedure before a set of DTL models could be proposed.
In the present case, and because of the limited number of available samples, the best method for measuring the global quality of the whole data set used four cross-validation folds, which were then adopted for the implementation. Therefore, the number of computing operations in the training step is expected to be large, and advanced computing capabilities are targeted to keep the effort bounded. As the operations are tensor-based computations, the use of existing frameworks can save a lot of time.
The adoption of the TensorFlow framework (Abadi et al. 2016) for the creation of DL models enables the use of accelerated hardware based on Nvidia general-purpose graphics processing unit (GPU) cards, which outperform the central processing unit (CPU) in terms of computation time by around a factor of 20 (Mittal & Vaishay 2019). As the base DL models were selected from those trained in Passegger et al. (2020) and were able to identify the best features, the same framework was retained for the current implementation of DTL. The features were extracted from the aforementioned models, and a completely new deep ANN was configured and trained on the new set of spectra with better stellar parameter estimations. In this particular case, because training involved only the adjustment of the deep ANN weights, several thousand epochs were required to converge on the adapted function.
In this application, we used GPU cards with 11 GB of RAM and 4352 computing cores. The training time for a model experiencing proper convergence depended on the training data size, but also on the architecture and number of epochs, and varied between 45 minutes and two hours. For a more detailed description of the general design of a DL neural network, we refer to Passegger et al. (2020).
The same methodology as used by Passegger et al. (2020) for uncertainty estimation was applied here: the parameter estimations from each DTL model were collected and the probability density function was determined using KDE (Scott 2015; Terrell & Scott 1992; Wang & Li 2017). The maximum of this probability density function was retained as a confident estimation of the parameter. This was done for each star and stellar parameter separately. To provide the uncertainty for each star and parameter, the ±1σ thresholds of the predictions were calculated.
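A minimal numpy sketch of this aggregation step (assuming a Gaussian kernel with a Scott/Silverman-type bandwidth; the function name and mock prediction values are illustrative, not the paper's implementation) could look like:

```python
import numpy as np

def kde_peak_and_sigma(predictions):
    """Aggregate an ensemble of per-model predictions: the KDE maximum is
    taken as the adopted value, and the 15.87%/84.13% quantiles give the
    (possibly asymmetric) -/+ 1 sigma bounds."""
    p = np.asarray(predictions, dtype=float)
    h = 1.06 * p.std() * len(p) ** (-1 / 5)      # rule-of-thumb bandwidth
    grid = np.linspace(p.min(), p.max(), 1000)
    # unnormalized Gaussian-kernel density evaluated on the grid
    dens = np.exp(-0.5 * ((grid[:, None] - p) / h) ** 2).sum(axis=1)
    best = grid[np.argmax(dens)]
    lo, hi = np.quantile(p, [0.1587, 0.8413])
    return best, best - lo, hi - best

# e.g. Teff predictions (in K) from an ensemble of DTL models (mock numbers)
preds = np.random.default_rng(1).normal(3500.0, 20.0, size=80)
teff, err_minus, err_plus = kde_peak_and_sigma(preds)
```

Since the lower and upper bounds come from quantiles rather than a fitted Gaussian, the resulting uncertainty interval does not have to be symmetric around the KDE maximum.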

Results and discussion
We introduced an algorithm-independent assessment of the precision of the T eff and [M/H] predictions. This was carried out with the testing data set, in which stars not seen by the model during the knowledge transfer were used as the gold standard for assessing its estimation of stellar parameters. First, we applied the DL method presented by Passegger et al. (2020) by selecting the best DL model that predicts T eff from PHOENIX-ACES. In doing so, the hypothesis that the derived models use the most relevant feature set after the convolution step remains acceptable. This first step also provided a set of comparisons with stellar parameters derived from DL.
Following the steps indicated in Sect. 3.2 and starting from the best DL model, the network-based transfer procedure was carried out with the samples from Tables 1 and 2. In this procedure, 14 stars were used during the four-fold cross-validation approach, keeping aside the remaining five stars in Table 7. After creating the most suitable model, its quality was assessed by measuring the residual errors with respect to the available interferometric information. The results for the five stars kept out of the training and validation process of DTL were then used for an independent quality assessment. In the left panel of Fig. 7, we compare the interferometric values for these five stars with results from DTL, DL, and Schweitzer et al. (2019, Schw19), as all these studies used PHOENIX-ACES models. For more than half of the stars, the accuracy of the DTL approach exceeds that of the DL and Schw19 approaches; only for two stars (J15194-077 and J09144+526) do the literature results lie closer to the interferometric values than those of DTL.
Using the same strategy as for T eff , the [M/H] data from Tables 3 and 4 enabled us to use 18 stars in a four-fold cross-validation approach, this time keeping another six stars aside (see Table 8). The right panel of Fig. 7 shows that the accuracy of the best deep transferred model for those six stars is significantly better than that of other existing estimations. For only one star (J22565+165) does DL give a parameter estimation closer to the reference value than DTL. For the rest of the stars, DTL produces more accurate results than the other methods.
Finally, we quantitatively assessed the accuracy improvements of our method with respect to previous implementations. For metallicity, the median absolute differences (MADs) with respect to the reference values are 0.07 dex for DTL, 0.27 dex for DL, and 0.13 dex for Schw19. On the other hand, for T eff , Schw19 provides the smallest MAD with 60 K, while DTL outperforms DL, with differences of 91 K and 110 K, respectively.
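The MAD statistic used above is straightforward to compute; the following sketch uses purely illustrative numbers, not the paper's measurements:

```python
import numpy as np

def median_abs_diff(estimates, reference):
    """Median absolute difference (MAD) between a method's estimates
    and the adopted reference values."""
    return float(np.median(np.abs(np.asarray(estimates, dtype=float)
                                  - np.asarray(reference, dtype=float))))

# purely illustrative metallicity values (dex), not from the paper
ref = [0.00, -0.10, 0.20, -0.30, 0.15]
est = [0.10, -0.30, 0.25, -0.00, 0.08]
mad = median_abs_diff(est, ref)   # |diffs| = 0.10, 0.20, 0.05, 0.30, 0.07 -> 0.10
```

The median (rather than the mean) of the absolute differences keeps a single outlying star from dominating the comparison between methods.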

Uncertainties
Up to this point, this paper has proposed a technique to transfer the features identified as relevant according to the stellar parameter of interest and the selected flux window. This technique uses precise estimations of stellar properties from interferometry and binary observations to adapt the knowledge domain based on the previous DL feature identification. Our technique provides good estimations for the stars used for quality control, as it outperforms the reference estimations for the sample of selected stars in the majority of cases.
When the same set of features is transferred through a different model to adapt the knowledge domain, a different subset of features (components of the feature vector) can be selected. This leads to different parameter estimations after a converged training process, which reflects the effects of the different components of the feature vector. Therefore, we defined uncertainties in the DTL process for estimating stellar parameters as ±1σ around the most frequently predicted value for each parameter. In this way, each star has its own uncertainty interval, which does not have to be symmetric.
The DTL process can be repeated several times, making it possible to obtain different transferred models, each of them weighting the different extracted features differently. Our proposal was to retain the models that are above a specific quality threshold and then to aggregate their estimates using an integrated KDE technique, as mentioned in Sect. 3.2. An example for T eff and for [M/H] is shown in Fig. 8. Indeed, the observed shapes allow us to discuss the number of DTL models considered to produce good-quality estimations. We selected 80 DTL models as estimators, which is a matter of design. However, for some stars, more evidence would be needed to reduce the uncertainty level, as illustrated in the right half of the left panel in Fig. 8, where the tail increases the uncertainty value for the 84% quantile. The estimated uncertainties for the selected quality stars are presented in Table 7 for T eff and in Table 8 for [M/H].
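The model-selection step that precedes the KDE aggregation can be sketched as follows (the function name, score values, and star predictions are hypothetical; only the threshold-then-pool logic follows the text):

```python
import numpy as np

def pool_quality_models(preds_per_model, scores, threshold):
    """Keep only the DTL models whose validation score passes the quality
    threshold and pool their predictions for the KDE aggregation step."""
    preds = np.asarray(preds_per_model, dtype=float)
    keep = np.asarray(scores, dtype=float) >= threshold
    return preds[keep].ravel()

# mock example: 4 models x 3 validation stars, two models pass the cut
preds = [[3500., 3510., 3490.],
         [3700., 3720., 3680.],
         [3505., 3495., 3500.],
         [3300., 3310., 3290.]]
scores = [0.9, 0.4, 0.8, 0.3]
pooled = pool_quality_models(preds, scores, threshold=0.7)
```

The pooled predictions from the retained models are what feed the KDE, so a larger ensemble of retained models directly translates into a better-sampled density and, usually, tighter quantile-based uncertainties.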
After testing the quality of the DTL models for estimating T eff and [M/H], we applied them to the entire CARMENES data set. The outcome can be found in Table A.1. Stars with high rotational velocities, and therefore with significantly higher uncertainties, were included even though the training data set did not contain this type of object. In Table A.1, there are several parameters with significantly small uncertainties, even smaller than those of the training sample (e.g., ∆T eff = 10 K). The uncertainties that we provide for the estimated stellar parameters refer to the internal error of the method, and therefore they do not take into account uncertainties from the interferometric and binary training samples or from the synthetic gap. We also present the Pearson correlation coefficient r P , which we use to assess the goodness of the correlation. A summary can be found in Table 9. Figure 9 presents the comparison with literature values for T eff . The top panel shows results from DL, Marfil et al. (2021, Mar21), and Schw19. All three studies used CARMENES data and show a similar pattern, with hotter temperatures for stars below 3500 K and above 3800 K compared to DTL. For DL, the dispersion is larger than for the other two methods, which also shows in a slightly smaller r P of 0.87, compared to 0.88 for Mar21 and 0.90 for Schw19. The main factor leading to this effect in DTL is the limited number of samples with temperatures above 4000 K in the training and validation sets.
Another possible explanation could be a change in opacity at around 3300 K, but a detailed analysis of the synthetic model structures would be necessary to come to any robust conclusions.
In the middle panel, all literature references determined T eff by fitting BT-Settl synthetic models (Allard et al. 2011) to VIS spectra. Gaidos & Mann (2014, GM14) additionally used spectral curvature indices in the K band when no VIS spectra were available. This difference perhaps explains a general trend towards hotter temperatures compared to DTL (on average +121 K, with r P = 0.92). Gaidos et al. (2014, Gaid14) and Mann et al. (2015, Mann15) achieve results similar to DTL, with r P = 0.86 and 0.94, respectively. The correlation of Lépine et al. (2013, Lep13) is weaker, with r P = 0.81, providing some cooler values at the hot end of the T eff scale. An offset of the Nev14 values with respect to our T eff values also exists when compared to other literature works, which indicates an intrinsic underestimation of temperatures by Nev14. Kha20 determined equivalent widths (EWs) of Mg (1.57 µm, 1.71 µm), Al (1.67 µm), and the H 2 O index in the H band, and performed linear regression on 12 M-dwarf calibrator stars with interferometrically measured T eff to derive a temperature relation. The standard deviation of the residuals of the calibrators amounted to 102 K. However, the spread with respect to our results and other literature studies is much larger, with a standard deviation of 261 K. An indication of a higher deviation in the comparison with the literature can already be seen in Fig. 5 of Kha20, where these authors compare their results with T eff from Mann15 and RA12 and measure standard deviations of 164 K and 158 K, respectively. At this point, we cannot give a clear explanation for this behavior. Overall, the correlation between DTL and the literature is quite good, except for Mar21, Schw19, and Kha20, where we are not sure about the source of the differences seen. Figure 10 presents the same comparisons for metallicity. As explained by Passegger et al. (2020, 2022), the results from Schw19 are consistent overall with DTL, with r P = 0.54, although they exhibit a certain spread.
The middle panel of Fig. 10 compares similar determination methods from the literature. RA12 and Kha20 derived [Fe/H] using the H 2 O-K2 index and equivalent widths (EWs) of Na i and Ca i in the NIR. Nev14 also incorporated a relation with EWs, but only Dittmann et al. (2016, Ditt16) determined [Fe/H] from a color-magnitude-metallicity relation. This might explain the large spread of the latter values, resulting in r P = 0.50. Results from Nev14 are overall more metal-poor than those provided by DTL and show a large spread as well, but a better correlation than Ditt16, with r P = 0.71. However, values from RA12 and Kha20 correspond even better with DTL, showing r P = 0.86 and 0.81, respectively.
The literature values shown in the bottom panel of Fig. 10 were determined using empirical relations between atomic line strength, Na i and Ca i EWs, the H 2 O-K2 index, and metallicity, calibrated with FGK+M binaries based on the relationships of Mann et al. (2013a,b, 2014). The values provided by GM14, Mann15, and Gaid14 are highly correlated, with those of Gaid14 showing the least spread (r P = 0.89, 0.87, and 0.72, respectively). Values from New15 are generally slightly more metal-poor, with a smaller correlation coefficient of r P = 0.69. At higher metallicities, Terrien et al. (2015, Ter15) show a larger spread and some outliers at both ends; the correlation coefficient is the same as for New15. Similar to T eff , the DTL values for metallicity correspond well with most of the literature, and an improvement with respect to DL can be appreciated, which is very promising.
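The Pearson correlation coefficient r P quoted throughout these comparisons can be computed directly; the numbers below are mock values chosen for illustration only:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient r_P, used here to quantify the
    agreement between DTL results and literature values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# illustrative mock Teff comparison (K); values are not from the paper
lit = [3200., 3400., 3600., 3800., 4000.]
dtl = [3220., 3380., 3650., 3790., 3950.]
r = pearson_r(lit, dtl)
```

Note that r P measures linear correlation only; two methods can be tightly correlated (high r P) while still carrying a systematic offset, which is why the offsets and spreads are discussed separately above.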

Summary and conclusions
We present a DTL neural network technique that improves the estimation of the stellar parameters T eff and [M/H] for M dwarfs from high-S/N, high-resolution optical spectroscopy obtained with CARMENES. The initial DL model was trained with PHOENIX-ACES synthetic spectra, which have the advantage that a sufficient number of spectra with known stellar parameters can be generated. Based on the DL convolutional features, different DTL models were trained and tested. For a more robust procedure, a cross-validation scheme was adopted. The proposed technique could help to bridge the synthetic gap affecting stellar parameter estimation based on synthetic libraries. However, a larger stellar sample covering a wider spectral range is needed to verify this.
Before applying the created models to a large data set, we defined an independent quality assessment procedure based on specific stars for which high-quality stellar parameter estimations are available. This assessment shows that the DTL technique has good prediction capabilities. In addition, we incorporated an uncertainty estimation procedure that considers the diversity of estimates from the different transferred models, as well as an aggregation procedure. Such an estimation is flux-window dependent, but also star dependent, because the trained DTL models below the convergence threshold depend on both.
Another relevant aspect to be considered for parameter estimation concerns the selected flux windows, as they have their own influence on the feature vector and, in the end, on the estimated parameters. A possible continuation of this line of research is the application of reinforcement learning techniques based on the behavior of the selected flux windows.
Finally, and importantly, a limitation of the proposed method is the parameter range of the transferred knowledge. This means that the parameters of stars with large v sin i values cannot be estimated rigorously with this technique, as no interferometric or FGK-companion reference values are available at such large rotational velocities. In the same way, the transferred knowledge works only for T eff higher than 3100 K, as no cooler stars were part of the training set yet. Therefore, our current analysis is limited to spectral types between M0 V and M6 V.
In summary, we propose an innovative technique whose predictions will gain in value as new high-quality stellar parameters, namely T eff from interferometry and [M/H] from FGK+M systems, become available in the near future. The current data sample is close to the operational limits of the technique, and in some cases the data set was complemented with stellar parameters estimated from the literature. Therefore, improvements are expected when more stars with highly reliable stellar parameters and high-resolution spectra become available. In the meantime, the lack of a sufficiently large number of samples is a limitation of the technique.