Unsupervised Domain Adaptation for Constraining Star Formation Histories

The prevalent paradigm of machine learning today is to use past observations to predict future ones. What if, however, we are interested in knowing the past given the present? This situation is indeed one that astronomers must contend with often. To understand the formation of our universe, we must derive the time evolution of the visible mass content of galaxies. However, to observe a complete star life, one would need to wait for one billion years! To overcome this difficulty, astrophysicists leverage supercomputers and evolve simulated models of galaxies till the current age of the universe, thus establishing a mapping between observed radiation and star formation histories (SFHs). Such ground-truth SFHs are lacking for actual galaxy observations, where they are usually inferred -- with often poor confidence -- from spectral energy distributions (SEDs) using Bayesian fitting methods. In this investigation, we discuss the ability of unsupervised domain adaptation to derive accurate SFHs for galaxies with simulated data as a necessary first step in developing a technique that can ultimately be applied to observational data.


Introduction
In recent times, many transfer problems have arisen, and various methods exist to solve them.We can broadly classify these into three categories.First, supervised domain adaptation, where only a few labeled target data are available (Dai et al. 2007;de Mathelin et al. 2020;Motiian et al. 2017a,b).Second, semi-supervised domain adaptation, where in addition to these labeled data, a large amount of unlabeled data is available (Kumar, Saha, and Daume 2010;Saito et al. 2019;Tzeng et al. 2015).Finally, unsupervised domain adaptation, where only unlabeled data is available in the target (Ganin et al. 2016;Huang et al. 2007;Richard et al. 2020;Saito et al. 2018;Sugiyama et al. 2007).Researchers have shown that adding target labels can increase the performance of the model (Motiian et al. 2017a).This setting is also the most encountered in practice: (Cortes, Mohri, and Medina 2019), as it is often possible to label at least a few target samples.
Figure 1: Radiation spectrums with corresponding star formation histories can be recorded from cosmological simulation to form a dataset (X, y).This dataset is used to learn a machine learning model.Because of the domain shift between real radiation spectrums and simulated ones, domain adaptation is used to adapt the learned machine learning model to the real observations.However, there are cases in which getting supervised data is not possible.Such is the case of the present astrophysics problem of SFH prediction.Here, we aim to learn the history of our Universe through the evolution of its galaxies' masses throughout their individual histories.We try to obtain this history from the radiation that reaches us on earth -the Spectral Energy Distribution (SED), a function of the brightness of a galaxy with the wavelength of observation.
As galaxies evolve over billions of years, it is impossible to completely capture any single galaxy's SED as a function of time.Astrophysicists then generate models of light through physically-motivated mechanisms, which allow building artificial star formation histories with the corresponding radiation.One of the objectives of these artificial data sets is to find the correspondence between the radiation and the SFH.We try to find a function that can robustly perform this mapping.Classically, Bayesian models have been developed : PROSPECT (Robotham et al. 2020), MAGPHYS (da Cunha, Charlot, andElbaz 2008), CIGALE (Noll et al. 2009), BAG-PIPES (Carnall et al. 2018), PROSPECTOR (Johnson and Leja 2017), and numerous others, more recently MIRKWOOD, (Gilda, Lower, and Narayanan 2021) showed that deep learning approaches have the potential to infer SFHs from SEDs.
A significant challenge with applying all the above models arises due to the 'domain shift' between simulated and real galaxies.Even a deep neural network (DNN) trained only on the synthetic data sets will falter when predicting the SFHs of real galaxies.Therefore, in this work, we investigate the use of unsupervised domain adaptation as an alternative method of extracting SFHs for individual galaxies without being limited by the simplifications of physical models and parametrizations of conventional SED-fitting techniques.We turn toward cosmological, hydrodynamic simulations, including EAGLE (Schaye et al. 2015a), ILLUSTRISTNG (Nelson et al. 2018;Pillepich et al. 2018) and SIMBA (Davé et al. 2019).These simulate a volume of the Universe from shortly after the Big Bang to the present day, recording the evolution of dark and baryonic matter over time.Galaxies formed within simulations can be studied as a proxy for real galaxies and can be compared directly to observations.They can thus serve to augment our understanding of the evolution of the Universe.

Data
Our training and test data sets consist of SEDs from three state-of-the-art cosmological galaxy formation simulations, where the true physical properties are known -including their true star formation histories.SIMBA (Davé et al. 2019), EAGLE (Schaye et al. 2015b;Schaller et al. 2015;McAlpine et al. 2016), and ILLUSTRISTNG (Vogelsberger et al. 2014), with 1,688, 4,697, and 9,633 samples respectively, together constitute a diverse sample of galaxies with realistic growth histories.Specifically, we select galaxies at a redshift of 0this corresponds to simulated galaxies at the "present" epoch.Each SED is comprised of 20 measurements of the "brightness" of a galaxy at different wavelength, in the form of flux density (with units of Jansky) and are the co-variates or features that we train/learn on.The outputs/labels are star formation history (SFH) time series vectors with 29 scalar elements, in units of solar mass per year (M yr −1 ).
The SFH for a galaxy informs us about the net stellar mass generated as a function of time -sum of masses of all stars born, less the sum of stellar mass lost in stellar winds as stars age and eventually die.In other words, SFHs are plots of the star formation rates (SFRs) of galaxies against the lookback time, which for galaxies at z = 0 extends from 0 to the very age of our Universe (∼ 13.8 Gyrs).By stacking (adding) together SFHs of several thousand galaxies from a simulation, and dividing by the volume of the simulated box (i.e. the size of the simulated universe), we can derive the cosmic star formation rate density (CSFRD), which is a wellstudied global property of our Universe.Astrophysicists aim to tune various physical properties in any given simulation until the the CSFRD plot derived from it matches closely to the most-widely accepted one derived from observations (Madau and Dickinson 2014). 2e sequentially train on any two of these simulated data sets, and predict on the third, thus giving us three sets of source-and target-domain data and results.

Methodology
We consider the problem of prediction of star formation history (SFH) where the learner has access to a data set X ∈ R n×p encoding the radiations of n galaxies with p the number of filters/wavelength-bins, and a data set Y ∈ R n×T giving the corresponding SFH for each galaxy with T is the time length of the history.Each SED (row of X) is comprised of p = 20 measurements of the "brightness" of a galaxy at different wavelength, in the form of flux density (with units of Jansky) (see Figure 2).Below are our pre-processing steps: 1. First, we create three sets of experiments.For each, we use two galaxies in the training and validation sets (with a 9:1 split) and the third galaxy in the test set.2. Second, we normalize each SFH time series (each row of Y ) by its sum and store the resultant normalized SFH (SFH norm ) and the sum (SFH sum ) separately.This step is needed because of the large dynamic range of the various star formation histories (see Appendix D): SFH curves have a large variety of scales (some increasing to more than 100 whereas others never increase over 0.1.By scaling each SFH time series, we make the learning of the curve trend easier.The learning of the SFHs is now decoupled in the learning of SFH norm and SFH sum .3. Third, we further ease the learning of the SFH norm by reducing the curves to their first 3 Kernel-PCA components (Soentpiet et al. 1999).The choice of this decomposition method is motivated by extensive experimentation that showed that Kernel-PCA beat both linear PCA (Jolliffe and Cadima 2016) and discrete wavelet transform (Shensa 1992) in their ability to recreate the original time series successfully.For each of the three experiments, we provide the Kernel-PCA with a wide range of hyperparameters and pick the ones that can recreate the original SFH time series back, judged according to the DILATE loss metric (Le Guen and Thome 2019) (see Appendix A and Figure 4).The DILATE similarity metric between two time series is defined as an equally weighted average of the dynamic time warping (DTW) and the temporal distortion index (TDI) similarity scores.The number of principal components (3) was chosen by selecting the smallest set that explains at least 80% variance in the validation set.We refer to these kernel-PCA components as SFH kPCA .4. Fourth, we normalize the input features (the columns from X) via log-scaling, and follow this up by standard scaling normalization.In Figure 2 we visualize the log-scaled flux densities, in units of Janskies, for all three simulations, in 3 out of 20 filters. 5. Finally, we derive KLIEP weights for the training samples in all three experiments.
After performing these steps, we apply a domain adaptation method to correct the shift between the source and target input distributions.We choose the method KLIEP (Sugiyama et al. 2007), an instance-based method that reweights the sources in order to minimize the KL-divergence between the  The feature names at the top of each plot are the names of the filters, each centered at a different wavelength, in which photometry was simulated.We notice that for the three features shown here, all three simulations share the same support, which justifies our decision of using KLIEP.two domains.Instance-based approaches have been widely used to handle regression domain adaptation issues (Cortes and Mohri 2014;Huang et al. 2007;Mansour, Mohri, and Rostamizadeh 2009;Sugiyama et al. 2007), and are particularly robust to negative transfer (de Mathelin et al. 2020).We also visually observe on the marginal distributions for the 20 input filters that all domains have the same support in the feature space, which is the framework considered by KLIEP.Finally, KLIEP has the critical advantage of proposing an unsupervised selection procedure to select a relevant bandwidth.
For each of SFH kPCA and SFH sum , we train a 4 layer feed-forward DNN with drop-out and 256 nodes in each layer.We use ReLu as the activation function, and Adam (Kingma and Ba 2015) as the optimizer, with learning rate of 1e −3 .We train for 200 epochs, with early stopping to prevent over-fitting.

Results
There are two main ways of assessing the SFH outputs derived via our machine learning implementation: 1. Comparing the derived SFH to the true SFH for individual galaxies in the test set.2. Comparing the predicted ΣSFH to the true ΣSFH for each simulation.True ΣSFH is the sum of SFH for all galaxies in a simulation and is a critical metric enabling us to verify the correctness of input physics in a simulation.We know from observations (Madau and Dickinson 2014) that star formation in the observable Universe peaked about 2 Gyrs after its formation, and all hydrodynamical simulations must produce galaxies that satisfy this observation.By ensuring that the ΣSFH curve inferred from our neural networks matches those from the underlying simulations, we ground our predictions in science while simultaneously enabling appropriate tuning of model architectures, loss functions, and hyperparameters in case of mismatches.Such comparison also enables us to assess any systematic effects in modeling when training using one simulation and comparing predictions on another.
In Figure 3, we plot the true and derived ΣSFH curves for the three distinct test datasets (SIMBA, ILLUSTRISTNG, and EAGLE), where the training data consists of samples from the other two simulations.In these cases, the resulting sample SFHs (and thus ΣSFH) are most deviant from their ground truth vectors.This is unsurprising, as the different simulations have intrinsically different sample SFHs; this is a known difference between different simulations (as shown in Figure A1 of Bellstedt et al. 2020).Tables 1 and 2 show the five metrics tested within this work (MAE, RMSE, BE, DTW, and TDI) for average predictions of SHF, and for ΣSFH, respectively.Figures 7, 8, and 9 show examples of true and derived SFHs for individual galaxies within the simulations using KLIEP.Based on the five metrics, each row shows an example of the best-performing galaxy output on the left and the worst-performing galaxy on the right.The first thing to note here is that in most cases, the best-and worst-performing galaxies are different when we use different metrics to pick them.This discrepancy highlights the fact that each metric compares time-series differently, and hence it is difficult to pick one metric as the loss function to minimize.
Another observation made from these three figures is how well our technique can reproduce the stochastic nature of the simulated galaxies' SFHs.As star formation can be an incredibly stochastic process (as is clear from examples such as the top-left panel in Figure 7 and the top-left panel in Figure 8), SFHs can regularly fluctuate between high and low values.In general, we find that such stochastic SFHs are poorly recovered.For the sake of galaxy property analysis, the accurate recovery of overall SFH trends is more important than the recovery of individual star formation rate (SFR) epochs.In the bottom-right panel of Figure 7, for example (galaxy index 333) it can be seen that there is a star formation event early on at ∼ 10 − 12 Gyr, and then a secondary star formation event from ∼ 2 Gyr to the present day.Recovering these two main epochs is more crucial than correctly recovering the individual SFR peaks within each epoch.One way of achieving this is to temporally smooth the SFHs of individual galaxies prior to training and testing.We can justify such smoothing of simulated features given that we would never expect the derived SFHs for observed galaxies (the ultimate aim of this work) to reproduce such short-scale stochastic features.As an example of this, see Figure C1 of Robotham et al. (2020), where "good" fits to the SFHs from the semi-analytic model SHARK (Lagos et al. 2018) from the SED-fitting code PROSPECT do not recover the stochastic SFHs.Results presented here are only one part of a continuing effort to apply sophisticated data-driven methods to SED-fitting, a crucial first step for extracting galaxy properties.In future work, we will explore advanced multi-source domain adaptation approaches (Zhao et al. 2018;Richard et al. 2021) which should more robustly account for the presence of multiple source-domain datasets as well as adaptively reweight the training samples depending on the prediction task, thus providing higher quality results.We are also actively analyzing and attempting to correct systematic biases induced by our various design choices.How do our SFH inferences for galaxies of low mass compare to those of high mass?Is our modeling sufficient to accurately infer SFH from both old and young galaxies, unlike traditional parametric approaches which suffer at earlier epochs in the Universe's history?Answers to such questions will enable us to discover and correct potential biases in our predictions.
Finally, we also plan on vastly increasing the size of our simulated data to account for the immense diversity in the physics of galaxy formation and evolution modelingdifferent initial mass functions (IMFs) (for example Salpeter 1955;Kroupa 2001;Chabrier 2003), nebular emission models (for example Bruzual and Charlot 2003), mass-metallicity relationships (such as those presented by Tremonti et al. 2004;Jimmy et al. 2015;Lara-Lopez et al. 2013), among others.While this task has a relatively longer horizon-generating and saving new simulations is a compute-intensive task-it is crucial to undertake before data-driven methods such as the one proposed in this paper become trusted by astronomers.

Figure 2 :
Figure2: Kernel density estimate plots of the log of flux densities (FD) for the first three features.The feature names at the top of each plot are the names of the filters, each centered at a different wavelength, in which photometry was simulated.We notice that for the three features shown here, all three simulations share the same support, which justifies our decision of using KLIEP.

Figure 3 :
Figure 3: Global SFH predictions for the three experiments.The curves correspond to the sums over all SFH or predicted SFH.

Figure 4 :
Figure 4: Comparison of SFH reduction approaches for SIMBA.On the left, the evolution of DILATE loss between the true sfh and the reconstructed signal after one of the three transformations: DWT, PCA, kernelPCA.On the right, reconstructed signal for one of the SIMBA SFH.

Figure 5 :
Figure 5: KLIEP brings different domains closer by reweighing source samples.Top: Scatter plot of the first two PCA components in the input space.Bottom: KDE plots of log of the first two features.

Table 1 :
Forecasting results for individual SFH.The first column gives the target domain : ILLUSTRISTNG, EAGLE or SIMBA, the two other domains are used as source domains.The computed metrics between each individual SFH prediction and the corresponding groundtruth are averaged in the target domain.16 th , 50 th , and 84 th quantile values are drawn from the 50 predictions per galaxy, corresponding to the 50 neural networks used.Leading method between UDA and no-UDA (baseline) are shown in bold.

Table 2 :
Forecasting results for ΣSFH (total star formation history).The first column gives the target domain : ILLUSTRISTNG, EAGLE or SIMBA, the two other domains are used as source domains.Predictions for all samples in the target domain are summed and compared to the true ΣSFH according to the different metrics.16 th , 50 th , and 84 th quantile values are drawn from the 50 predictions per galaxy, corresponding to the 50 neural networks used.Leading method between UDA and no-UDA (baseline) are shown in bold.Metrics are scaled (MAE × 1000, RMSE × 1000, BE × 1000, DTW × 1000 for readability.