A review of unsupervised learning in astronomy

This review summarizes popular unsupervised learning methods and gives an overview of their past, current, and future uses in astronomy. Unsupervised learning aims to organise the information content of a dataset in such a way that knowledge can be extracted. Traditionally this has been achieved through dimensionality reduction techniques that aid the ranking of a dataset, for example principal component analysis or auto-encoders, or through visualisation of a high-dimensional space, for example with a self-organising map. Other desirable properties of unsupervised learning include the identification of clusters, i.e. groups of similar objects, which has traditionally been achieved by the k-means algorithm and more recently through density-based clustering such as HDBSCAN. More recently, complex frameworks have emerged that chain together dimensionality reduction and clustering methods. However, no dataset is fully unknown. Thus, much recent research has been directed towards self-supervised and semi-supervised methods that stand to gain from both supervised and unsupervised learning.


What is learning?
Learning from astrophysical data consists of extracting knowledge by constructing a mapping between the high-dimensional space of the observables and a lower-dimensional space through either an inference tool, e.g. empirical photometric redshift estimation, or an assumed model, e.g. estimating physical parameters from spectra, such as stellar mass or metallicity. An additional goal, approached as a byproduct of the learning process, is the identification of outliers. These are sources that do not conform to the inferred mapping, as measured by some appropriately defined distance or similarity measure. Outlier or anomaly detection is becoming more and more prevalent in exploiting large astronomical datasets in pursuit of transient sources (supernovae, gravitational waves), rare sources, i.e. short-lived phases of otherwise normal sources, including planet detection (Sarkar et al., 2022), and in pursuit of the unknown unknowns (Lochner and Bassett, 2021).
The process of applying machine-learning (ML) methods to any kind of data typically encompasses the following steps: 1) data gathering and calibration verification, e.g. zero points for photometric data, flux and wavelength calibration for spectroscopy, 2) pre-processing, i.e. imputation of missing values, whitening, normalization, 3) optionally, dimensionality reduction, 4) hyperparameter tuning, and 5) performance validation. The first section of Table 1 provides references to relevant reviews that are not astronomy-specific.
Largely defined by the presence of ground truth during the training process, supervised and unsupervised learning come with their distinct advantages and disadvantages. The field of supervised learning has enjoyed significant advancements in the past decade, starting with the advent of deep learning and ranging all the way to the modern transformer architecture. Supervised learning is the mapping between an input space and a known ground truth. It can be used to perform both classification and regression. However, it is impossible to extrapolate beyond the properties of the training set. Conversely, unsupervised learning encompasses methods that crystallise neighbourhood relationships in a high-dimensional parameter space, and methods that go a step further in projecting these relationships into lower-dimensional spaces that preserve, to a certain extent, local or global distances. Hence, the distinct advantage is the opportunity to discover new attributes, or new categories of objects. However, even though continuous maps or rankings of the data can be produced, the main goal of unsupervised learning is grouping the data into clusters.
Alternative approaches that fall in between the supervised/unsupervised division have also been developed, aiming to bypass the limitations imposed by small labelled training samples. These approaches include self-supervised learning and semi-supervised learning. Other methods aim to abstract information from the learned features, with the aim of pushing machine learning towards artificial intelligence. These approaches can be categorised under the general terms of transfer learning and representation learning. The second half of Table 1 provides reviews on each of these approaches without being domain-specific. We shall return to these approaches in Section §6.
Before we delve into the subject of unsupervised learning in astronomy, we point out that a summary of the ML methods discussed in this work, corresponding acronyms, and references to original papers or reviews on the subject are given in Table A.3 and Table A.4.

A brief historical perspective
Data-driven discovery is part of the astronomers' DNA. Even though the notion of 'Big Data' is a loosely defined and ever-evolving term (see Kitchin and McArdle, 2016), the significance and challenges associated with acquiring and processing large amounts of data have long-standing recognition in astronomy. In the following, we give a very brief historical context of advances in hardware, software, and data availability that have influenced astronomy since the turn of the millennium.

Early digital astronomy (pre-2000)

Among the first digitised data have been scans of photographic plates, already amounting to a few terabytes of data. These include digitised versions of Schmidt plates offered through the SuperCOSMOS Sky Survey (SSS, Hambly et al., 2001, 2004), access to plates obtained from German observatories through the Archives of Photographic PLates for Astronomical USE (APPLAUSE), while some of the plates obtained in the USA can be found through the project Digital Access to a Sky Century @ Harvard (DASCH, Grindlay et al., 2012) and the Maria Mitchell Observatory.
Pre-2000, major all-sky digital surveys were already in place, including the Two Micron All Sky Survey (2MASS, Skrutskie et al., 2006), which collected a staggering 25 TB of imaging between 1997-2001, and software packages such as IRAF had emerged for wider use in the community (Tody, 1986, 1993). However, the wider astronomical community had limited access to computing resources, and data were collected and transferred as physical copies on hard drives and compact discs. Despite the challenges, there were already some applications of unsupervised learning to spectral classification, stellar classification, and so on (e.g. Storrie-Lombardi et al., 1994; Hernandez-Pajares and Floris, 1994; Connolly et al., 1995; Lahav et al., 1996; Faundez-Abans et al., 1996; Naim et al., 1997; Bailer-Jones et al., 1998; Galaz and de Lapparent, 1998; Tagliaferri et al., 1999; Lee et al., 1999; Vilela Mendes, 1999).

Computing goes mainstream (2010-2015)
The information technology revolution has shaped a new future. Astronomy collaborations grow larger, with SDSS paving the way. By this point in time, most, if not all, universities have access to significant computing resources, laptop computers have become ubiquitous, while collaborative and open-source software development has given rise to the first releases of Astropy (Astropy Collaboration et al., 2013) and scikit-learn (Pedregosa et al., 2011). Ivezić et al. (2014) publish their book on Statistics, Data Mining and Machine Learning in Astronomy, with associated code examples ready to run on astronomical data.
Deep learning applications have created a profound change in the world. Data are getting ever larger; multi-terabyte archives exist across wavelengths (VHS, GAIA, DES, etc.), with petabyte and exascale astronomy in the works. Access to resources for testing is not an issue, as hardware keeps getting better and ever more affordable. Inference at large scale starts to become challenging.

Meta algorithms (2020-)
In later years, the focus has shifted towards 1) developing meta-algorithms and pipelines that incorporate several algorithms to achieve tailored data analysis, and 2) exploring the latest ML methods such as contrastive learning and generative models. We witness an ever-increasing interest in the latent space: its robustness, its interpretation, and its limitations (Logan and Fotopoulou, 2020). Domain adaptation methods promise to transfer knowledge across samples, while likelihood-free inference / simulation-based inference is increasing in popularity, particularly due to its speed and flexibility (e.g., von Wietersheim-Kramsta et al., 2024; Chen et al., 2023). However, as complexity increases, model interpretability is lost. At the same time, physics-aware models try to capture some of the sought-after meaning in the data (e.g., Xu et al., 2023; Moschou et al., 2023).

Machine Learning in Astronomy
Astronomy has remained, and will remain, enamoured with the possibilities of ML applications, as the field faces several of the 'Big Data' definitions (volume, velocity, variety, veracity, etc.; Kitchin and McArdle, 2016). Along with the very extensive literature, a number of very detailed reviews exist on various applications of machine learning and data-driven discovery, briefly summarised below.

Previous ML Reviews
Ball and Brunner (2010) provide an overview of data mining and knowledge extraction from databases, referring to any organised collection of data, including FITS files. This review provides useful advice on pre-processing caveats relating to astronomy applications from a practitioner's perspective, for example source association and masking, and discusses individual ML methods. Fraix-Burnet et al. (2015) reviewed ML methods used for classification specifically in extragalactic astronomy. The authors discuss both supervised and unsupervised methods for classification, describing commonly used methods and their applications in extragalactic astronomy. Huijse et al. (2015) focused on time-domain astronomy, and in particular on the challenges introduced by the scale of the anticipated LSST Survey. They showcase that a combination of unsupervised, supervised, and active learning approaches, injecting domain knowledge when necessary, is most likely needed. Baron (2019) summarised in a practical and pedagogical manner commonly used unsupervised and supervised methods, including performance metrics and instructive toy examples. El Bouchefry and de Souza (2020) provide a brief historical overview of the nascent period of ML, cover definitions of various types of learning, and provide applications in astronomy and geoscience. In Doorenbos et al. (2021) the authors present a detailed comparison of outlier detection methods, based on SDSS data.
Finally, Smith and Geach (2023) give an excellent pedagogical review of the use of neural networks in astronomy, covering developments from the early years of the multi-layer perceptron to modern applications of auto-encoders and generative models. More recently, Huertas-Company et al. (2023) provide a review of contrastive learning, a new approach to self-supervised learning.

About this review
In this review, we focus on the use of unsupervised methods in astronomy over the past 30 years. A broad search of astronomy and other sciences in the NASA/ADS database using the term 'unsupervised learning' returned just over 1,000 items. We scanned through titles and abstracts for relevant literature and reduced the number to about 500 articles and book chapters. Some of the unrelated publications concerned student supervision and mentorship relationships, learning in the educational sense, etc.
We tagged each of the 500 relevant papers by method used and area of application. A number of the papers concerned Earth monitoring (including Moon and Mars) applications, volcanic activity, and the modelling of turbulent fluids. In the following, we focus on the astrophysical applications. Some of the Sun and solar system works will be mentioned briefly, but not discussed. We noticed a general trend of silos, where communities used a specific code or approach to model the corresponding data. Hopefully this review will spark some inspiration for testing other methods. The discussion presented in the remainder of this review is of course not limited to the retrieved ∼500 papers, which only served as a starting point.
A few major areas of application emerged during the review of the literature. In addition to the works on anomaly detection briefly mentioned earlier, the analysis of the GAIA survey is at the forefront of ML applications, including the use of unsupervised methods for stellar cluster detection. In extragalactic astronomy, heavy use of unsupervised methods is found in galaxy morphology in optical and radio images, source classification (e.g., star/galaxy/QSO), and spectroscopy.
In the remainder of this work, we approach the subject neither in chronological order, nor by area of application, nor even by method. Rather, we take the more abstract approach of examining the process of learning from astrophysical data. This includes first a discussion of traditional unsupervised approaches, covering the high-dimensional space ( §2) and features ( §3), dimensionality reduction ( §4), and data clustering ( §5). Finally, we briefly discuss very recent applications of self- and semi-supervised learning and domain adaptation ( §6). We close with some recommendations on the way forward for future applications, based on the observations made in the literature and our own experience ( §7).

Self-supervision
We assume the reader is familiar with the nomenclature of machine-learning applications. For readers who might need in-depth explanations of the terminology, we point to the excellent reviews mentioned previously in Section §1.3.1. Finally, we use the term 'model' to refer both to astrophysical models, such as the spectral energy distribution of a galaxy, and to machine-learning models, i.e. the algorithm and tuned hyperparameters created to describe a dataset, e.g. a trained self-organising map. The meaning of the term will be specified only when it is not clear from the context.

Input Features and the Curse of Dimensionality
Astronomical features fall into three main categories: 1) observed, 2) deduced based on a model, or 3) deduced based on a data-driven property of the dataset at hand. The first category includes, for example, fluxes and colours, spectra, and time series. The second category includes properties usually found in a catalog produced with traditional methods, such as the Sérsic index, Gini index, metallicities, mass, etc. Finally, the data-driven derivations might include PCA components, features learned by a CNN, etc.
Astronomical and other data live in a multi-dimensional space that is impossible to visualise directly if the parameter space spans more than three to five dimensions, e.g. through the inclusion of colour and size gradients as extra dimensions. This parameter space is typically sparse and creates computational challenges. Even if the data lie on a narrow hyperplane, uncertainties will inevitably scatter the data above and below this hyperplane. The uncertainties are introduced not only by instrumental limitations, but also by the intrinsic scatter of physical properties, which in addition might occupy a continuum rather than the distinct classes that might be found in everyday physical objects.
Employing a clustering method directly on the high-dimensional space is rarely feasible or necessary, since we actually expect observed data to be correlated due to the underlying physical emission mechanisms. Thus, unsupervised learning usually starts by reducing the observed or deduced parameter space to a lower-dimensional space (see Section §4; supervised methods can also benefit from dimensionality reduction, but will not be discussed here in detail), usually into three to four dimensions (e.g. Sasdelli et al., 2016; Logan and Fotopoulou, 2020), in which clustering analysis is performed (see Section §5). The exact number of dimensions is found through experimentation, by monitoring model performance (Logan and Fotopoulou, 2020) or iteratively, e.g. through the average silhouette method in the case of k-means.
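The average silhouette method mentioned above can be sketched as follows; this is an illustrative toy example (not from the review), using scikit-learn and synthetic blobs:

```python
# Toy sketch: choosing the number of k-means clusters by maximising
# the average silhouette score over a range of candidate values of k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points

best_k = max(scores, key=scores.get)          # k with the best average silhouette
```

The same loop can be wrapped around the number of latent dimensions rather than the number of clusters, monitoring how the silhouette of the downstream clustering changes.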

Pre-processing
As is the case for supervised learning, unsupervised algorithms work best when the data have been cleaned and normalised before being presented to the algorithm, e.g. by using the StandardScaler in scikit-learn. This is common practice in the ML community. In the following, we note a few domain-specific issues relating to astronomical data.

Data Imputation
In ML, substituting missing data values with the mean of the distribution is a standard recommendation. However, missing data in astronomy can arise due to non-observed parts of the sky, due to technical artifacts (e.g. diffraction spikes), or due to non-detection at particular wavelengths and down to a certain sensitivity (depth). The first two cases are non-informative, and can usually be made explicit in a catalog by the use of a placeholder value such as '-99'. The latter case of non-detections, however, carries information on a per-source basis.
Astronomical catalogs do not always encode this missing information, or they might quote a homogeneous image depth across the catalog, corresponding to the mean depth of the image. This approach is of course common practice due to its simplicity. However, as data become larger, it is increasingly impractical to resort to new source detection for each source of interest. Therefore, it would be beneficial for many analyses to include the measured flux at the location of the source across wavelengths.
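A minimal sketch of the placeholder case, assuming a hypothetical catalog where '-99' marks non-informative missing fluxes; scikit-learn's imputer can be told to treat that value as missing:

```python
# Hypothetical toy catalog: -99.0 is a non-informative placeholder.
# SimpleImputer replaces it with the per-feature median of the observed values.
import numpy as np
from sklearn.impute import SimpleImputer

fluxes = np.array([[1.2,   3.4, -99.0],
                   [0.8, -99.0,   2.1],
                   [1.0,   3.0,   2.5]])

imputer = SimpleImputer(missing_values=-99.0, strategy="median")
filled = imputer.fit_transform(fluxes)
```

For informative non-detections, by contrast, one would rather keep the measured flux (or an upper limit) at the source position than impute a global statistic.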

Normalization
Many algorithms treat larger numerical values as more important, hence it is paramount that this effect is removed before modeling the data. Standard ML practice includes the 'whitening' and 'normalisation' of the data, whereby the mean of the distribution is centered on zero and the standard deviation is scaled to one.
Importantly, this scaling has to be preserved and applied with the same scaling factors to any new data that might be presented to the model at a future instance.
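This persistence of the fitted scaling can be sketched with scikit-learn's StandardScaler; the data here are arbitrary toy values:

```python
# Sketch: fit the scaler ONCE on the original data, then reuse the
# same fitted object (same mean/std factors) on any new data.
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]])
scaler = StandardScaler().fit(train)     # scaling factors learned here
train_scaled = scaler.transform(train)

new = np.array([[12.0, 220.0]])
new_scaled = scaler.transform(new)       # same factors applied, no re-fitting
```

In practice the fitted scaler would be serialised (e.g. with joblib) alongside the trained model, so the two are never separated.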

Dimensional reduction
Often, astrophysical features are correlated, and thus we can create a mapping between the sparse high-dimensional space and a lower-dimensional space that captures the majority of the information included in the data. A review of popular methods can be found in Jia et al. (2022), and details on the methods can be found in Bishop (2006).
The new representation of the data in the lower-dimensional space is called a latent space. Most methods attempt to create a latent-space compression that captures the most informative aspects of the data, which at the same time are usually non-intuitive to interpret. However, since they aim to reveal the information content of the observed data, the latent features are often used as input to other downstream machine-learning methods, in both supervised and unsupervised tasks. In the following, we discuss popular dimensional reduction methods in astronomy, grouped by algorithmic approach, i.e. matrix factorization ( §4.1), manifold learning ( §4.2), and networks ( §4.3).

Singular value decomposition (SVD)

Singular value decomposition (SVD) is the most general factorization of an M × N matrix. Namely, any matrix A can be re-written as:

A = U Σ V^T,    (1)

where U is an M × N column-orthogonal matrix, Σ is an N × N diagonal matrix, and V is another N × N column-orthogonal matrix. The diagonal elements of the Σ matrix are called the singular values (see Press et al., 2007, for a detailed discussion). SVD is used frequently for signal decomposition, with applications in astronomy ranging from the decomposition of spectra of individual sources (e.g., Simon and Sturm, 1994; Piana and Brown, 1998; Hobbs et al., 2006; Amara and Quanz, 2012; Romano and Cornish, 2017) to cosmology and large-scale structure (e.g., Nicholson et al., 2010; Vanderplas and Connolly, 2009; Bennett et al., 2013; Planck Collaboration et al., 2016a). SVD is closely linked to the derivation of principal component analysis (PCA), discussed in the following section ( §4.1.2), as it is computationally more attractive than solving for eigenvalues using the determinant of a matrix.
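The factorization of eq. 1 can be verified numerically in a few lines with NumPy; the matrix here is random toy data:

```python
# Sketch of eq. 1: any M x N matrix A factorises as A = U @ diag(s) @ Vt,
# with column-orthogonal U, V and singular values s on the diagonal.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))                 # M=6 observations, N=3 features

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt             # exact reconstruction
```

Truncating the sum to the largest singular values gives the best low-rank approximation of A, which is the basis of the signal-decomposition applications listed above.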

Principal Component Analysis (PCA)
Principal component analysis (PCA, Hotelling, 1936) is the data transformation that identifies the directions of maximum variance in the data, i.e. the principal components are the eigenvectors of the covariance matrix of the data. It is, therefore, a linear projection of N-dimensional data into a d-dimensional space with d < N. The majority of the information is usually captured in the first few components. The physical interpretation of the principal components requires further domain-specific investigation.
Calculating the eigenvectors of the covariance matrix could be accomplished using the determinant of the matrix, but that would require the matrix to be square. A very educational description of the more computationally favourable application of SVD is given in Press et al. (2007) and Ivezić et al. (2014). Briefly, the first step before searching for the eigenvectors is to remove the mean of the data, as it would otherwise appear as the first principal component. Let our M observations of N features be described as a (mean-subtracted) M × N matrix A. The covariance matrix is:

C = (1 / (M − 1)) A^T A.

Using the SVD factorization of eq. 1, we have:

C = (1 / (M − 1)) V Σ U^T U Σ V^T = (1 / (M − 1)) V Σ^2 V^T.

Therefore, the SVD of A leads to a quick and stable determination of the eigenvectors (the columns of V) and eigenvalues of the covariance matrix, particularly in the regime of a very large number of features. In practice, popular PCA algorithms, including the sklearn implementation, use some version of the SVD approach.

PCA has found an astonishing number of applications in astronomy. A NASA/ADS search at the end of 2023 returns 8.4 thousand refereed papers. Likely the first application of PCA in astronomy has been the classification of stellar spectra by Deeming (1964). Further PCA uses in galactic applications include Storrie-Lombardi et al. (1994), where the authors showed that the first five PCA components of stellar spectra can lead to fast and accurate stellar type classification when used as input to a neural network. Deb and Singh (2009) performed a PCA decomposition of star lightcurves, showing that PCA can provide a fast alternative classification of variable stars, as a first step before more detailed analysis. Krone-Martins and Moitinho (2014b) introduced the framework 'Unsupervised photometric membership assignment in stellar clusters' (UPMASK) to identify stellar cluster members, applied by many authors to GAIA data. PCA is performed on the photometric data as the first step of the analysis, to identify stars with similar composition. Hayes et al. (2020) used PCA to compress a model library with the aim of extracting informative priors to speed up forward modelling of exoplanet spectra. Matchev et al. (2022) used PCA to explore the physical properties of the exoplanet spectra benchmark dataset of Márquez-Neila et al. (2018).
Examples of extragalactic astronomy applications include Connolly et al. (1995), who used PCA to produce a set of informative photometric filter combinations to infer photometric redshifts. Sulentic et al. (2000) pieced together multiwavelength data to inform the interpretation of Eigenvector 1, the first principal component derived from QSO spectra, as shown earlier in Boroson and Green (1992), linked to the Eddington ratio (Marziani et al., 2003). Wild et al. (2014) modeled Bruzual and Charlot (2003) SEDs convolved with photometric filter curves and used the PCA components, dubbed 'super-colours', to classify the SEDs of galaxies into star-forming and passive. Lawlor et al. (2016) used PCA and a number of other methods to map SDSS spectra to lower dimensions and explored the correlation of the projected space with physical properties of galaxies. More recently, Logan and Fotopoulou (2020) used PCA as a means to reduce the input parameter space before performing star/galaxy/QSO classification.
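A minimal PCA sketch on synthetic toy features (two strongly correlated columns plus independent noise, not real astronomical data), using scikit-learn's SVD-based implementation:

```python
# Sketch: PCA on toy correlated features; the explained-variance ratio
# shows how much information the leading components capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + 0.05 * rng.normal(size=200),  # correlated with t
                     rng.normal(size=200)])                # independent noise

pca = PCA(n_components=2).fit(X)   # mean removal is handled internally
Z = pca.transform(X)               # latent (principal-component) representation
explained = pca.explained_variance_ratio_
```

Because the first two columns are nearly collinear, the first component absorbs most of the variance, mirroring the common astronomical situation where a handful of components suffice.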

Kernel PCA
Kernel PCA (Schölkopf et al., 1998) was designed to overcome the limitations of linear projection for complex data structures. In principle, mapping the data to a higher-dimensional parameter space would allow linear decomposition with PCA. However, the new space might need to be of too high a dimension, and the mapping function is not easily known. These intractable computational difficulties are mitigated with an appropriate choice of kernels, which substitute dot products in feature space with kernel functions. It is important to note that, unlike PCA, kernel PCA does not allow the exact reconstruction of the data, but only an approximation.
This method has been applied as a pre-processing step for the photometric classification of Type Ia SNe (Ishida and de Souza, 2013) and the detection of lensed QSO (Agnello et al., 2015), using a Gaussian radial basis function as kernel. Xiang et al. (2017) used kernel PCA with a Gaussian radial basis function on LAMOST stellar spectra to estimate physical parameters of stars. Wang and Bon (2020), Amaya et al. (2020), and Irfan and Bull (2021) have applied kernel PCA to solar image observations, and Papaefthymiou et al. (2022) to the mid-infrared spectral classification of Ultra-Luminous InfraRed Galaxies (ULIRGs) and QSO (Gaussian kernel).
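A toy sketch of kernel PCA with the Gaussian (RBF) kernel used in the works above, here on scikit-learn's synthetic two-circles dataset, which linear PCA cannot untangle:

```python
# Sketch: kernel PCA with an RBF kernel on a non-linearly separable toy
# dataset; the kernel substitutes dot products in the (implicit) feature space.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)   # embedding in kernel principal components
```

The two rings, concentric in the input space, typically become separable along the leading kernel components; the value of `gamma` (the RBF width) is a hyperparameter that would be tuned in a real application.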

Independent Component Analysis (ICA)
Independent Component Analysis (ICA; see Hyvärinen and Oja, 2000, for a review) seeks to separate a signal into statistically independent components, e.g. overlapping sound signals, unlike PCA, which aims to recover a representation that captures the most dominant components in the data. It is a linear transformation, applicable only to non-Gaussian data. If all components are Gaussian, the resulting mixture is also a symmetric Gaussian distribution that does not contain enough information to disentangle the signals. A disadvantage of this method is that the number of components must be defined manually, which can be explored by inspecting the residuals. In addition, the importance of each component must be explored on the basis of domain knowledge and existing understanding of the system.
This method has been used to perform blind source separation, for example on mid-infrared Spitzer maps (Meidt et al., 2014) and Cosmic Microwave Background maps (Planck Collaboration et al., 2014). Meidt et al. (2014) used ICA to isolate the light of old stellar populations to derive accurate mass-to-light ratio conversions in the Spitzer Survey of Stellar Structure in Galaxies (S4G, Sheth et al., 2010; Planck Collaboration et al., 2016b). Cardoso et al. (2008) extended the ICA approach to the multi-spectral analysis of the Planck data (Planck Collaboration et al., 2014).
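The blind-source-separation idea can be sketched with FastICA on two synthetic non-Gaussian signals mixed by an "unknown" matrix (a toy analogue of the map-separation applications above):

```python
# Sketch: FastICA recovering two mixed non-Gaussian sources.
# Recovery is only up to permutation, sign, and scale.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # smooth source
s2 = np.sign(np.cos(3 * t))                 # square-wave source (non-Gaussian)
S = np.column_stack([s1, s2])

A_mix = np.array([[1.0, 0.5], [0.4, 1.0]])  # mixing matrix (unknown in practice)
X = S @ A_mix.T                             # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                # estimated independent sources
```

If either source were Gaussian, the separation would fail, which is the limitation noted in the text.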

Non-negative matrix factorization (NMF)
Non-negative matrix factorization (NMF, Lee and Seung, 1999) was developed as a method able to identify parts of objects, such as parts of a face. The NMF mixing matrix is constrained to be strictly positive, contrary to PCA, which learns a holistic representation of the data and allows subtraction of components to reconstruct an object.
The first application in astronomy was by Blanton and Roweis (2007), to derive k-corrections for galaxy spectra. They found that a basis of five components can be used to reconstruct the galaxy synthetic spectral library of Bruzual and Charlot (2003). Soon after, Allen et al. (2011) used NMF to create a basis of components that captures the information in QSO spectra. They created a new way to measure the 'BALnicity' index of a QSO by projecting broad absorption line QSO (BAL-QSO) spectra onto the same basis. Koljonen (2015) found that NMF performed better than PCA and ICA in decomposing the X-ray spectrum of the X-ray binary GX 339-4. They find that five components provide a good description of the data, and linked the components to physical properties of the system.
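The additive, parts-based character of NMF can be sketched on a synthetic non-negative "spectra" matrix (mock data, not any of the datasets above):

```python
# Sketch: NMF factorises a non-negative matrix X into W @ H with
# W, H >= 0, so components can only add, never subtract.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
basis = np.abs(rng.normal(size=(3, 50)))     # 3 hidden non-negative components
weights = np.abs(rng.normal(size=(40, 3)))   # mixing weights for 40 mock spectra
X = weights @ basis                          # strictly non-negative data

nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                     # per-spectrum weights
H = nmf.components_                          # recovered non-negative basis
```

Because the mock data are exactly rank three, the three-component factorisation reconstructs them almost perfectly; on real spectra the number of components is chosen by inspecting the residuals, as in the works cited above.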

Non-linear manifold learning
Manifold learning methods aim to capture the neighbourhood relationships that exist in the input high-dimensional space and create a mapping to a lower-dimensional space that preserves these relationships, allowing one to visualise or even perform clustering on the data. In this section, we discuss four such methods that are better suited for data exploration and visualisation. All the methods below claim to tackle 'short-circuit' effects, whereby a manifold with many folds, like e.g. the swiss-roll dataset, can lead to jumps from one fold to the other. See Meilă and Zhang (2024) for a review on manifolds.

Distance & Divergence measures
The distance measures most commonly used in manifold learning are the Euclidean and geodesic distances. The well-known Euclidean distance gives the length of the straight line connecting two points x and y:

d(x, y) = √( Σ_i (x_i − y_i)^2 ),

while the geodesic distance is the shortest distance between two points along the manifold, which differs from the Euclidean distance in a non-flat space.
Other useful measures include the Mahalanobis distance, which gives the distance between a point and a distribution; the Wasserstein distance, which measures the 'work' needed to transform one distribution into another; the Jensen-Shannon divergence, which measures the similarity between two probability distributions; the Kullback-Leibler (KL) divergence, which measures the relative information content between two distributions; and Procrustes alignment, which is a comparison measure for shapes.
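Several of these measures are available off the shelf in SciPy; a toy sketch with hand-picked values (the covariance and distributions are illustrative assumptions):

```python
# Sketch: a few of the distance/divergence measures listed above, via SciPy.
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis
from scipy.stats import wasserstein_distance, entropy

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
d_euc = euclidean(x, y)                      # straight-line (Euclidean) distance

cov_inv = np.linalg.inv(np.array([[1.0, 0.0],
                                  [0.0, 4.0]]))   # toy inverse covariance
d_mah = mahalanobis(x, y, cov_inv)           # distance in units of the spread

d_was = wasserstein_distance([0.0, 1.0], [1.0, 2.0])  # 'earth mover' work

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
d_kl = entropy(p, q)                         # KL divergence of p from q (nats)
```

Note that `mahalanobis` expects the *inverse* covariance matrix, and that the KL divergence is asymmetric: `entropy(p, q)` and `entropy(q, p)` generally differ.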

Isometric feature mapping (Isomap)
Isomap (Tenenbaum et al., 2000) seeks to preserve the global geometry of the input space, using geodesic distances. The algorithm operates in three steps: 1) find neighbours within a certain distance (for computational efficiency), 2) create a graph and find the shortest path among all pairs, 3) create a lower-dimensional embedding for all points. The dimensionality of the data can be found by assessing the reconstruction error as a function of the number of projected dimensions. Generally speaking, Isomap can embed very complex manifolds, but can suffer from high computing time. A weakness of Isomap can be that the embedded points are too close to each other, and often organised in elongated clusters, making it non-trivial to apply clustering methods to the data in the new space.
Isomap has found some applications in astronomy. Bu et al. (2014) used Isomap to classify SDSS DR9 (Ahn et al., 2012) stellar spectra. They show that PCA and Isomap lead to different stellar subclasses. Sasdelli et al. (2016) explored a wide variety of unsupervised learning methods to automatically classify Type Ia supernovae. They used Isomap to project to two dimensions the 4d latent space learned by an autoencoder. More recently, Matchev et al. (2022) explored dimensionality reduction techniques as a means to find an embedding that would isolate the physical properties of exoplanet transmission spectra. Even though the Isomap components capture physical properties such as molecule abundance and cloud opacity, they conclude that PCA is preferred as a more interpretable embedding.
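The three Isomap steps can be sketched on the swiss-roll dataset mentioned earlier; scikit-learn's implementation builds the neighbour graph, computes shortest paths, and embeds in one call:

```python
# Sketch: Isomap on the swiss roll (neighbour graph -> shortest paths ->
# embedding); the reconstruction error is a diagnostic for choosing the
# number of output dimensions.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=800, random_state=0)

iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)            # geodesic-distance-preserving embedding
err = iso.reconstruction_error()    # compare across n_components to pick dims
```

Too small an `n_neighbors` disconnects the graph, while too large a value reintroduces the 'short-circuit' jumps across folds that the method is meant to avoid.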

Locally Linear Embedding (LLE)
Locally Linear Embedding (LLE, Roweis and Saul, 2000) was designed to overcome the limitations of general linear projection methods (see Section §4.1) by creating local linear decompositions. The steps of the algorithm are largely similar to Isomap, with the main difference in the second step. Namely, after finding the nearest neighbours of each point, instead of calculating the geodesic distances between the points as done in Isomap, LLE reconstructs the point in question using a weighted sum of its neighbours. The fact that the algorithm operates on small scales on the manifold provides the flexibility to embed complex data. Contrary to PCA, LLE does not allow for the projection of new data once the manifold has been learned; however, it can be a powerful tool for data exploration, including the identification of outliers. Vanderplas and Connolly (2009) presented the first application of LLE in astronomy, embedding the SDSS DR7 spectra (Abazajian et al., 2009) into two dimensions. They also presented a workaround for reusing the learned weight matrices of the neighbours in order to project new spectra. For the interested reader, Ghojogh and Sharma (2022) give a detailed overview of the LLE method and its many variants.
LLE applications in astronomy include the embedding of stellar spectra (Vanderplas and Connolly, 2009; Daniel et al., 2011; Bu et al., 2013) and lightcurves of eclipsing binary stars (Matijevič et al., 2012; Kirk et al., 2016; Bódi and Hajdu, 2021). Daniel et al. (2011) showed that more than two dimensions are necessary to embed SDSS spectra, and in particular that most of the stellar spectra fall on a sequence in a 3d space (see their Figure 4). Bu et al. (2013) performed a comparison of LLE against PCA on M-type stellar spectra and found that LLE performs better, apart from the high-noise regime.
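For comparison with the Isomap sketch, LLE can be run on the same swiss-roll toy data; here each point is rebuilt as a weighted sum of its neighbours before the global embedding is solved:

```python
# Sketch: Locally Linear Embedding of the swiss roll into two dimensions.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=800, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                             random_state=0)
Z = lle.fit_transform(X)   # locally-linear, globally non-linear embedding
```

As with Isomap, the neighbourhood size is the key hyperparameter: it sets the scale below which the manifold is assumed to be approximately flat.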

t-Distributed Stochastic Neighbor Embedding (tSNE)
t-Distributed Stochastic Neighbor Embedding (t-SNE, van der Maaten and Hinton, 2008) is another data visualisation method, developed to operate on non-linear manifolds. In particular, t-SNE models the likelihood of the data in the high dimensional space using a Gaussian distribution, mapped to a Student-t distribution in the lower dimensional space. The long tails of the t-distribution solve the problem of overcrowding present in other methods. Due to the flexible representation of the data, t-SNE can learn more than one manifold, if more than one exist in the data. However, severe weaknesses of t-SNE include: 1) over-segmentation of the data in the lower dimensional space, i.e. creating more clusters than necessary, 2) inability to project new data onto an existing map, 3) a non-deterministic mapping, and 4) inability to map data to more than three dimensions, particularly problematic if the intrinsic dimension of the data is higher than that. Hence, the algorithm has been recommended since its inception to be used as a data visualisation tool. However, parametric t-SNE (van der Maaten, 2009) aims to tackle some of these issues.
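A minimal sketch with scikit-learn's TSNE illustrates the typical usage; the digits dataset and the perplexity value are illustrative assumptions only.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 points, 64 features

# Perplexity sets the effective neighbourhood size; the embedding is
# stochastic, so fixing random_state only makes a single run repeatable.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```

Consistent with the weaknesses listed above, scikit-learn's TSNE offers no `transform()` for new data, and its default Barnes-Hut approximation restricts the embedding to at most three dimensions.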
Even though the weaknesses described above are clearly stated in the paper of van der Maaten and Hinton (2008), the astronomy community has made several attempts to use t-SNE as a means of revealing clustering of data in the lower dimensional manifold. Often, the projection is arbitrarily chosen to be two dimensional, which is not ideal for astronomical data. In addition, all t-SNE projections mentioned below that include a large number of objects suffer from the increased complexity of the projected map, making it impossible to group objects on the learned embedding. Kinson et al. (2021) used a probabilistic random forest to identify Young Stellar Objects (YSOs). In their Figure 13, they show clearly that the minority class they are interested in cannot be blindly identified in the t-SNE projection. Čotar et al. (2019) used supervised and unsupervised methods to identify carbon-enhanced metal-poor (CEMP) candidate stars. The authors classified more than 600 thousand spectra from the GALactic Archaeology with HERMES pilot survey (GALAH, Duong et al., 2018). Their Figure 2 shows the clear limitations of trying to group similar objects together. Apart from a prominent group of CEMP stars identified manually, and with a supervised method, other CEMP stars are found spread over the t-SNE projection, and similarity to their neighbours is not evident. Steinhardt et al. (2020) attempted to use t-SNE to identify quiescent galaxies. However, the map produced cannot be generalised. As the authors note, t-SNE needs to be recomputed every time new data need to be mapped. They chose to select a sample of galaxies within very narrow redshift ranges (0.9 < z < 1.1, and 1.9 < z < 2.1). Their test sample is drawn from the same redshift distribution. However, if the map must be recomputed every time, and there exists a training sample, the approach is no different to k-nearest neighbours, since t-SNE finds objects with similar colours using a Euclidean distance. More recently, Youakim et al. (2023) used a known sample of ω Cen stars and a t-SNE projection of chemical abundances and stellar kinematics to identify members of the stellar stream. They defined a convex hull region on the t-SNE map based on known ω Cen stars and performed a 100-fold bootstrap to identify the stars that fall within the pre-defined region, to combat the intrinsic stochasticity of t-SNE.
When few data are examined with a variety of classes present, t-SNE seems to perform well in grouping similar objects together. For example, George et al. (2017) created a 3d t-SNE map of the learned features of a convolutional neural network applied on the GravitySpy benchmark data. The modest size dataset (about 8,500 elements) shows very well separated clusters. By contrast, t-SNE not only requires significant computing resources, but it also struggles to form distinct clusters as the number of points increases (roughly > 10^5). In such cases, a t-SNE projection can be used as a data mining tool, for example to search for outliers (Giles and Walkowicz, 2019; Webb et al., 2020b) or to visualise the learned features of other methods (Khan et al., 2019; Chen et al., 2020a).
From astronomical works available in the literature using t-SNE, we can conclude a few general trends: 1) 3d projections seem to separate the data better than 2d projections, 2) t-SNE cannot identify minority classes well, 3) 2d maps can be used in conjunction with domain expertise to explore the map for anomalies, and 4) the shape of the embedded map is usually too complex for automatic clusterers (see Section §5). Often, k-nearest neighbours can be used as an alternative if the aim is to search for similar objects based on a training set rather than searching for neighbours in the t-SNE projection.

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP, McInnes et al., 2018) is a more recent development in manifold learning. UMAP aims to better preserve the global structure of the data compared to t-SNE, with better scalability to larger datasets, and no restriction on the number of dimensions. UMAP is based on the underlying assumptions that 1) the data are uniformly distributed on the manifold, and 2) the underlying manifold is locally connected, i.e. there are no holes. As long as these conditions are satisfied locally, UMAP can create a directed graph modelling the k-nearest neighbours.
Due to the stochastic approach to nearest neighbour search and gradient descent, UMAP, like t-SNE, does not create a deterministic mapping. This is a significant drawback when creating an embedding with a subsample of the data in order to perform clustering in the lower dimensional space. Figure 7 of McInnes et al. (2018) shows the alignment of various projected subsamples. Even though UMAP is more stable than t-SNE, a number of clusters are misaligned; hence it is not recommended to use any of the previous methods for classification, but rather only for data exploration and visualisation.
Astronomy works using UMAP span fields from molecules (Lee et al., 2021), to stars (Sanders and Matsunaga, 2023) and galaxies (Vega-Ferrero et al., 2023), and across wavelengths. Storey-Fisher et al. (2021) searched for anomalies using a generative adversarial network. They used UMAP as a visualisation tool, and they show that their detected anomalies tend to cluster together. Clarke et al. (2020) used Random Forest to classify 111 million sources from SDSS. They showed that UMAP creates satisfactory clustering of spectroscopic data. However, we need to caution against assuming that the same trend generalises to photometric data (C. Logan, private communication), as seemingly well separated classes gradually overlap and new, previously unseen clusters are created ad hoc. The same phenomenon is also demonstrated in Fig. 21 of Clarke et al. (2020). Chen et al. (2022) used UMAP to project 13 input features of about 600 fast radio bursts (FRB). UMAP creates nine very distinct clusters. This is similar to the phenomenon observed with t-SNE, namely that the learned embedding can be informative when examining smaller datasets, while a large dataset (> 10^5) will be mapped onto a continuous distribution, as seen e.g. in the case of galaxies (Clarke et al., 2020; Storey-Fisher et al., 2021; Slijepcevic et al., 2023). This effect is seen in the UMAP representation of repeating FRB 20201124A in Chen et al. (2023). The authors expanded on their previous work, this time including 1,745 FRBs. They find that the UMAP projection starts to become crowded. As has been noted with t-SNE, a 3d map would perhaps have been more informative. For example, Ricketts et al. (2023) used a 3d UMAP projection, among other methods, to show the rich behaviour of lightcurve segments observed in the X-ray binary GRS1915.

Networks
Even though networks technically fall under the non-linear manifold learning category, they deserve a dedicated discussion due to their different approach to modelling the manifold and their popularity.

Self-organising maps (SOM)
Self-organising maps (SOM, Kohonen, 1982), or Kohonen networks, are an iterative dimensionality reduction method based on competitive learning. To begin with, the geometry of the final map is chosen, e.g. an N × M rectangle. Each cell on the map corresponds to a neuron with a weight vector initialized randomly or otherwise (e.g. with PCA). Each data point is assigned to its most similar neuron, hitherto known as the 'best matching unit' (BMU). At the next step, the weight vectors of the BMU and its neighbours are adjusted towards the assigned input data. The assignment between data and neurons is achieved through a distance metric, usually Euclidean.
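The competitive-learning loop described above can be sketched in a few lines of numpy; the grid size, decay schedules, and random data below are illustrative choices, and dedicated packages are normally used in practice.

```python
import numpy as np

def train_som(X, n_rows=8, n_cols=8, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal rectangular SOM trained with sequential updates."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_rows, n_cols, X.shape[1]))
    # Grid coordinates of each neuron, used by the neighbourhood function.
    grid = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Best matching unit: the neuron closest to the data point.
        d = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Learning rate and neighbourhood radius decay over time.
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Pull the BMU and its map neighbours towards the data point.
        dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights

X = np.random.default_rng(1).normal(size=(500, 3))
W = train_som(X)
print(W.shape)  # (8, 8, 3)
```

After training, each source can be mapped to its BMU cell, turning the 8 × 8 grid into a 2d summary of the dataset.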
To this day, SOMs continue to be very popular in astronomy and have found applications across all areas of galactic and extragalactic astronomy; only a few can be highlighted here. Early attempts to demonstrate the applicability of SOMs on astronomical data include Hernandez-Pajares and Floris (1994), who grouped the Hipparcos catalog of stars, along with synthetic data, to identify stellar components in the Milky Way (disc, halo), and Maehoenen and Hakala (1995), who used a SOM to perform point-source detection, providing simulated point-sources as input image stamps. More recently, SOMs have been applied to radio images (Galvin et al., 2020; Mostert et al., 2021; Gupta et al., 2022).
Photometric redshift estimation is the determination of galaxy redshifts based on their observed colours (see Salvato et al., 2019, for a review). Redshift estimation with SOMs relies on the fact that galaxies with similar colours are expected to be at similar redshifts. Hence, a SOM can be used to infer photometric redshifts of a galaxy population with a restricted spectroscopic sample (Geach, 2012; Way and Klose, 2012; Carrasco Kind and Brunner, 2014; Masters et al., 2015; Speagle and Eisenstein, 2017; Süveges et al., 2017; Wright et al., 2020; Stölzner et al., 2023). Even though labels are needed to assign redshifts, this application is more akin to label propagation, as the labels themselves are not part of the training. In a similar fashion, SOMs have recently been used to estimate physical parameters of galaxies (Hemmati et al., 2019; Davidzon et al., 2022).
Finally, Rajaniemi and Mähönen (2002) used a SOM to investigate the three Gamma-Ray Burst (GRB) classes reported in Mukherjee et al. (1998) (discussed in detail in Section §5), and find no significant evidence for the existence of the third class (see also Hakkila et al., 2003, for a discussion on sample incompleteness).

[Figure 4: Autoencoder network, figure adopted from Kramer (1991).]

Auto-encoders (AE)
Neural-networks have been among the first ML methods to be embraced in astronomy (e.g., Adorf and Meurs, 1988; Odewahn et al., 1992; Lahav et al., 1995; Bertin and Arnouts, 1996). Auto-encoders (AE, Kramer, 1991, 1992), as their unsupervised counterpart, are receiving heightened attention, as they are able to benefit from convolutions, useful for detecting features in images, while at the same time creating a reduced, informative latent space. They comprise two components: the encoder, a series of layers with a progressively decreasing number of neurons, and the decoder, which starts from the narrowest part of the encoder and progressively expands to layers with more neurons. The decoder can be a mirrored architecture of the encoder. The narrowest part of the network is the latent space, also known as the bottleneck. AEs receive as input the data themselves, which also form the target. Therefore, using the same training strategy as supervised neural networks, the weights and biases of the model are trained through backpropagation with the goal of replicating the input data while minimising the reconstruction error. This means that the latent space carries enough information to be interpreted as a compressed view of the input.
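As a minimal stand-in (real applications typically use convolutional AEs built in a deep learning framework), the input-equals-target training scheme can be emulated with scikit-learn's MLPRegressor; the layer sizes and the digits dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

X, _ = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # 64 input features scaled to [0, 1]

# Encoder 64 -> 32 -> 2 (bottleneck), mirrored decoder 2 -> 32 -> 64.
# Input and target are the same: the network learns to reconstruct X.
ae = MLPRegressor(hidden_layer_sizes=(32, 2, 32), activation="relu",
                  max_iter=500, random_state=0)
ae.fit(X, X)

def encode(X, model, bottleneck_layer=2):
    """Run the encoder half manually to recover the latent space."""
    a = X
    for W, b in zip(model.coefs_[:bottleneck_layer],
                    model.intercepts_[:bottleneck_layer]):
        a = np.maximum(a @ W + b, 0.0)  # relu, matching the model activation
    return a

Z = encode(X, ae)
print(Z.shape)  # (1797, 2): one 2d latent vector per input image
```

The 2d latent vectors in `Z` are the compressed view of the input mentioned above, and can be visualised or clustered directly.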
Recently, AEs have received a lot of interest, with further complex architectures being developed, including denoising auto-encoders (DAE, Vincent et al., 2008). The flexibility of AEs has inspired many astronomical applications. Some examples include morphological classification (Ma et al., 2019; Chang et al., 2021; Spindler et al., 2021; Tohill et al., 2023), galaxy lens detection (Cheng et al., 2020), physical parameter estimation (Frontera-Pons et al., 2017), lightcurve classification (Tsang and Schultz, 2019), and outlier detection (Liang et al., 2023; Han et al., 2022), among others. Lahav et al. (1996) used 13 catalogued properties of the ESO-LV catalogue to identify galaxy classes. The input features included colours, parametric (de Vaucouleurs red and blue exponents) and non-parametric (light-ratios, asymmetry, surface brightness, etc.) quantities. They show that the 'encoder' network captures more information than linear projection with PCA. In the radio domain, example AE applications include Ma et al. (2019). Regarding physical parameter estimation, Frontera-Pons et al. (2017) showed that the 2d latent space of a denoising AE applied on galaxies with redshifts 0.1 < z < 1 shows a clear distinction between the star-forming blue cloud and the passive red sequence of galaxies. Tsang and Schultz (2019) presented an architecture that extracts features, classifies, and performs anomaly detection on star lightcurves.
AEs have also found application in the gravitational wave field. In particular, Sakai et al. (2022) used a VAE to extract features from the 2d time-frequency spectrograms provided by the Gravity Spy dataset (Bahaadini et al., 2018), which are later clustered with other methods to identify groups of similar signals. On the other hand, Moreno et al. (2022) used AEs to learn noise data and identify true signals as outliers. Finally, Shen et al. (2017) used a DAE as part of a larger ML framework to denoise gravitational wave signals, while Yang et al. (2023) used the Noise2Noise approach (Lehtinen et al., 2018) to suppress instrumental noise in gravitational wave data.

Clustering
Once the observations have been distilled into a space that carries the majority of the information present in the data, the next step is the ranking of their similarity, leading to the grouping of similar objects together. Jain et al. (1999) provide a review of core concepts and methods for ranking and clustering data. Their Figure 7 shows the taxonomy of the various methods commonly used. Detailed discussions of data clustering methods can be found in Aggarwal and Reddy (2013).
In the following, we discuss some commonly used algorithms grouped by cluster identification philosophy: centroid based partitioning, hierarchical clustering, and density based clustering.

Dissimilarities
We first define the notion of dissimilarity. Much like the distance measure introduced for learning manifolds (see Section §4), a pairwise dissimilarity measure is needed to assess whether two points should belong to the same cluster. A common choice is none other than the squared Euclidean distance between two points $x_i$, $x_j$ (Hastie et al., 2009):

$d(x_i, x_j) = \lVert x_i - x_j \rVert^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$ (5)

where the sum runs over the $p$ features. The extension of the pairwise dissimilarity to a group of points is introduced through the linkage, which can be complete, single, average, or centroid (e.g. see James et al., 2014). They each correspond to a property of the pairwise dissimilarities of two clusters: complete linkage is the maximum pairwise dissimilarity between two clusters, single linkage is the minimum dissimilarity, average linkage is the average pairwise dissimilarity, and centroid linkage is the dissimilarity of the centroids.

Centroid Based clustering
Centroid or partition based algorithms (k-means and its variants) start from a predefined number of clusters to be found within the data. The data are assigned to the k clusters through an iterative process which progressively minimizes the total sum of the distances between the data points and the center of their respective cluster.

k-means
K-means (MacQueen et al., 1967; Lloyd, 1982), much like PCA, has been used in astronomy on numerous occasions. A heuristic partitioning of the data with this method is quick and easy to implement: we assign n data points to k clusters by minimizing the intra-cluster sum of squares with an iterative procedure. First, we pick k random locations that act as the first guess of the cluster centers. Next, we calculate the Euclidean distance of each point from each cluster center and assign each point to its closest center. We then recalculate the cluster means from the members, and repeat until the cluster centers no longer change. This algorithm works best when the clusters are well separated. The number of clusters can be deduced heuristically with the elbow method, i.e. by monitoring the sum of the distances as a function of the number of clusters.
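The iterative procedure just described can be written out in a few lines of numpy; the two synthetic, well separated blobs below are an illustrative stand-in for real features, and no safeguard against empty clusters is included.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) pick k random data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2) assign each point to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # 3) recompute each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
labels, centers = kmeans(X, k=2)
```

Running the same loop for increasing k and recording the total within-cluster sum of squares yields the elbow curve mentioned above.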
Due to the simplicity of the method, k-means has been applied to a variety of astrophysical applications, including stellar spectra (Sánchez Almeida and Allende Prieto, 2013; Garcia-Dias et al., 2018, 2019), the debate on the number of GRB classes (Hakkila et al., 2003), clustering of X-ray spectra (Hojnacki et al., 2008), optical galaxy spectra (Sánchez Almeida et al., 2010), and so on. In particular, Sánchez Almeida et al. (2010) looked for k clusters in 900k SDSS spectra and highlight the importance of data normalisation. They note that if the data are not scaled to a common flux level, in their case the g-band, the classifier is driven by the flux of the source. They find that k = 17 classes contain 99% of the galaxies. In follow up work, Sánchez Almeida and Allende Prieto (2013) applied k-means clustering to the SEGUE star sample, obtained also with SDSS (Yanny et al., 2009), finding 16 classes in the data. Garcia-Dias et al. (2018) applied k-means with k = 50 on the APOGEE SDSS stellar spectra, and later merged the classes manually into nine groups, driven by measured physical properties (temperature, surface gravity, and chemical abundance).

k-medoids & c-means
Among the most rigid assumptions of k-means are the fact that the centre of the cluster can be an arbitrary point, and the fact that the membership assignment is strict. The k-medoids and fuzzy c-means algorithms attempt to relax these conditions. These improvements have motivated applications in a wide variety of astronomical areas.
The k-medoids (Kaufman and Rousseeuw, 1990) algorithm was an update to the k-means algorithm, which inspired many more optimisations (e.g. Ng and Han, 2002; Park and Jun, 2009). Similarly to k-means, the expected number of clusters has to be determined by the user. However, contrary to k-means, the center of the cluster is one of the data points, and the distance of a data point to the center of the cluster can be any arbitrary similarity measure, not necessarily the Euclidean distance. Applications include globular cluster membership allocation (Pasquato and Chung, 2019), clustering of the latent space representations of galaxy morphologies to improve the Hubble sequence in a data-driven way (Cheng et al., 2021), and clustering of eclipsing binary light curves (Modak et al., 2018).
Fuzzy c-means (Bezdek et al., 1984) is a soft classifier. Contrary to k-means and k-medoids, where a source can belong to only one cluster, c-means allows a fuzziness which controls the degree to which a source is permitted to belong to more than one cluster. This classification can be seen as a probabilistic assignment, a desirable trait for astronomical data. Barra et al. (2008) and Benvenuto et al. (2018) used fuzzy c-means on solar imaging and solar flare classification respectively, while Jamal et al. (2018) applied the same method to assess the reliability of spectroscopic redshifts.

Hierarchical clustering
Hierarchical clustering methods do not use a predefined number of clusters.Instead, they use a dissimilarity measure and a stopping criterion to aggregate, or break up the data into clusters.
Agglomerative clustering (AL Johnson, 1967) is a 'bottom-up' clustering approach. Initially, each point is considered its own cluster. A dissimilarity measure is employed to merge clusters that are located close by, creating a dendrogram structure. There is an extensive list of linkage criteria that has been used in the literature. Divisive clustering is the inverse procedure: a 'top-down' approach, where the entire dataset is considered one large cluster, progressively split into sub-clusters based on a dissimilarity criterion. A dendrogram is created, starting from the objects that have the largest dissimilarity.
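The bottom-up procedure can be sketched with scipy's hierarchical clustering routines; the three synthetic blobs and the choice of average linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.normal((0, 4), 0.3, (50, 2))])

# Build the dendrogram bottom-up, merging the least dissimilar clusters
# first, using the average linkage defined above.
Z = linkage(X, method="average")

# Cut the dendrogram into three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels)))  # 3
```

The linkage matrix `Z` encodes the full merge history, so the same fit can be cut at any level without recomputation.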
To name a few examples, hierarchical clustering has been applied on galaxy morphologies (Hocking et al., 2018;Martin et al., 2020;Dai et al., 2023), as part of a hybrid neural network for point-source identification (Andreon et al., 2000), stellar type classification (Garcia-Dias et al., 2019), and fast radio transients (Aggarwal et al., 2021).Hojnacki et al. (2008) used agglomerative clustering as part of their clustering framework to identify the number of classes within the data, which were later retrieved with k-means.

Minimum Spanning Tree (MST) or Friends of Friends (FoF)
The Minimum Spanning Tree (MST, see Graham and Hell, 1985, for a historical review) is better known as the Friends-of-friends algorithm in astronomy, used to measure the clustering of galaxies in 2d and 3d space (Press and Davis, 1982; Einasto et al., 1984), which is outside the scope of this review. A MST is an acyclic graph that provides the shortest path between two points, hence it can be interpreted as the geodesic. Examples of MST applications relevant to source classification include refining the cluster membership first found through k-means (Cantat-Gaudin et al., 2019), and similarity search of supernova lightcurves (de Souza et al., 2023). MST is used as part of Isomap (see Section §) and (H)DBSCAN (see Section §5.6).
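A minimal sketch of the friends-of-friends idea follows: build the MST with scipy, cut edges longer than a linking length, and read off the connected components as groups. The two synthetic point groups and the linking length of 1.0 are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# two spatial groups of "sources"
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])

# Build the MST from the pairwise Euclidean distance matrix.
D = csr_matrix(squareform(pdist(X)))
mst = minimum_spanning_tree(D)

# Friends-of-friends: cut MST edges longer than the linking length;
# the remaining connected components are the groups.
mst_cut = mst.copy()
mst_cut.data[mst_cut.data > 1.0] = 0.0
mst_cut.eliminate_zeros()
n_groups, labels = connected_components(mst_cut, directed=False)
print(n_groups)
```

Cutting MST edges above a threshold gives the same components as linking all pairs closer than that threshold (the single-linkage equivalence), which is why the MST is a compact representation of friends-of-friends groups.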

Probabilistic Clustering
Probabilistic clustering methods aim to model the observed data distribution as a random variable drawn from an underlying multivariate distribution (Aggarwal and Reddy, 2013, p. 61-86). A very commonly used method is Gaussian Mixture Models (GMM, Dempster et al., 1977). As the name implies, GMMs aim to approximate the underlying distribution that generated the observed sample as a mixture of K Gaussians. Thus, the clustering problem is transformed into an optimisation problem under which we are trying to find the optimal number of Gaussians, and the means and variances that will best generate the observed data. A typical method to solve this optimisation problem is the iterative Expectation Maximization (EM, Dempster et al., 1977; McLachlan, 2015) algorithm. The choice of the optimal model parameters can be made with information criteria, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), which penalize models with a very high number of free parameters. Meingast et al. (2017) introduced Pnicer, a GMM to model the line of sight extinction of the interstellar medium. GMMs have been used frequently to model the latent space of VAEs, for example by Iwasaki et al. (2019) to model X-ray Chandra data of Tycho's supernova, by Cheng et al. (2020) to model optical images with strong gravitational lenses, and by Karmakar et al. (2018) to detect stellar clusters in images. We will revisit this type of application in Section 6.1.
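The BIC-based model selection described above can be sketched with scikit-learn's GaussianMixture (which fits the mixture with EM); the two synthetic Gaussian blobs and the range of K are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

# Fit GMMs with 1..5 components via EM and keep the BIC of each model;
# the BIC penalises models with many free parameters.
bics = []
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)
```

The fitted object exposes the learned means and covariances (`means_`, `covariances_`), and `predict_proba` gives the soft, probabilistic cluster memberships.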
GMMs have also been used to model galaxies, for example in terms of kinematics (e.g., Ortega-Martinez et al., 2022; Du et al., 2019, 2020) and as a population (Fraser et al., 2023). In the latter, the authors modeled the IllustrisTNG-100 simulation as if it was an observed population. They used a GMM to extract three clusters of galaxies based on their broad band photometry.

Density Based Clustering
Density-based Spatial Clustering of Applications with Noise (DBSCAN, Ester et al., 1996), and its hierarchical extension (HDBSCAN, Campello et al., 2013, 2015), approach the presence of clusters within data from the perspective of a minimum number of objects within a given radius. This definition is applicable to any number of dimensions, but it does suffer from the curse of dimensionality, which naturally makes the data sparse. However, since the input features are often correlated, a workaround is to apply these methods following a first dimensionality reduction of the feature space to about five dimensions (e.g., Logan and Fotopoulou, 2020). Ester et al. (1996) used the concept of 'core points' and 'border points' to track the density of objects within a dataset. As the minimum number of neighbours must be predefined, we must take into account that border points will naturally have fewer neighbours. They define 'clusters' as collections of points that are density-reachable, and 'noise' as points that do not belong to any cluster. By tracking the connectivity of the points, DBSCAN is able to identify non-convex clusters, which is not the case for centroid algorithms. This is particularly useful in astronomy for spatial applications such as non-parametric galaxy morphology detection (Tramacere et al., 2016) and stellar streams (e.g., Rudick et al., 2009; Kounkel and Covey, 2019).
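A minimal sketch with scikit-learn's DBSCAN illustrates the core-point definition in practice; the two dense synthetic groups, the uniform background, and the `eps` and `min_samples` values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense groups plus sparse uniform background "noise"
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(4, 0.2, (100, 2)),
               rng.uniform(-2, 6, (20, 2))])

# eps is the neighbourhood radius, min_samples the density threshold;
# points that are not density-reachable receive the label -1 (noise).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```

Note that, unlike k-means, the number of clusters is not an input here; it emerges from the density parameters. Recent scikit-learn releases also ship an HDBSCAN estimator with a similar interface.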
HDBSCAN, as the hierarchical extension of DBSCAN, has also found wide usage in recent applications. HDBSCAN uses minimum spanning trees, which allow the discovery of clusters with varying density, contrary to DBSCAN, which sets the density as a constant across the entire dataset. The implementation of McInnes et al. (2017) also provides the detection of outliers in the data, and prediction support.
Applications of HDBSCAN on star clusters and stellar streams dominate the astronomical literature (see Helmi, 2020, for a review of GAIA results). Kounkel and Covey (2019) used five dimensional input (galactic coordinates, parallax, and proper motions) to search for stellar streams in the GAIA DR2 data, expanding previous work of Cantat-Gaudin et al. (2018) to much larger scales. However, they find that the recovered clusters depend on the algorithm configuration, due to the fact that stellar clusters that are further away will naturally have a smaller extent on the sky, and smaller parallaxes. Their final catalog contains 1,901 individual clusters of about 288k stars, a very small fraction of the GAIA DR2 catalog. Further works searching for stellar clusters in GAIA DR3 have continued to provide a refined view of the Milky Way local neighbourhood (e.g. Moranta et al., 2022; Gagné et al., 2023). Logan and Fotopoulou (2020) searched for the optimal configuration to split the 100 million source KiDS dataset into stars, galaxies, and quasars. As mentioned earlier, they found that a dimensionality reduction to three to five dimensions was a necessary pre-processing step to be able to apply HDBSCAN successfully in photometric colour space. Webb et al. (2020a, 2021) presented ASTRONOMALY, a framework for transient discovery and lightcurve classification. The main engine of their system is HDBSCAN clustering combined with Isolation Forest (IF, Liu et al., 2008). Aggarwal et al. (2021) tested a number of unsupervised methods for the classification of single pulse radio transients. They concluded that either DBSCAN or HDBSCAN can be used for their application, with a preference for DBSCAN.

Modern approaches to learning
Up to now, we have discussed commonly used approaches to unsupervised learning, namely dimensional reduction and clustering.Modern takes on learning, increasingly lean on model ensembles and frameworks that combine ML algorithms in sophisticated ways.The sharp distinction between supervised and unsupervised learning is becoming more blurred by approaches including self-supervised, semi-supervised, as well as transfer learning and domain adaptation.Relevant reviews on these topics are listed in Table 1.In the following, we discuss some of the approaches found in the astronomy literature, nonetheless this is a very active field with interesting methods appearing regularly in the literature that could find applications to astronomy soon (e.g., Berthelot et al., 2019;Yoon et al., 2020).

Modelling the bottleneck
Due to the compact representation of the data offered by AEs, several works have recently attempted to create a combination of architectures to model their bottleneck. These strategies take advantage of the non-linear projection of the data, at the expense of feature interpretability. For example, Karmakar et al. (2018) modelled the bottleneck with Gaussian Mixture Models in order to identify stellar clusters in images. Villar et al. (2020) used a combination of Gaussian processes for feature extraction, a recurrent autoencoder for dimensionality reduction, and a supervised random forest (Breiman, 2001) as their final classifier. Tsang and Schultz (2019) used the latent space modelling of Zong et al. (2018) to classify variable star lightcurves. The framework comprises a recurrent neural network AE and a GMM to model the latent space. The GMM is further used to provide one-hot classification. Therefore, their dual-network approach classifies the data, and at the same time identifies anomalies. The authors find that their approach achieves accuracies comparable to supervised methods for known classes. Ralph et al. (2019) (Fig. 5) used a series of algorithms to project radio images into progressively lower dimensions, ultimately clustered with k-means. However, as the authors discuss, significant assumptions are made when assigning hard boundaries of classes on a SOM, that should not be left without further investigation, especially keeping in mind that SOM cells do not form pure categories.
Recently, Forest et al. (2019) proposed a method that combines an AE with a SOM at the bottleneck, training both at the same time. Inspired by this idea, Mong et al. (2023) developed a framework to classify 'real vs bogus' transient sources. They find that the DESOM framework requires long training times; therefore they decoupled the AE and the SOM, using a SOM projection which takes as input the latent space of the AE. However, the authors find that the current performance of their approach is not competitive against convolutional neural networks trained on the same data. Even though the framework can be used as an extra flagging step, significant work is needed, in particular in extracting features from the images and modelling the bottleneck.

Weakly-supervised
Weakly supervised learning is learning from noisy or poorly labelled data. In their review, Zhou (2017) highlights three situations that lead to weak supervision. Incomplete supervision reflects the fact that not all classes have assigned labels. This is a very common phenomenon in astronomy, as even the definition of boundaries between classes is non-trivial. Inexact supervision corresponds to the fact that a label might represent a particular part of an image, e.g. the morphology of the central galaxy, while neighbours are present in the same cutout. Finally, inaccurate supervision refers to the fact that labels might be wrong. This is a common occurrence in astronomy, whether the labels come from automated pipelines (see Pâris et al., 2018; Alexander et al., 2023, for examples in SDSS and DESI spectroscopy) or human annotators (see Huertas-Company et al., 2015; Walmsley et al., 2020, for examples of vote distributions among experts and volunteers alike).
All the above are present in astronomy data, with the added complication that the existing labels are more often than not a biased sample, e.g.mostly high signal-to-noise ratio.In the following, we discuss further only the methods related to unsupervised learning, relating to incomplete supervision.

Semi-supervised
Particularly in astronomy, labelled training sets are bound to be not only the best quality examples but also typically well understood sources. Therefore, inferring on a larger sample leads to out-of-distribution problems. Semi-supervised learning blurs the line between the traditional unsupervised vs supervised split, by using both labelled and unlabelled data during training.
In their review, van Engelen and Hoos (2020) summarise the underlying assumptions that need to hold true for semi-supervised learning to work (see also Chapelle et al., 2006, for a detailed discussion). These are the following:
• Smoothness: any two input points close to each other should lead to close-by points in the target distribution.
• Cluster: points belonging to the same cluster should belong to the same class.
• Manifold: the data can be embedded in a lower dimensional manifold.
There is a number of strategies utilised for semi-supervised learning, largely split into inductive and transductive strategies.
The former aim to create a predictive model, i.e. a model that can be applied to unseen data, such as a neural network. In contrast, the latter aim to answer the problem within the dataset available to the algorithm, e.g. find a lower dimensional embedding. The approach of combining known and unknown samples is standard practice in astronomy, and therefore there are a number of applications of semi-supervised learning, from AGN to supernova classification (Lawlor et al., 2016; Villar et al., 2020; Slijepcevic et al., 2022, to name a few).
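A transductive strategy can be sketched with a minimal graph-based label-propagation routine, which spreads the few known labels to unlabelled points over a similarity graph. The implementation below is illustrative only; the RBF bandwidth `sigma`, the iteration count, and the toy two-blob dataset are assumptions, not taken from any cited work:

```python
import numpy as np

def label_propagation(X, y, n_iter=200, sigma=1.0):
    """Transductive label propagation: spread known labels (y >= 0)
    to unlabelled points (y == -1) over an RBF similarity graph."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))                 # pairwise affinities
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)               # row-stochastic transitions
    classes = np.unique(y[y >= 0])
    F = np.zeros((n, len(classes)))
    F[y >= 0, np.searchsorted(classes, y[y >= 0])] = 1.0
    Y0 = F.copy()
    for _ in range(n_iter):
        F = P @ F                                      # diffuse label mass
        F[y >= 0] = Y0[y >= 0]                         # clamp the labelled points
    return classes[F.argmax(axis=1)]

# Two well-separated blobs with one label each: the rest are inferred.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = -np.ones(40, dtype=int)
y[0], y[20] = 0, 1
pred = label_propagation(X, y)
```

This directly encodes the cluster assumption above: points in the same high-affinity region end up with the same class.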
A related approach that is outside the scope of this review is active learning, which progressively incorporates the prediction the model is most uncertain about, drawn from the entire dataset. The data point is presented to an oracle, added to the training set, and the loop starts again (Settles, 2009; Stevens et al., 2021; Lochner and Bassett, 2021; Walmsley et al., 2020).
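The query step of such an active-learning loop can be sketched as uncertainty sampling, here picking the unlabelled source with the highest predictive entropy; the toy class probabilities are invented for illustration:

```python
import numpy as np

def uncertainty_sampling_step(proba, labelled_mask):
    """Pick the unlabelled point whose predicted class probabilities are
    closest to uniform (highest entropy) to present to the oracle."""
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    entropy[labelled_mask] = -np.inf       # never re-query labelled points
    return int(entropy.argmax())

# Toy predictions for four sources; source 2 is the most ambiguous.
proba = np.array([[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.7, 0.3]])
labelled = np.array([True, False, False, False])
query = uncertainty_sampling_step(proba, labelled)   # → 2
```

The queried index would then be labelled by the oracle, added to the training set, and the model retrained.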

Self-supervised
Self-supervised learning first learns a pretext task, aimed at extracting semantic information from the data. The pretext task comprises training a network to identify the original image based on transformed versions of the data (i.e. rotated, occulted, colourised, etc.). The learned model is then used as an initial guess for the downstream task, i.e. classification or object detection. Two main approaches are used in the literature for pretext training: 1) contrastive learning, and 2) non-contrastive learning. In the first case, the algorithm is presented with variations of the positive example, such as zoomed-in and zoomed-out views of the image, cutouts, or rotated or scaled versions of the original. At the same time, some negative examples are also given to the algorithm. The pretext task is to place the learned representations of the positive examples as close as possible in the learned latent space, while keeping the negative examples further away. This contrastive learning forces the algorithm to generalise the learned features of the positive class (for a recent review see Huertas-Company et al., 2023). The second, non-contrastive, approach uses only positive examples as input. This strategy involves 'safety gates' such as stop-gradient operations to ensure that the model does not suffer mode collapse, i.e. learn an embedding of lower dimension than required to describe the data. Some of the works described previously (§6.1) might fall under non-contrastive learning, as long as the input data have been augmented. Since the learned representations form an abstract description of the data, they can be used as input to further downstream tasks, which is a very desirable feature.
In the method A Simple Framework for Contrastive Learning of Visual Representations (simCLR, Chen et al., 2020b), the authors highlight that the major reasons for the improved performance of their approach are 1) multiple data augmentations, 2) a non-linear transformation between the representation and the contrastive loss, 3) normalised embeddings, and 4) a deep and wide architecture, large batch sizes, and longer training time. In a sense, the above observations offload the effort of labelling to computing time, which depending on the application might or might not be desirable. On the other hand, Bootstrap Your Own Latent space (BYOL, Grill et al., 2020) is an example of the non-contrastive self-supervised approach. BYOL does not use negative examples during the learning phase, which would be used, e.g. in the case of simCLR, to avoid mode collapse. Instead, the authors use two networks, the online and target neural networks, which are built to interact and learn from each other. The online network predicts the target network's representation of a different augmentation of the same image. The learning is refined by iterating the procedure, flipping between the online and target networks. This method has been shown to outperform simCLR on everyday images, and it is slowly gaining popularity in astronomy (Guo et al., 2022; Slijepcevic et al., 2024).
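The contrastive objective used by simCLR-style methods can be sketched as the NT-Xent (normalised temperature-scaled cross-entropy) loss. The NumPy version below is a simplified illustration, not the simCLR implementation; the temperature `tau`, the batch size, and the synthetic "augmented" embeddings are assumptions:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss on two augmented views: z1[i] and z2[i] are
    representations of the same source; all other pairs are negatives."""
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # normalised embeddings
    sim = z @ z.T / tau                                # temperature-scaled cosine
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive indices
    logp = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -logp.mean()

rng = np.random.default_rng(1)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.01 * rng.normal(size=(8, 16))   # "augmented" views stay close
loss_aligned = nt_xent(z1, z2)              # aligned positives: low loss
loss_random = nt_xent(z1, rng.normal(size=(8, 16)))  # unrelated views: high loss
```

The loss is minimised when each pair of views of the same object is closer than all negatives, which is exactly the pretext task described above.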

Transfer learning & Domain adaptation
Another approach to overcome the bottleneck of obtaining large labelled datasets and of working with out-of-distribution data is transfer learning. Domain adaptation is a subset of transfer learning, whereby the objective is to match a source distribution, S, that contains labelled data, to a target distribution, T, whose data might not be drawn from an identical distribution as the source. The goal of domain adaptation is to bridge the difference between the source and target distributions, often called a shift; the ultimate goal is to link the joint distribution p(x, y) = p(y)p(x|y) between the source, p_S(x, y), and target, p_T(x, y), domains. Transformation equations are already used to, e.g., map instrument photometric systems to AB magnitudes, or to map from one standard photometric system to another (Bessell, 2005). Such transformations are a precursor of domain adaptation, which, in its modern form, is a nascent field in astronomy. Recent applications include unsupervised domain adaptation of stellar spectra (O'Briain et al., 2020). They used the UNIT framework (Liu et al., 2017), which creates a common latent space with a VAE, trained under adversarial conditions. O'Briain et al. (2020) created two latent spaces, one shared between synthetic and real data and one only for real data. They found that the shared latent space was key in linking between real and synthetic spectra. Other recent applications include the prediction of physical parameters using simulation-based inference with a variety of methods (kernel PCA, Gilda et al., 2021), including supervised approaches that are outside the scope of this review. Ćiprijanović et al.
(2023) created a universal domain adaptation framework, DeepAstroUDA, using semi-supervised learning, with an application to galaxy morphology. In their paper, they give an excellent overview of domain adaptation methods relevant to astronomy. Their method combines the key ingredients of cross-entropy clustering to cluster source data, adaptive clustering to associate unlabelled data with labelled examples, and finally an entropy loss to separate emerging new categories, including outliers, from the data. They show that their method transfers knowledge successfully in the feature space, applicable to real-world and astronomical data.
We are confident that domain adaptation will soon find many more applications in astronomy, as we embrace simulation-based or likelihood-free inference.
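A minimal numerical illustration of one such shift: under covariate shift, p_S(x) differs from p_T(x) while p(y|x) is shared, so labelled source samples can be importance-weighted by w(x) = p_T(x)/p_S(x). The Gaussian source and target densities and the linear relation below are purely illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1d Gaussian density, used here for both domain distributions."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
x_src = rng.normal(0.0, 1.0, 5000)                # labelled source sample
y_src = 2.0 * x_src + rng.normal(0, 0.1, 5000)    # shared relation p(y|x)
# Target population has the same y(x) but a shifted p(x): N(1.5, 1).
w = gaussian_pdf(x_src, 1.5, 1.0) / gaussian_pdf(x_src, 0.0, 1.0)
# The weighted source mean approximates the target-domain mean E_T[y] = 3.
est = np.average(y_src, weights=w)
```

Without the weights, the plain source mean of y would be near 0, badly misrepresenting the target population.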

Recommendations
Below are a few general recommendations based on past experience and many discussions with ML practitioners. See also Buchner et al., sub., for recommendations to newcomers in the field of ML in astronomy.

Appropriateness
Machine learning should be used only when necessary. Certain ML approaches can be used as alternatives to traditional methods, such as least-squares minimisation, keeping in mind that they will carry all the prior assumptions and biases of those methods. ML methods can also be used as exploration tools, such as finding the number of classes in the data or looking for outliers. The pitfall in such exploration is that we always get back an answer. This means that we might look for distinct classes when a continuous distribution is more appropriate, or use a very biased sub-population to draw conclusions for the parent distribution. Manifold learning methods such as t-SNE and UMAP are particularly prone to hallucinating new clusters.
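The 'we always get back an answer' pitfall is easy to demonstrate: a minimal k-means, sketched below in NumPy with illustrative settings, happily partitions perfectly structureless uniform data into whatever number of clusters it is asked for:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means: it always returns k clusters,
    whether or not the data contain any."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1).argmin(1)
        # keep a centre in place if its cluster empties out
        centres = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centres[j] for j in range(k)])
    return labels

# Perfectly structureless data: k-means still returns five "clusters".
X = np.random.default_rng(0).uniform(size=(500, 2))
labels = kmeans(X, 5)
```

Nothing in the output warns that the five groups are artefacts of the algorithm rather than structure in the data; that check is left to the practitioner.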

Benchmarks
Computer science practitioners have invested many years in creating benchmark datasets (MNIST, CIFAR, etc.). Astronomy applications are far from such widely used benchmarks, probably for good reasons. However, data challenges have emerged for specific applications that pit codes against each other (e.g., Hildebrandt et al., 2010; Euclid Collaboration et al., 2020; Savić et al., 2023; Hartley et al., 2023). Such data challenges, while very useful, need to be interpreted with a pinch of salt when run on simulated datasets, due to domain shifts expected between data and simulations (see Section 6.4).

Visualisation
Input space: Topcat (Taylor, 2005) is an extremely powerful data exploration tool. We recommend always plotting projections and 3d plots of the input parameter space. Astronomy data are full of problematic artifacts that creep into the final products, even for the most sophisticated pipelines. Data correlation matrices, when the number of dimensions allows, are very useful for dropping extraneous features.
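A minimal sketch of such a correlation check; the feature names, the 0.95 threshold, and the toy flux/magnitude data are illustrative choices:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Flag features whose absolute Pearson correlation with an
    earlier feature exceeds the threshold, as candidates to drop."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)               # each pair counted once
    return [names[j] for j in range(len(names))
            if (upper[:, j] > threshold).any()]

rng = np.random.default_rng(2)
flux = rng.normal(10, 1, 1000)
mag = -2.5 * np.log10(flux)                  # deterministic function of flux
colour = rng.normal(0, 1, 1000)              # independent feature
X = np.column_stack([flux, mag, colour])
to_drop = drop_correlated(X, ["flux", "mag", "colour"])
```

Here the magnitude is flagged as redundant with the flux, while the independent colour survives.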
Latent space: When examining a latent space, it is important not to over-interpret structures; this is critically important when examining the latent representation of a labelled sample (e.g. a spectroscopic sample of galaxies). As shown in Figure 3, the latent space of a subsample might show different substructure compared to the parent sample.
Output: Confusion matrices are very informative, especially when more than two classes are involved. It is good practice to write the number of objects per cell in addition to the colourbar. We recommend that the colourbar is fixed to the range [0, 1]. We encourage the field to adopt, as standard practice, assigning uncertainties to the model performance. Random seeds should be kept fixed during development and mentioned in publications; when left free, they can induce significant variation in the trained model.
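A sketch of the recommended practice: compute a row-normalised confusion matrix for the fixed [0, 1] colourbar while keeping the raw counts for per-cell annotation. The toy labels are invented for illustration:

```python
import numpy as np

def normalised_confusion(y_true, y_pred, n_classes):
    """Return raw counts and the row-normalised confusion matrix,
    so counts can be written in each cell while the colourbar
    stays fixed to [0, 1]."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)   # accumulate per (true, pred) cell
    frac = counts / counts.sum(axis=1, keepdims=True)
    return counts, frac

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])
counts, frac = normalised_confusion(y_true, y_pred, 3)
```

Each row of `frac` sums to one, so the same colour scale is comparable across models and datasets.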

Generalisation
New ideas must necessarily pass through a phase of experimentation and exploration before they are widely accepted. ML is no different. The enthusiasm of getting closer to 'intelligent systems' and 'knowledge discovery from data' is palpable. However, it is paramount that we distinguish between theory and practice. Many works have used simplified, simulated data to demonstrate zeroth-order feasibility of a method. The reality of astronomical data does not stop there. Missing data, either not observed or not detected, noisy data, instrumental effects, and so on, need to be incorporated as part of the modelling, either directly in the algorithm (e.g. Bayesian neural networks) or through bootstrapping.
It is worth noting that some areas, such as gravitational waves, are in the exact opposite situation compared to galaxy evolution. Namely, gravitational waves benefit from a strong theoretical foundation and are currently in the process of collecting larger observed samples. The approach and goal in this case are different: ML is used as part of the data reduction (e.g. denoising) or for signal detection as anomalies, and not for clustering.

Barriers to discovery
We recommend bravely exploring higher dimensional spaces and prioritising continuous distributions over classification whenever possible. It is tempting to refine the Hubble morphology sequence of galaxies using a data-driven approach (e.g. Cheng et al., 2021, for an interesting discussion). Galaxy morphologies, nevertheless, most likely occupy a continuous space, and discovery tools based on similarity might be more informative (e.g., Walmsley et al., 2022, https://mwalmsley-decals-similarity-similarity-papkyg.streamlit.app/).
Often the model hyperparameter choices are driven by prior knowledge and intuitive expectations based on highly biased and incomplete examples drawn from past datasets. We recommend trusting the data over models when the goal is the discovery of new phenomena. Allen et al. (2024), in their review, discuss categories of discoveries, model interpretability, and validation in ML.

Summary
This review summarised the usage of unsupervised learning in astronomy applications. Following a learning workflow, we summarised the most popular methods used in astronomy, highlighting only some of the many thousands of published works. We observed a few patterns: some works set out to use machine learning to solve a well-defined and specific problem, others use a suite of algorithms on a common training dataset, and others use machine learning as part of a larger analysis framework. Recently, a number of works combine machine-learning algorithms in very complex pipelines, sometimes with unclear advantages. The field is moving towards complex architectures, which aim to link domains in order to extract knowledge by leveraging past efforts.
However, no method is infallible. The choice and training of a machine-learning algorithm, combined with implicit and explicit prior assumptions, introduced for example in the construction of the training set, harbour biases that might be difficult to diagnose. It is important that the scientific question is well defined before settling on any method. Even though experimenting with new ML algorithms is useful, and certainly fun, domain knowledge shall remain the ultimate tool in the astronomer's toolbox.

Figure 1: Graphical overview of unsupervised learning in astronomy.

Figure 2: Dimensionality reduction with (a) Isomap, which finds the geodesic between neighbouring points, and (b) Locally-linear Embedding, which finds neighbourhoods where a linear approximation holds. Both methods are showcased on the 'Swiss roll' dataset, where the curve of the manifold could cause a 'short-circuit' if a large neighbourhood is chosen. Fig. (a) adapted from Tenenbaum et al. (2000), fig. (b) adapted from Roweis and Saul (2000).

Figure 3: UMAP and t-SNE projections of a 10% subsample (red) and the full (blue) flow cytometry dataset. It is visually clear that projecting new data onto a clustered space based on either of these projections leads to gross misalignment, here quantified using Procrustes-based alignment. Figure adapted from McInnes et al. (2018).
Automated classification of stellar spectra has remained among the most researched topics, as it has traditionally relied heavily on human visualisation. SOM applications on stellar and galaxy spectra include Xue et al. (2001); Teimoorinia et al. (2022), while lightcurve classification with SOMs has been attempted by Armstrong et al. (2016); Sasdelli et al. (2016). Other popular applications of SOMs include the morphological classification of optical (Díaz-García et al., 2019; Holwerda et al., 2022) and radio sources. Variants of the AE include denoising auto-encoders, which learn to reconstruct a corrupted version of the data; convolutional auto-encoders (CAE, Masci et al., 2011), which use convolution layers in the encoder part of the AE; and Variational Auto-encoders (VAE, Kingma and Welling, 2013), which substitute the bottleneck with Gaussian distributions whose mean and standard deviation are learned during training. This means that the decoder can be used as a probabilistic generative model. Vector Quantised-Variational Auto-Encoders (VQ-VAE, van den Oord et al., 2017) substitute the latent random variables of VAEs with categorical variables, with applications relating to speech and language.
classified radio AGN based on their morphologies, and Mesarcik et al. (2020) used AEs to reconstruct LOFAR data and assess the quality of the images.

Figure 5: Complex frameworks are emerging in modern ML applications. Here we show one example of modelling the auto-encoder latent space with a self-organising map, subsequently clustered with k-means. Figure adapted from Ralph et al. (2019).

Figure 6: Application of contrastive self-supervised learning. Input features are augmented (rotation, flip, etc.), and the contrastive loss is trained to bring representations of the same object closer in the latent space. Figure adapted from Hayat et al. (2021).
Hayat et al. (2021) applied self-supervised contrastive learning to 1.2 million SDSS images (Fig. 6), by combining the architectures of Chen et al. (2020b) and He et al. (2020). They introduced augmentations relevant to astrophysical data, including galactic reddening and point-spread function smoothing. They used UMAP to visualise the 2048-dimensional representation, which shows that the relative ordering of the learned representations corresponds to galaxy morphology and orientation. They further showed that the learned representations can be used to assess galaxy morphology and estimate photometric redshifts. They reported a training time of 12h, on eight NVIDIA V100 GPUs, for 50 epochs. Since then, further works have explored the use of various contrastive-learning architectures, starting from galaxy images. To name a few, Stein et al. (2021) trained on a sample of 3.5 million galaxy cutouts from the Legacy Survey DR9, using the momentum architecture of He et al. (2020). Subsequently, they applied the trained network to 42 million galaxies from the same survey. The resulting representations were then used to identify similar objects; the dataset can be queried based on similarity on a public webpage. Sarmiento et al. (2021) used spatially resolved images of 9,507 MaNGA galaxies and recovered the two to three major galaxy populations (passive, star-forming, and intermediate) by performing k-means in the representation space. Other recent works that exploit contrastive learning on large galaxy samples include Wei et al. (2022) and Vega-Ferrero et al. (2023).

Table 1: Reviews on the process of learning from data.


Table 2: Domain adaptation shift types and necessary conditions for training, after Farahani et al. (2021). Comparing p_S(x, y) and p_T(x, y), we can expect one of three discrepancies (prior shift, covariate shift, or data drift), based on the origin of the disagreement. Table 2 summarises the shift types and how they relate to the marginal distribution p(x), the prior, p(y), and the conditional distribution p(y|x). Zhuang et al. (2020) give a structured description of the various methods and strategies available in the ML literature, split mainly into instance-based and feature-based, while Farahani et al. (2021) give an overview of domain adaptation approaches in shallow and deep learning.

Table A.3: Dimensional reduction methods discussed in this work.