Uncovering Tidal Treasures: Automated Classification of Faint Tidal Features in DECaLS Data

Tidal features are a key observable prediction of the hierarchical model of galaxy formation and contain a wealth of information about the properties and history of a galaxy. Modern wide-field surveys such as LSST and Euclid will revolutionise the study of tidal features. However, the volume of data will prohibit visual inspection to identify features, thereby motivating a need to develop automated detection methods. This paper presents a visual classification of $\sim2,000$ galaxies from the DECaLS survey into different tidal feature categories: arms, streams, shells, and diffuse. We trained a Convolutional Neural Network (CNN) to reproduce the assigned visual classifications using these labels. Evaluated on a testing set where galaxies with tidal features were outnumbered $\sim1:10$, our network performed very well and retrieved a median $98.7\pm0.3$, $99.1\pm0.5$, $97.0\pm0.8$, and $99.4^{+0.2}_{-0.6}$ per cent of the actual instances of arm, stream, shell, and diffuse features respectively for just 20 per cent contamination. A modified version that identified galaxies with any feature against those without achieved scores of $0.981^{+0.001}_{-0.003}$, $0.834^{+0.014}_{-0.026}$, $0.974^{+0.008}_{-0.004}$, and $0.900^{+0.073}_{-0.015}$ for the accuracy, precision, recall, and F1 metrics, respectively. We used a Gradient-weighted Class Activation Mapping analysis to highlight important regions on images for a given classification to verify the network was classifying the galaxies correctly. This is the first demonstration of using CNNs to classify tidal features into sub-categories, and it will pave the way for the identification of different categories of tidal features in the vast samples of galaxies that forthcoming wide-field surveys will deliver.


INTRODUCTION
Faint tidal features are a crucial observational tracer of hierarchical galaxy formation and evolution in the ΛCDM Universe. In this framework, galaxies grow in size and mass by accreting or merging with other galaxies (e.g. White & Rees 1978; White & Frenk 1991). These mergers are often described as being major (mass ratio of ∼ 0.1−1) or minor (mass ratio ≲ 0.1; e.g. Davies et al. 2015; Hendel & Johnston 2015). A significant proportion of a galaxy's mass can originate from material accreted from other galaxies and, in particular, from minor mergers (e.g. Oser et al. 2010; Ownsworth et al. 2014), which are expected to be more common than major mergers (e.g. Fakhouri et al. 2010). Tidal features are the debris left behind by these interactions (e.g. Toomre & Toomre 1972), either from recently accreted material from minor mergers or late-stage relics of major ones. The morphology of these features varies widely (see, e.g. Quinn 1982); however, most are typically low surface brightness (LSB). This LSB nature of tidal features makes them very challenging to study, and shallow imaging surveys will generally miss all but the most extreme examples. However, tidal features in galaxy outskirts are generally long-lived (∼ 1−5 Gyr; see, e.g. Ji et al. 2014; Mancillas et al. 2019; Yoon & Lim 2020) and predicted to be increasingly common at decreasing surface brightness levels (e.g. Johnston et al. 2008; Vera-Casanova et al. 2022).

★ E-mail: alexander.gordon@ed.ac.uk
Tidal features contain a wealth of information about a galaxy's past. As tidal features are direct byproducts of galaxy mergers, it is possible to reconstruct the merger history of a galaxy by studying its tidal features (e.g. Johnston et al. 1999, 2008). In particular, the morphology of the tidal features can be connected to the stellar kinematics and formation history of the galaxy (e.g. Valenzuela & Remus 2024) and can probe the orbital distribution of progenitor galaxies (e.g. Hendel & Johnston 2015). Hence, there is interest in measuring the morphology of the tidal features, over and above generating a large sample where some kind of feature is present. Mergers and interactions have significant impacts on the stellar kinematics of galaxies (e.g. Yoon et al. 2022), and tidal stripping could be a considerable source of star formation suppression (e.g. Spilker et al. 2022). Tidal features can also be studied to give deeper insights into dark matter (e.g. Sanderson et al. 2015; Bovy et al. 2016; Pearson et al. 2022) and as tests of the cosmological model (e.g. Johnston et al. 2001; Conselice et al. 2014). However, to make use of this information, it is necessary to build up a significant sample of identified tidal features, which requires probing deep limiting surface brightnesses.
Forthcoming wide-field surveys, such as the Legacy Survey of Space and Time (LSST; Ivezić et al. 2019) at the Vera C. Rubin Observatory and the European Space Agency's Euclid (Laureijs et al. 2011), as well as other deep galaxy imaging projects (e.g. ARRAKIHS; Guzmán et al. 2022), will revolutionise the study of the low surface brightness Universe, including tidal features. These surveys will probe deep limiting surface brightnesses over a vast sky area and uncover many galaxies with tidal features. For example, Euclid will image around 15000 deg$^2$ to a limit of $\mu_{\rm VIS} = 29.5$ mag arcsec$^{-2}$ (Euclid Collaboration et al. 2022) and LSST almost 18000 deg$^2$ to $\mu = 30.3$ mag arcsec$^{-2}$ after 10 years (Yoachim 2022). LSST alone is predicted to discover millions of tidal features (Martin et al. 2022). However, the scientific potential of this data will only be realised if methods are developed to manage the vast volume of data.
Most of the effort to classify or characterise tidal features has involved one or more experts spending extensive amounts of time visually inspecting each image of a galaxy to identify whether or not a tidal feature is present (see e.g. Atkinson et al. 2013; Bílek et al. 2020; Martin et al. 2022; Sola et al. 2022; Desmons et al. 2023), even where there were some automated aspects to the process (e.g. Kado-Fong et al. 2018). This approach is workable for a modest number of galaxies; however, it will only scale to a small fraction of the data from forthcoming surveys. It is, therefore, necessary to develop an automated process to identify and classify tidally-disturbed galaxies.
This issue of too much data is not isolated to tidal feature detection. Many researchers are turning to machine learning (ML) to automate and accelerate analysis with data from these modern surveys. There has been a significant amount of effort to use ML to automate the classification of the overall morphology of a galaxy, such as separating between spirals and ellipticals (see e.g. Dieleman et al. 2015; Huertas-Company et al. 2015; Domínguez Sánchez et al. 2018; González et al. 2018; Barchi et al. 2020; Walmsley et al. 2020; Fielding et al. 2021; Reza 2021; Vega-Ferrero et al. 2021; Zhang et al. 2022; Xu et al. 2023). Several different ML approaches can work for this task; however, Cheng et al. (2020) demonstrated that Convolutional Neural Networks (CNNs) were often the best performing. CNNs are a popular and state-of-the-art method in computer vision problems and perform remarkably well at image classification tasks; because of this, they have become widely used in astronomy (see, e.g. Fluke & Jacobs 2020; Huertas-Company & Lanusse 2023, for a review). Some researchers have attempted to use unsupervised ML to address the scalability issue (e.g. Hocking et al. 2018; Martin et al. 2020; Spindler et al. 2021; Fielding et al. 2022), which has the benefit of not requiring pre-labelled training data and hence much less effort by inspectors. However, it is not always straightforward or guaranteed that the representation clusters will match astrophysical phenomena.
In a related task to identifying tidal features, some researchers have attempted different ML approaches to classify instances of galaxy mergers, with several using Neural Networks or CNNs (see, e.g. Ackermann et al. 2018; Pearson et al. 2019; Ćiprijanović et al. 2020; Ferreira et al. 2020; Suelves et al. 2023). Most of these approaches obtained similar, if not better, results than more traditional numerical markers such as concentration, asymmetry, or a combination of markers (see, e.g. Nevin et al. 2019, for an example of a more conventional approach). Furthermore, ML has been used to identify strong gravitational lenses in images (e.g. Jacobs et al. 2017; Petrillo et al. 2017; Lanusse et al. 2018; Petrillo et al. 2019), which is a similar problem to that of tidal feature detection with extended regions of low surface brightness material. Thus, ML and CNNs are highly useful and, arguably, essential tools for researchers to extract meaningful science from the volume of data from modern surveys, and are likely suitable tools to classify tidal features.
There have been some attempts in the literature to perform binary classification of tidal features using supervised ML (e.g. Walmsley et al. 2019; Domínguez Sánchez et al. 2023; Desmons et al. 2024). A binary classifier aims only to decide whether or not a tidal feature is present without then attempting to categorise those features. While binary classifications can indicate the frequency of minor mergers and accretions, insights into galaxy assembly come from more detailed analyses, such as the nature of the different features and their exact morphologies (e.g. Varghese et al. 2011; Hendel & Johnston 2015; Nibauer et al. 2023).
A particularly relevant study is that of Walmsley et al. (2019, hereinafter W19), who trained three binary classifiers to detect tidal features in Canada-France-Hawaii Telescope Legacy Survey (CFHTLS; Gwyn 2012) data. This data covered around 170 deg$^2$ to a depth of $\mu \sim 27.1$ mag arcsec$^{-2}$ and the galaxy sample had previously been visually inspected by Atkinson et al. (2013) for the presence of tidal features. W19 used those labels to train their classifiers, with just 305 galaxies showing evidence of a tidal feature. The first of their classifiers was a single CNN, with the other two being five single classifiers combined into an ensemble with different configurations. They found that the performance was better for the ensembles than a single CNN and was similar for both configurations. The ensembles recovered an average of 76 ± 2 per cent of the true instances of galaxies with tidal features (true positive rate; TPR) for a contamination of only 22 per cent. Contamination here refers to galaxies without tidal features classified as having tidal features (false positive rate; FPR).
Similarly, Domínguez Sánchez et al. (2023, hereinafter DS23) used 5,835 mock images of Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP; Aihara et al. 2018) galaxies, generated from NewHorizon (Dubois et al. 2021). These mock images were previously visually inspected by Martin et al. (2022). The images were generated from just 36 parent galaxies with noise added to simulate different limiting surface brightnesses and then separated into two training sets: the original sample of $\mu = 28-35$ mag arcsec$^{-2}$ and a shallower set with additional low depth images, $\mu = 26-35$ mag arcsec$^{-2}$. These two training sets were then used to train CNNs. At a level of 20 per cent contamination, the network trained on the deeper dataset, for which we estimated a true positive rate (TPR) of 0.875 ± 0.005 by reading their figure, performed better than the one trained on the shallower set, with TPR = 0.815 ± 0.005. The authors attempted to transfer their network to real HSC-SSP data but reported that the results were significantly degraded compared to their simulated counterparts, and no performance statistics were provided.
Desmons et al. (2024, hereinafter DBL24) used HSC-SSP Ultra-Deep data, covering only 3.5 deg$^2$ of the sky but with a much deeper surface brightness limit of $\mu \sim 29.82$ mag arcsec$^{-2}$, to train a binary classifier. The sample was also small, at just 380 galaxies with tidal features. Still, it was significantly improved in depth for low surface brightness detection. Instead of a purely supervised or unsupervised approach, DBL24 used a self-supervised process that is somewhat of a hybrid between supervised and unsupervised. Overall, this allowed the network to be trained with fewer labelled data, saving inspectors significant time and effort. DBL24 trained their network to identify important parts of the images using an unsupervised approach that compared augmented (e.g. rotated, translated, etc.) versions of the same image. They then trained the final part of the network, which identified whether or not a tidal feature was present, using the already trained part to extract important image features and a small number of labelled data. DBL24 report that their network achieved a TPR of 0.94 ± 0.1 for a false positive rate (FPR) of 0.2.
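The works above are compared through the true positive rate reached at a fixed level of contamination (false positive rate). As an illustrative sketch only, and not the evaluation code of any of the cited studies, the following Python shows one way to read such a number off a set of classifier scores. The function name `tpr_at_fpr` and the toy scores and labels are assumptions made for this example.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.2):
    """Highest true positive rate achievable while keeping FPR <= max_fpr.

    scores: classifier outputs (higher means more likely to have a feature).
    labels: 1 for galaxies with a tidal feature, 0 for those without.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_tpr = 0.0
    # Try every distinct score as a decision threshold.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Toy example (invented numbers): 4 galaxies with features, 10 without.
labels = [1, 1, 1, 1] + [0] * 10
scores = [0.9, 0.8, 0.6, 0.15,
          0.7, 0.5, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]

print(tpr_at_fpr(scores, labels, max_fpr=0.2))
```

In this toy case, allowing two of the ten featureless galaxies to be misclassified recovers three of the four true features. Demanding zero contamination recovers only two.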
In this paper, we go beyond simply the detection of tidal features in galaxies and conduct the first exploration of Convolutional Neural Networks to classify tidal features into appropriate sub-categories. Section 2 details the data and the selection criteria we applied to produce a training set. Section 3 discusses how we used visual inspection to generate a label for each galaxy representing its tidal features. Section 4 presents the network we used and the results of applying that to recreate our labels. Section 5 discusses some suggested improvements and the issues we encountered, and we conclude in Section 6.

THE DATA
To train a CNN to achieve a satisfactory level of accuracy, we needed an extensive dataset comprising identified and labelled tidal features. Despite the depth being notably shallower than the expectations for forthcoming imaging surveys, we chose to use data from the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019). This was motivated by the fact that it is one of the largest uniform datasets currently available, and a significant portion of it has previously undergone morphological study by Walmsley et al. (2022, hereinafter W22), which we could build upon for our work on tidal feature identification. This prior work provided an initial indication of the presence of a tidal disturbance, enabling us to prune down the sample to contain only those sources most deserving of visual inspection. Furthermore, this dataset served as an excellent testbed for developing our concepts and allowed us to compare with previous works on the subject.

DECaLS
DECaLS aimed to identify targets for the Dark Energy Spectroscopic Instrument (DESI) survey and used the 4m Blanco telescope at the Cerro Tololo Inter-American Observatory in Chile. Around 9000 deg$^2$ of the sky was imaged using a three-pass tiling system, where each pass was slightly offset from the others. The survey was kept as uniform as possible by using dynamic observation to automatically select targets and exposure times based on observing conditions. The result of this was around 835 million objects imaged in the g, r, and z bands with a native pixel scale of 0.262 arcsec px$^{-1}$. The limiting point source depth in the g band is 23.95 (AB) for a median 5$\sigma$ detection. We estimated the 3$\sigma$ limiting surface brightness magnitude in 10′′ × 10′′ boxes to be $\mu \sim 28.8$ mag arcsec$^{-2}$, by following the procedure set out in Román et al. (2020). Figure 1 presents normalised histograms and median estimates of the limiting surface brightness magnitude across ∼ 2000 images of galaxies for each DECaLS band. The images used were FITS cutouts of galaxies downloaded from the Legacy Survey cutout service at the native pixel scale. Our estimates of the DECaLS surface brightness limiting depth are similar to those quoted in other studies (e.g. Román et al. 2021; Martínez-Delgado et al. 2021).
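As a worked illustration of this kind of depth estimate, the sketch below evaluates the commonly quoted Román et al. (2020) expression, $\mu_{\rm lim}(n\sigma; \Omega\times\Omega) = -2.5\log_{10}\left[n\sigma/({\rm pix}\times\Omega)\right] + Z_p$, where $\sigma$ is the pixel-to-pixel sky noise in image units, pix is the pixel scale, and $\Omega$ is the box side in arcsec. It is not the code used in this work. The sky noise value is invented, and the zero point of 22.5 assumes images calibrated in nanomaggies, which we take to be the Legacy Survey convention.

```python
import math


def limiting_surface_brightness(sigma, pixel_scale, zero_point,
                                n_sigma=3.0, box_arcsec=10.0):
    """Limiting surface brightness (mag arcsec^-2) for an n-sigma detection
    in a box of side `box_arcsec`, following the Roman et al. (2020) form.

    sigma:       pixel-to-pixel standard deviation of the masked sky (image units)
    pixel_scale: arcsec per pixel
    zero_point:  photometric zero point of the image
    """
    return -2.5 * math.log10(n_sigma * sigma / (pixel_scale * box_arcsec)) + zero_point


# Hypothetical sky noise; the pixel scale is the DECaLS native value quoted in the text.
mu_lim = limiting_surface_brightness(sigma=0.002, pixel_scale=0.262, zero_point=22.5)
print(round(mu_lim, 2))
```

Halving the sky noise deepens the limit by $2.5\log_{10}2 \approx 0.75$ mag arcsec$^{-2}$, which gives a feel for how sensitive the estimate is to the noise measurement.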

Galaxy Zoo DECaLS
We made use of publicly available data products produced by W22 in the form of both galaxy images and classifications. In W22, the authors used DECaLS data to construct RGB Portable Network Graphics (PNG) images of the subset of galaxies that were included in the NASA Sloan Atlas (NSA). The NSA contains a catalogue of various parameters for local galaxies which were primarily imaged in SDSS and GALEX. Basing the sample on the NSA introduced two selection cuts: most selected galaxies are brighter than $r = 17.77$ unless they were included in deeper SDSS fields, and the sample has a maximum redshift of $z = 0.15$. Additionally, W22 added two further cuts: limiting the selection to galaxies with a Petrosian radius of at least 3 arcsec, such that the galaxies were sufficiently resolved for classification, and discarding any incomplete images where more than 20 per cent of the pixels in any band were missing. In total, the remaining sample comprised almost 314,000 galaxies. W22 constructed their PNG images to be 424 × 424 pixels by downloading the FITS files from the Legacy Survey cutout service. They ensured that the whole galaxy was suitably visible on the image by resizing the image to an appropriate interpolated arcsec per pixel scale. They then multiplied the g, r, and z band fluxes by 125.0, 71.43, and 52.63, respectively, such that the false colour images had an appropriate range of colours in RGB; W22 chose these values by hand. The pixels with low fluxes were then desaturated to avoid a speckled effect in the images. Finally, the fluxes were scaled by $\sinh^{-1}(x)$ to compensate for the wide range of pixel values, linearly rescaled to be in the range (−0.5, 300) to remove the brightest pixels, and then clipped to the usual (0, 255) range for PNG files.

Figure 1. The median (solid line) and 68.2 per cent confidence interval (dashed lines) were estimated by determining the limit in ∼2,000 FITS cutout images of galaxies. For each image, we followed the method set out in Román et al. (2020) and used randomly sampled 10 × 10 arcsec$^2$ boxes.
W22 used three Galaxy Zoo: DECaLS (hereinafter GZD) campaigns to obtain volunteer classifications for all galaxy images; each campaign corresponded to a different DECaLS data release: DR1, DR2, and DR5. Volunteers were asked a series of questions in a decision-tree structure to classify the bulk morphology of the galaxy, such as identifying if the galaxy was early- or late-type, how many spiral arms there were, and if a bar or bulge was present. Between DR1&2 and DR5, the decision tree questions were changed to improve clarity and direct classifications towards more specific scientific goals. For the DR1&2 campaigns, a median of 38 volunteers responded for each galaxy. However, the significantly larger DR5 campaign had a median of just five volunteers per galaxy.
As part of both decision trees, the volunteers were asked to indicate if the galaxy was merging or disturbed in some manner, the latter of which we took as a proxy for the potential to have a tidal feature. Figure 2 provides the options for indicating a disturbance presented to volunteers in both the DR1&2 (a) and DR5 (b) decision trees. W22 changed the possible answers for the merger question between DR1&2 and DR5 to reflect what the volunteers would see in the image more directly. In particular, the major and minor disturbance labels referred to the extent and size of the debris, not necessarily the origin of the disturbance as a major or minor event. Hence, galaxies indicated as either a major or a minor disturbance were likely to have tidal features.
The classifications from the GZD campaigns were then processed and used to train a machine-learning classifier. The classifier constructed by W22 was a Bayesian Convolutional Neural Network based on the EfficientNetB0 architecture (Tan & Le 2019). It was trained to predict how volunteers would have responded to the DR5 decision tree, and its accuracy varied from 77 to 99 per cent compared to the volunteer responses depending on the question being considered. We used the catalogue of predictions and the corresponding images as a starting point for our analysis.

Table 1. Number of galaxies that passed each selection criterion (columns: Criterion, Number). We applied three selection criteria to the W22 sample of 313,789 galaxies. The absolute magnitude of the galaxies was limited to the range $-19 \geq M_r \geq -22$, and images with a prediction greater than 0.1 of an artefact were eliminated. The sample was separated into those potentially having a tidal feature and those assumed not to by considering the output of the W22 classifier for the merger question. See the text for full details on this process. Galaxies where the images were unavailable were removed. Finally, some galaxies were removed after the visual inspection (see Section 3.3).

Sample Selection
From the GZD data, we generated a sample of galaxies with a high likelihood of having tidal features by imposing three selection criteria in addition to the cuts introduced in W22. We first limited the sample to have an absolute magnitude in the range $-19 \geq M_r \geq -22$. Removing the faintest galaxies restricted the number of contaminating intrinsically irregular galaxies, most of which are fainter than $M_r = -19$ in the local universe (Ann et al. 2015). At the surface brightness depth of DECaLS, faint irregular galaxies were often difficult to distinguish from tidally-disturbed galaxies. We removed these galaxies to avoid generating necessarily uncertain classification labels for them. The remaining two selection criteria were based on the machine-learning predictions for questions in the catalogue. Images predicted to have a greater than 0.1 chance of having an image artefact were removed. Finally, the last criterion selected the galaxies most likely to have some tidal feature, allowing us to reduce the number of galaxies to a manageable amount. Any galaxy with a prediction of greater than 0.4 in either the merging_major-disturbance_fraction (hereinafter major) or merging_minor-disturbance_fraction (hereinafter minor) columns in the catalogue was assumed to potentially include a tidal feature. W22 recommended using major greater than 0.6 and minor greater than 0.4 to identify post-merger and low surface brightness galaxies. We randomly sampled and inspected a small number of galaxies with varying values of both major and minor, and from this extended our threshold to include both major and minor values greater than 0.4.
Of the roughly 314,000 galaxies, 1,935 passed the above selection criteria and were assumed to be tidally disturbed. Images for seven of these were not included in the data set produced by W22, giving a set of 1,928 galaxies to visually inspect for tidal features. Furthermore, we generated a complementary sample of galaxies assumed not to have tidal features by randomly sampling those with minor, major, and merging_merger_fraction predictions of less than 0.08. The breakdown of how many galaxies passed each criterion and were subsequently separated into assumed to have a tidal feature and assumed not is provided in Table 1. We did not visually inspect this undisturbed sample, but it was included in the training set for the convolutional neural network.
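The selection logic described in this section can be summarised in a few lines of code. The following Python is a minimal sketch under stated assumptions, not the pipeline used in this work: the three `merging_*` column names are those quoted in the text, whereas `artifact_fraction`, `abs_mag_r`, the helper names, and the toy catalogue are hypothetical. It also assumes that the magnitude and artefact cuts were applied to the undisturbed sample as well as to the candidates.

```python
import random

MAJOR = "merging_major-disturbance_fraction"
MINOR = "merging_minor-disturbance_fraction"
MERGER = "merging_merger_fraction"
ARTIFACT = "artifact_fraction"  # hypothetical column name
ABS_MAG = "abs_mag_r"           # hypothetical column name


def passes_common_cuts(galaxy):
    """Absolute magnitude between -22 and -19, artefact prediction at most 0.1."""
    in_mag_range = -22.0 <= galaxy[ABS_MAG] <= -19.0
    clean_image = galaxy[ARTIFACT] <= 0.1
    return in_mag_range and clean_image


def is_tidal_candidate(galaxy):
    """Either disturbance prediction above 0.4."""
    disturbed = galaxy[MAJOR] > 0.4 or galaxy[MINOR] > 0.4
    return passes_common_cuts(galaxy) and disturbed


def is_assumed_undisturbed(galaxy):
    """All three merger-question predictions below 0.08."""
    quiet = all(galaxy[col] < 0.08 for col in (MAJOR, MINOR, MERGER))
    return passes_common_cuts(galaxy) and quiet


def split_catalogue(catalogue, n_undisturbed, seed=0):
    """Return (candidates, randomly sampled undisturbed galaxies)."""
    candidates = [g for g in catalogue if is_tidal_candidate(g)]
    pool = [g for g in catalogue if is_assumed_undisturbed(g)]
    rng = random.Random(seed)
    undisturbed = rng.sample(pool, min(n_undisturbed, len(pool)))
    return candidates, undisturbed


# Toy catalogue with invented values.
toy_catalogue = [
    {"id": 1, ABS_MAG: -20.5, ARTIFACT: 0.02, MAJOR: 0.55, MINOR: 0.10, MERGER: 0.05},
    {"id": 2, ABS_MAG: -18.2, ARTIFACT: 0.01, MAJOR: 0.70, MINOR: 0.05, MERGER: 0.02},  # too faint
    {"id": 3, ABS_MAG: -21.0, ARTIFACT: 0.30, MAJOR: 0.10, MINOR: 0.60, MERGER: 0.01},  # artefact
    {"id": 4, ABS_MAG: -19.8, ARTIFACT: 0.03, MAJOR: 0.02, MINOR: 0.03, MERGER: 0.01},  # quiet
    {"id": 5, ABS_MAG: -20.1, ARTIFACT: 0.05, MAJOR: 0.20, MINOR: 0.45, MERGER: 0.10},
]
candidates, undisturbed = split_catalogue(toy_catalogue, n_undisturbed=1)
print([g["id"] for g in candidates], [g["id"] for g in undisturbed])
```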

Preparation for Inspection
We began by regenerating the thumbnails of the tidal feature sample for visual inspection, using a stretch that enhanced the low surface brightness regions. FITS images of 500 × 500 pixels centred on each galaxy were downloaded from the Legacy Survey cutout service at the same interpolated pixel scale as in W22. We then stacked the images by summing the g, r, and z bands, ensuring sensitivity to tidal features regardless of their colour.
We tested various pixel stretch algorithms and selected the three best at enhancing the appearance of the tidal features. These were a logarithmic scaling and two novel arcsinh-based algorithms. The pixel-wise output of the novel algorithm depended on the input pixel value and four other controllable parameters: the maximum, minimum, stretch, and power. The output was inherently constrained to be in the range [0, 1] due to the normalisation. During the visual inspection, we presented the inspector with these three scaled images alongside the original produced by W22. An example of the images presented during the visual inspection is provided in Figure 3. During training, the network was presented with only the images produced by W22 (see Section 4).
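To make the idea of a parametrised arcsinh stretch concrete, the sketch below stacks three bands and applies one plausible stretch controlled by a minimum, maximum, stretch, and power. This is an assumed functional form written for illustration. It is not the novel algorithm used in this work, whose exact expression is not reproduced here, and the function names, parameter defaults, and random test image are all hypothetical.

```python
import numpy as np


def stack_bands(g, r, z):
    """Sum the g, r, and z images so features of any colour contribute."""
    return g + r + z


def arcsinh_stretch(image, vmin, vmax, stretch=10.0, power=1.0):
    """Illustrative arcsinh stretch (assumed form, not the authors' formula).

    Pixels are clipped to [vmin, vmax], normalised to [0, 1], compressed with
    arcsinh, renormalised so the maximum maps to 1, and raised to `power`.
    """
    clipped = np.clip(image, vmin, vmax)
    normalised = (clipped - vmin) / (vmax - vmin)
    scaled = np.arcsinh(stretch * normalised) / np.arcsinh(stretch)
    return scaled ** power


# Toy 500 x 500 noise images standing in for the FITS cutouts.
rng = np.random.default_rng(0)
g, r, z = (rng.normal(0.0, 0.01, size=(500, 500)) for _ in range(3))
stacked = stack_bands(g, r, z)
display = arcsinh_stretch(
    stacked,
    vmin=np.percentile(stacked, 1),
    vmax=np.percentile(stacked, 99.5),
    stretch=50.0,
    power=0.8,
)
```

A larger `stretch` boosts faint pixels relative to bright ones, which is the behaviour wanted when looking for low surface brightness debris around a bright galaxy core.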

Inspection Process
We used the Zooniverse.org platform to make and record classifications for each of the 1,928 galaxies likely to show tidal features. Zooniverse has become a valuable tool for detection and classification studies, most notably those involving citizen scientists (Lintott et al. 2008) such as the volunteers in W22. In our case, each of the inspectors (the authors) was presented with the four associated images for each galaxy, as described above and shown in Figure 3. The inspector was then asked to classify the galaxy into five non-exclusive categories: arm, stream, shell, diffuse, or uncertain. Where there were multiple features of different kinds, the inspector selected all that applied, and where there were several of the same features, the appropriate category was selected only once. Each of the three inspectors provided separate classifications, which were then combined to produce a label for every galaxy.
The choice of these categories was motivated in part to follow similar works in the literature and to have some connection to the potential astrophysical origin of the feature, thus allowing the classification to be driven towards specific science cases. Figure 4 provides an example of each type of tidal feature we chose, and a brief description of their characteristics is provided below:

Arm: tidal arms form from material that originates in the host galaxy, so the surface brightness of these features is generally higher close to the galaxy's main body and tails off with increasing radial distance. They have colours similar to the outer regions of the host galaxy and should be clearly connected to it. We follow the nomenclature of Atkinson et al. (2013) and call these features arms but note that they are also referred to as tails in the literature (see e.g. Bílek et al. 2020; Martin et al. 2020; Sola et al. 2022; Desmons et al. 2023).

Stream: streams have similar morphologies to arm features; however, they are generally not physically connected to the parent galaxy. They are often brightest near the remaining core of the progenitor, with surface brightness falling off as one moves away from this. As they originate from the disruption of a small satellite galaxy, they are usually far fainter than the parent galaxy. These features could be short, linear, or wrapped around the parent galaxy, depending on the type and inclination of the orbit. Most other literature works include a stream class (Kado-Fong et al. 2018; Bílek et al. 2020; Martin et al. 2020; Sola et al. 2022; Desmons et al. 2023) and we note that this class contains the linear class from Atkinson et al. (2013).

Shell: shells generally have some symmetry to the shape, such as a shell or fan-like body around the host galaxy and have well-defined, often brighter, edges or caustics. They are likely to originate from nearly radial mergers (Pop et al. 2018). We combine the shell and fan classes from Atkinson et al. (2013), and similarly to streams, the same literature works include a shell class.

Diffuse: diffuse features lack any well-defined symmetry to the feature or do not fit well into any of the other categories. They have reasonably irregular or asymmetric shapes; however, there can be some ambiguity between genuinely low surface brightness irregular galaxies and diffuse debris. The contrast between the galaxy's central region and the debris is a critical diagnostic. When this is small, the systems are likely to be genuinely low surface brightness or irregular galaxies, whereas significant contrast is probably more reflective of debris. The diffuse class is similar to those from Atkinson et al. (2013), Martin et al. (2020), and Desmons et al. (2023) with the same name. Some of the diffuse features may be genuinely diffuse in nature, while others may simply be very low signal-to-noise examples of some of the previously-mentioned classes.

Uncertain: when a galaxy could not be reliably classified into one of the above categories, it was labelled uncertain. Most of these galaxies consisted of those too faint or small to reliably say any feature was present or those insufficiently distinct from intrinsically irregular galaxies.
We chose not to include some classes from other works such as bridges, merger remnants, or double nuclei (Martin et al. 2020; Desmons et al. 2023) as we focus on post-merger features, and these classes likely represent a mid-merger phase.

Classification Demographics
We combined the inspector classifications for every galaxy by considering where the majority (at least 2 out of 3) agreed that a particular feature was present. For example, if one inspector said arm + stream, another said arm, and the last indicated stream + shell, the resulting label would be arm + stream. We followed this process for uncertain labels, but where a galaxy was given an uncertain designation, it was not given any other label regardless of whether it would have received a label corresponding to a tidal feature. Where the individual classifications did not agree on any label, the galaxy was classified as none. Thus, we ended up with seventeen possible labels based on the various combinations of features: specifically, none; uncertain; arm; stream; shell; diffuse; arm and stream; arm and shell; arm and diffuse; stream and shell; stream and diffuse; shell and diffuse; arm, stream and shell; arm, stream and diffuse; arm, shell and diffuse; stream, shell and diffuse; and arm, stream, shell and diffuse.
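The majority-vote rule above is simple enough to express directly. The following Python is a minimal sketch of that rule as described in the text, not the code used in this work; the function name `combine_labels` and the representation of each inspector's answer as a set of strings are assumptions made for illustration.

```python
from itertools import combinations

FEATURES = ("arm", "stream", "shell", "diffuse")


def combine_labels(classifications, min_votes=2):
    """Combine per-inspector classifications into one label.

    classifications: one set of category names per inspector.
    A category is kept when at least `min_votes` inspectors selected it.
    An agreed 'uncertain' overrides any agreed feature; no agreement gives 'none'.
    """
    def agreed(label):
        return sum(label in c for c in classifications) >= min_votes

    if agreed("uncertain"):
        return ("uncertain",)
    features = tuple(f for f in FEATURES if agreed(f))
    return features if features else ("none",)


# The worked example from the text.
example = [{"arm", "stream"}, {"arm"}, {"stream", "shell"}]
print(combine_labels(example))

# Every subset of the four features (the empty subset is 'none') plus 'uncertain'.
n_labels = sum(1 for k in range(len(FEATURES) + 1)
               for _ in combinations(FEATURES, k)) + 1
print(n_labels)
```

Counting the subsets this way reproduces the seventeen possible labels listed above: sixteen feature combinations, including the empty one, plus uncertain.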
While the ultimate goal of the visual inspection was to generate labels to train a CNN, it is also of some interest to explore the properties of the classifications. Hence, in Figure 5, we present the demographics of our classifications, split based on whether the prediction of the galaxy was from the major (solid) or minor (outline with a hollow centre) columns in W22. In both (a) and (b), the numbers provided represent the number of galaxies in each category, again divided based on the W22 prediction.
Figure 5(a) provides the number of galaxies with each feature class, noting that this Figure will count galaxies with more than one feature multiple times. Stream and arm features appear to be roughly as common as each other (16.8 and 14.7 per cent of the inspected galaxies, respectively); diffuse features appear to be the most common (28.0 per cent), and shells the least (only 3.5 per cent).
Figure 5(b) shows the number of features present for each galaxy; uncertain galaxies have been included in the none category as there was no useable feature present. Most galaxies exhibit a single tidal feature at the surface brightness depth of DECaLS. For some galaxies, all the authors agreed that there was a feature (i.e. not labelled uncertain), but no consensus on the type was reached. Hence, more galaxies are listed under none (941) than those under uncertain (866). Figure 5(c) provides the Venn diagram of the classification demographics. We observe that no galaxy contains all four features, and no combination includes both an arm and a shell feature.

Figure 5. Demographics of the visual inspection of galaxies for the presence of tidal features. Each galaxy was labelled by considering where two out of three inspectors agreed a given feature was present. (a) provides the number of galaxies that have arm, stream, shell, diffuse, and uncertain labels. Galaxies labelled with multiple features are counted multiple times in this Figure. (b) indicates how many galaxies contained multiple features. Galaxies where no label was agreed on or labelled as uncertain are included here as none. In both (a) and (b), the histograms are split by whether the W22 classifier rated the galaxy highly in the merging minor-disturbance fraction or merging major-disturbance fraction columns. Finally, (c) shows the overlap between the classifications made. In all three, the numbers shown indicate the number of galaxies in that category.
Of particular note are the minor galaxies. As shown in Figure 5, most of the uncertain and none feature categories are occupied by those indicated as minor in W22. Of the 841 galaxies labelled by W22 as a minor disturbance (43.6 per cent of the inspected sample), less than 4 per cent had a label that was not uncertain. Therefore, we decided to remove all minor galaxies entirely and instead create our training sample from the major disturbance galaxies only.
We believe this mismatch between the predictions from the W22 classifier and our inspection results from a mixture of effects. In most cases, the volunteers rated the galaxy as highly likely to have a minor disturbance and the automated classifier followed these labels. However, in some cases, the automated classifier overestimated the fraction of votes for the minor disturbance such that it was above our threshold even though the volunteers had rated it unlikely. As a curiosity, we also note that a significant proportion ($\sim20$ per cent) of the minor galaxies we removed received fewer than five votes from the volunteers. In addition to removing all of the minor galaxies, we excluded those that received an uncertain or none label. Thus, our final training sample included 956 galaxies with tidal features, about one-half of the 1,928 we initially classified. Nonetheless, this is still one of the largest samples of galaxies with faint tidal debris amassed to date.
Figure 6 presents the distributions of various properties of the galaxies in the sample, split based on whether they had a tidal feature (956, red) or not (9,981, blue). The properties include the absolute and apparent magnitudes in the r-band, the redshift, and the size of the galaxies. The sizes of the galaxies were determined by calculating the angular diameter distance from the redshift and combining this with the Petrosian half-light radius. We assumed the Planck Collaboration et al. (2020) cosmology with $H_0 = 67.66$ km s$^{-1}$ Mpc$^{-1}$, $\Omega_m = 0.3111$, and $\Omega_\Lambda = 0.6889$. Each distribution was normalised by the number of galaxies in that category, and the dashed line shows the median value. Within this particular sample, the most noticeable result is that tidally-disturbed galaxies tend to be more intrinsically luminous, although this may be a result of galaxies being removed during our visual inspection if they were too faint or small to reliably classify as having tidal features.
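The size calculation described above can be illustrated with a short sketch. It assumes a flat ΛCDM model containing only matter and a cosmological constant (consistent with the quoted parameters summing to one) and integrates the comoving distance numerically; the function names and the Simpson's-rule integration are illustrative choices, not necessarily the procedure used for Figure 6.

```python
import math

# Cosmological parameters quoted in the text (flat LambdaCDM assumed).
H0 = 67.66            # km s^-1 Mpc^-1
OMEGA_M = 0.3111
OMEGA_L = 0.6889
C_KM_S = 299792.458   # speed of light in km/s


def angular_diameter_distance_mpc(z, n_steps=1000):
    """Angular diameter distance in Mpc (matter + Lambda only; n_steps must be even)."""
    if z <= 0:
        return 0.0

    def inv_e(zp):
        # 1 / E(z), where E(z) = H(z) / H0
        return 1.0 / math.sqrt(OMEGA_M * (1.0 + zp) ** 3 + OMEGA_L)

    # Composite Simpson's rule for the comoving-distance integral.
    h = z / n_steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    comoving = (C_KM_S / H0) * total * h / 3.0
    return comoving / (1.0 + z)


def physical_size_kpc(half_light_radius_arcsec, z):
    """Convert an angular radius in arcsec to a physical radius in kpc."""
    theta_rad = math.radians(half_light_radius_arcsec / 3600.0)
    return theta_rad * angular_diameter_distance_mpc(z) * 1000.0


if __name__ == "__main__":
    # Hypothetical galaxy: 5 arcsec Petrosian half-light radius at z = 0.1.
    print(f"{physical_size_kpc(5.0, 0.1):.2f} kpc")
```

Radiation and massive neutrinos are neglected here, which changes the distances only marginally at the low redshifts relevant to this sample.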

AUTOMATED CLASSIFICATION
The visual inspection process took $\sim77$ hours in total for just 1,928 galaxies. This was in part due to the need to identify faint and subtle structures, which may have increased the inspection time compared to other inspection problems (e.g., spiral vs. elliptical). Regardless, this was a significant amount of effort on a relatively small data set, especially compared to the volume of data expected from forthcoming surveys. Therefore, we used our set of visually inspected galaxies to provide a training set to develop an automated process for classifying tidal features, with the intention of applying the process to future data later. We chose to use a CNN for our approach.

Architecture
In essence, a CNN is a model comprising a series of different computations or operations, and how these are organised is known as the architecture. The architecture can be considered a series of connected nodes grouped into layers. All the nodes in a layer perform the same operation, each representing where a given operation occurs. The outputs from previous nodes are used as inputs to the next layer of nodes. In a typical CNN used for classification, the layers can be grouped into two parts: a feature extraction part and a classification part. The feature extraction part in a CNN consists of a series of convolutions with various kernels and biases. The goal of the feature extraction part is to generate a latent space (a multi-dimensional embedding) representation of the galaxies, specifically where similar galaxies cluster together within the space (e.g. Alzubaidi et al. 2021). The network's classification part takes the latent space representation of the image and applies a series of linear combinations with weights and biases to generate a prediction. Overall, the goal is for the network to optimise the parameters of the kernels, weights, and biases such that the output matches the target for the given task as well as possible.
We used a modified version of the network architecture used by W19 and subsequently adopted by DS23. The architecture is provided in Table 2, along with the number of trainable parameters associated with each layer, and is described in more detail below. Our aim for this work was not to provide a definitive solution to this problem but rather to demonstrate that it was possible to use CNNs to classify different kinds of tidal features, so we did not further optimise the hyperparameters of the network beyond that of W19. Instead, we focused on testing the applicability of the already optimised network to our slightly different task.
The extraction part of the network was composed of three convolutional blocks, each comprising a 2D convolution layer with a 3 × 3 kernel, a Rectified Linear Unit (ReLU) activation function, and a 2 × 2 shaped max pooling. The blocks consisted of 32, 48, and 64 nodes, respectively. The convolutional layer works by convolving each channel of the input image (e.g. survey bands) with a kernel. The resulting outputs are then summed over the number of input channels. This process is repeated for all m nodes of the convolutional layer. The network often comprises a series of convolutional layers that apply this process to transform the m nodes of the previous layer to the n of the next (e.g. Goodfellow et al. 2016). Pooling layers reduce the size and training time of the network by replacing values over a specified region with a summary value, such as the maximum or average (e.g. Alzubaidi et al. 2021). This also induces the network to be invariant to small translations in the input image (e.g. Goodfellow et al. 2016).
The classification part consisted of two fully connected (or dense) linear layers. Fully connected layers apply a linear combination of weights and biases to all of the outputs from the previous layer (e.g. Goodfellow et al. 2016). The first layer had 64 nodes and the other was a 4-node output layer, with one node for each category of tidal feature (this was a single node for the binary classification, see Section 5.2.1).
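To make the layer structure concrete, the sketch below walks through the output shape and the number of trainable parameters of each layer described above. Several details are assumptions rather than values taken from Table 2: a 256 × 256 input with three channels, unpadded ("valid") convolutions with unit stride, and the helper names themselves. The totals it returns therefore need not match Table 2.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output node for a 2D convolution."""
    return (kernel * kernel * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """Weights plus one bias per output node for a fully connected layer."""
    return (n_in + 1) * n_out


def summarise_network(input_size=256, input_channels=3, filters=(32, 48, 64),
                      kernel=3, pool=2, dense_nodes=64, n_outputs=4):
    """Return (layer name, output shape, trainable parameters) for each layer.

    Assumed, not taken from the paper: three input channels and unpadded
    convolutions with stride 1.
    """
    size, channels = input_size, input_channels
    rows = []
    for n_filters in filters:
        params = conv_params(kernel, channels, n_filters)
        size = size - kernel + 1          # unpadded convolution, stride 1
        rows.append(("conv+relu", (size, size, n_filters), params))
        size = size // pool               # 2 x 2 max pooling
        rows.append(("maxpool", (size, size, n_filters), 0))
        channels = n_filters
    flattened = size * size * channels
    rows.append(("dense", (dense_nodes,), dense_params(flattened, dense_nodes)))
    rows.append(("output", (n_outputs,), dense_params(dense_nodes, n_outputs)))
    return rows


if __name__ == "__main__":
    rows = summarise_network()
    for name, shape, params in rows:
        print(f"{name:10s} {str(shape):16s} {params}")
    print("total", sum(params for _, _, params in rows))
```

Under these assumptions almost all of the parameters sit in the first fully connected layer, which is why the input image size has a strong effect on the size of the model.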
During training, a portion of the data is reserved for validation; this validation data effectively serves as unseen data to test the generalisation of the network. When the metrics used to monitor training performance begin to diverge from those evaluated on the validation data, this can indicate that the network is overfitting the training data. We employed both dropout (which we applied between the two fully connected layers) and early stopping to prevent the network from overfitting to the training data, which would negatively impact the generalisation of the network to unseen (or testing) data. Dropout randomly removes neurons at each step during training with a specified rate (Srivastava et al. 2014), which we chose to be 50 per cent in line with W19. This prevented particular parts of the network from being overly crucial for a given classification and meant the network had to learn many independent features (Alzubaidi et al. 2021). Early stopping monitors the validation loss metric. It determines whether an improvement has been made by comparing the value at the current epoch with the minimum value obtained over all preceding epochs. After a specified number of epochs (the patience period) with no improvement, the network stops training and restores the network parameters that obtained the best value of the validation loss. We chose this patience period to be 40 epochs.
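The early stopping rule described above can be sketched in a few lines. This is a framework-independent illustration; the class name, the `state` argument standing in for the network parameters, and the toy loss values are all hypothetical.

```python
import copy


class EarlyStopping:
    """Stop training after `patience` epochs without a new minimum validation loss."""

    def __init__(self, patience=40):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss:
            # New minimum: remember the parameters so they can be restored later.
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


if __name__ == "__main__":
    # Toy validation losses with a short patience so the example stops quickly.
    stopper = EarlyStopping(patience=3)
    for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.95, 0.7]):
        if stopper.step(epoch, loss, state={"epoch": epoch}):
            break
    print(stopper.best_epoch, stopper.best_loss)
```

After the loop ends, `best_state` holds the parameters that would be restored, corresponding to the epoch with the lowest validation loss.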

Augmentation
We can see from Table 2 that there were $\sim10^6$ parameters that needed to be constrained through training. This meant we required a large and diverse training set to avoid overfitting the data. Unfortunately, we could not expand our training set beyond the galaxies we had already inspected for tidal features. However, data augmentation allowed us to artificially extend the training set (see, e.g. Goodfellow et al. 2016; Shorten & Khoshgoftaar 2019; Alzubaidi et al. 2021) without visually classifying further galaxies. The presence or absence of a feature was independent of transformations or augmentations of the input image, such that instances of the same image can be used repeatedly during training. W19 took this approach when they were training their network on the 305 galaxies inspected by Atkinson et al. (2013). We applied the following augmentations to the data randomly, noting a slight modification from W19 in the rotation augmentation:

(i) horizontal or vertical flip or both
(ii) rotation in the range $-\pi/2$ to $+\pi/2$
(iii) translation up to ±5 per cent along both axes
(iv) zoom in or out up to 10 per cent

After these augmentations, we downsized the images by rebinning them to 256 × 256 pixels. We found that resizing the images improved the network performance on the order of a few per cent and chose 256 × 256 as the optimal size.
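The augmentation steps listed above can be sketched as a random draw of transformation parameters followed by a block-average rebinning. The distributions are assumptions: the text gives only the ranges, so uniform draws, a 50 per cent flip probability, a rotation range of ±π/2, and a multiplicative zoom factor between 0.9 and 1.1 are illustrative choices, as are the function names.

```python
import math
import random


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters (assumed uniform)."""
    return {
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
        "rotation_rad": rng.uniform(-math.pi / 2, math.pi / 2),
        "shift_x_frac": rng.uniform(-0.05, 0.05),   # fraction of image width
        "shift_y_frac": rng.uniform(-0.05, 0.05),   # fraction of image height
        "zoom_factor": rng.uniform(0.9, 1.1),       # <1 zooms out, >1 zooms in
    }


def rebin(image, factor):
    """Downsize a square 2D image (list of lists) by averaging factor x factor blocks."""
    n = len(image) // factor
    return [
        [
            sum(image[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_augmentation(rng))
    print(rebin([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], 2))
```

Applying the sampled parameters to an image would normally be delegated to an image-processing or deep-learning library; only the parameter draw and the rebinning are shown here.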

Training Sample Construction
We used our 956 visually inspected galaxies with tidal features and those assumed not to have a tidal feature (9,981, see Table 1) to construct training sets using a five-fold cross-validation scheme (see, e.g. Goodfellow et al. 2016; W19). In this scheme, we randomly distributed galaxies into five groups; each was unique and roughly equally sized. The random distribution process was performed separately for each type to ensure that each group included both tidal and non-tidal galaxies. To create each fold, we selected one of the five groups to be the testing set and complemented this with a training set. The training set comprised all of the galaxies with tidal features in the other four groups and a random sampling of an equal number of those without, similarly originating from the other groups. Thus, every galaxy in our sample could be used as an unseen testing example to evaluate the performance of our network, and our training sets remained balanced between those with and without tidal features.

Table 3. Composition of each fold for training and testing the network. The galaxies with and without tidal features were divided into five groups. Each fold was constructed to include a testing set comprising one of the five groups and a training set made up of the other four groups. A random sampling process was used to ensure the network was trained on an equal number of galaxies with and without tidal features. Each fold is unique and used to test the network on unseen data, allowing all of the galaxies to be used in testing the network.

The number of galaxies with each type of feature in each fold is presented in Table 3.
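The fold construction described in this section can be sketched as follows. The function names, the random seed, and the representation of galaxies as simple identifiers are hypothetical; the sketch only mirrors the stated logic (separate random grouping of tidal and non-tidal galaxies, one group held out for testing, and a balanced training set drawn from the remaining four groups).

```python
import random


def split_into_groups(items, n_groups, rng):
    """Shuffle the items and deal them into n_groups roughly equal groups."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def build_folds(tidal, non_tidal, n_folds=5, seed=0):
    """Return a list of folds, each a dict with 'train' and 'test' lists."""
    rng = random.Random(seed)
    tidal_groups = split_into_groups(tidal, n_folds, rng)
    non_tidal_groups = split_into_groups(non_tidal, n_folds, rng)
    folds = []
    for k in range(n_folds):
        # The k-th group of each type forms the (imbalanced) testing set.
        test = tidal_groups[k] + non_tidal_groups[k]
        train_tidal = [g for i, grp in enumerate(tidal_groups) if i != k for g in grp]
        pool = [g for i, grp in enumerate(non_tidal_groups) if i != k for g in grp]
        # Balance the training set with an equal number of non-tidal galaxies.
        train_non_tidal = rng.sample(pool, len(train_tidal))
        folds.append({"train": train_tidal + train_non_tidal, "test": test})
    return folds


if __name__ == "__main__":
    tidal_ids = [("tidal", i) for i in range(956)]
    non_tidal_ids = [("none", i) for i in range(9981)]
    for fold in build_folds(tidal_ids, non_tidal_ids):
        print(len(fold["train"]), len(fold["test"]))
```

Because the testing groups are disjoint and together cover the whole sample, every galaxy receives exactly one prediction from a network that never saw it during training.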

Multi-Label Results
As with the other works on the subject (e.g. W19; DS23; DBL24), we employ the true positive rate (TPR) or recall, false positive rate (FPR), and area under the curve (AUC) metrics as well as precision, $F_1$, and accuracy to measure the performance of our classifier. Each metric compares some combination of the number of true positives, true negatives, false positives, and false negatives, abbreviated as TP, TN, FP, and FN, respectively. The predictions from the network were separated into positive and negative based on some predictive threshold, and a binary label was assigned for each class based on whether the prediction was greater or lesser than the threshold. These were then compared to labels assigned during the visual inspection to determine whether the predictions were true or false. The FPR, FP/(FP + TN), measures contamination as the probability that an actual negative example is predicted as positive. Conversely, the TPR or recall, TP/(TP + FN), provides the probability that actual positive examples will be predicted as positive. The AUC score measures the area underneath the curve created by the FPR and TPR values for each value of the threshold. An AUC of 0 means that the predictions were completely wrong, whereas an AUC of 1 means that the predictions were entirely correct. The precision metric

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad (5)$$

measures the fraction of predicted positives that were correctly predicted. The $F_1$ score is the harmonic mean of the precision and recall metrics. The accuracy metric has its usual meaning as the total fraction of correct predictions.
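As an illustration of these definitions, the sketch below computes the metrics for a single class from a list of network outputs and binary labels, using the standard forms TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The function names and the toy inputs are hypothetical, and the AUC is obtained with a simple trapezoidal rule over all distinct thresholds.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN, treating scores above the threshold as positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(scores, labels, threshold):
    """Return TPR (recall), FPR, precision, F1, and accuracy at one threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f1": f1, "accuracy": accuracy}


def roc_auc(scores, labels):
    """Area under the ROC curve via the trapezoidal rule."""
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for threshold in thresholds:
        m = classification_metrics(scores, labels, threshold)
        points.append((m["fpr"], m["tpr"]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    toy_labels = [1, 1, 0, 0]
    print(classification_metrics(toy_scores, toy_labels, 0.5))
    print(roc_auc(toy_scores, toy_labels))
```

The $F_1$ expression used here, 2TP/(2TP + FP + FN), is algebraically the same as the harmonic mean of precision and recall.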
Each of the folds we created was used to train the network individually. The training set of that fold was input to the training algorithm, with the data split 9:1 for training and validation, respectively. At each step, the images created by W19 were loaded into the network and randomly augmented. We applied a weighting to the training for each class to compensate for the imbalance between the different classes. The weighting was based on the total number of training samples in the set divided by the product of the number of classes and the number of examples for that class. This prioritised training for the higher-weighted classes to compensate for their smaller numbers. Table 3 provides the mean training weights across the folds. Once the network had been trained, it was used to generate predictions for the testing set of that fold. All memory of the network was erased, and the process was repeated for the next fold. The final result was that each galaxy in our sample had received a prediction from a network that had not previously seen that galaxy. We then combined these predictions to evaluate the performance of the network.

This whole process was repeated for ten trial attempts at training and testing to estimate the error and stochasticity of the performance metrics. Hence, for each galaxy, there were ten unique predictions. We evaluated the performance of the network independently for each of these attempts and then averaged over those performances. As mentioned, each trial network was trained for between 40 and 300 epochs, with the median number of epochs being $196.0^{+97.5}_{-48.5}$. We provide the training and validation loss metric across all five folds and ten attempts in Figure A1 in Appendix A and verify that no overfitting occurs.
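The class weighting described above (total number of training samples divided by the product of the number of classes and the number of examples in that class) can be written as a short function. The function name is hypothetical and the counts in the example are purely illustrative; they are not the values from Table 3.

```python
def class_weights(n_samples, class_counts):
    """Weight for each class: n_samples / (n_classes * n_examples_in_class)."""
    n_classes = len(class_counts)
    return {
        name: n_samples / (n_classes * count)
        for name, count in class_counts.items()
    }


if __name__ == "__main__":
    # Illustrative counts only; rarer classes receive larger weights.
    example_counts = {"arm": 250, "stream": 250, "shell": 125, "diffuse": 500}
    print(class_weights(n_samples=1000, class_counts=example_counts))
```

With this form, a class containing exactly `n_samples / n_classes` examples receives a weight of one, and the weight scales inversely with the number of examples.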
Figure 7 shows median receiver operating characteristic (ROC) curves evaluated by combining the testing set predictions and labels from each fold and averaging over ten separate trial runs. As the testing sets had roughly one galaxy with a tidal feature for every 10 without, reflective of the true population at this surface brightness depth, the ROC curves were determined from an imbalanced set. The shaded regions in the Figure provide the 68.2 per cent confidence intervals.
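One way to obtain a median with an asymmetric 68.2 per cent interval from a set of trial runs is to take the 15.9th, 50th, and 84.1st percentiles of the per-trial values. The sketch below does this with linear interpolation; this estimator, the function names, and the toy values are assumptions, as the text does not state exactly how the intervals were computed.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * q / 100.0
    lower = int(math.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def median_with_interval(values, level=68.2):
    """Return (median, upper error, lower error) for the central `level` per cent."""
    tail = (100.0 - level) / 2.0
    median = percentile(values, 50.0)
    upper = percentile(values, 100.0 - tail) - median
    lower = median - percentile(values, tail)
    return median, upper, lower


if __name__ == "__main__":
    # Hypothetical TPR values from ten trial runs.
    trial_values = [0.96, 0.97, 0.97, 0.98, 0.97, 0.96, 0.98, 0.97, 0.99, 0.97]
    print(median_with_interval(trial_values))
```

For a ROC curve, the same calculation would be applied to the TPR values of the trials at each FPR to build the median curve and its shaded band.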
Figure 7. The median was calculated by combining the predictions for the testing sets across all five folds and averaging over ten independent trial runs of the network, and the shaded regions indicate the 68.2 per cent confidence interval. The ROC curve indicates how the fraction of correctly identified galaxies (TPR) and the fraction of incorrectly identified galaxies (FPR) change with respect to the predictive threshold, that being the value above which the output of the classifier would be taken to indicate a prediction that the feature was present. The macro-average averages each of the classes' ROC curves, and the micro-average treats each label independently.

The ROC curve shows the relationship between the FPR ($x$-axis) and TPR ($y$-axis) for different predictive thresholds, where lower thresholds result in more completeness but also more contamination. Each class was treated independently because they were non-exclusive, as a galaxy could have more than one feature of different classes. Thus, the ROC curves and other metrics were determined by considering the predictions and labels for each class in a binary fashion (e.g. by comparing only the yes/no label for arms to the predicted value for arms). The exception is for micro-averaged values, which compare every prediction to its label, regardless of the class. The macro-average instead averages over the metrics evaluated on each class, where each class was weighted the same. Figure 7(a) shows the ROC curve for each class, where each class is treated independently, and (b) provides the micro- and macro-averages. It is very encouraging to see that the ROC curves are well separated from the random guessing line, which corresponds to the TPR and FPR being equal across all thresholds.
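The difference between the two averaging schemes can be seen in a small example. The sketch below computes the recall per class, the macro-average (the unweighted mean over classes), and the micro-average (pooling every prediction and label regardless of class). The function names and toy inputs are hypothetical.

```python
def recall(scores, labels, threshold):
    """Fraction of actual positives with a score above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y]
    if not positives:
        return 0.0
    return sum(1 for s in positives if s > threshold) / len(positives)


def macro_micro_recall(scores_by_class, labels_by_class, threshold):
    """Return (per-class recall, macro-average, micro-average)."""
    per_class = {
        name: recall(scores_by_class[name], labels_by_class[name], threshold)
        for name in scores_by_class
    }
    # Macro: every class counts equally, however many examples it has.
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool all predictions and labels, so every label counts equally.
    pooled_scores = [s for name in scores_by_class for s in scores_by_class[name]]
    pooled_labels = [y for name in scores_by_class for y in labels_by_class[name]]
    micro = recall(pooled_scores, pooled_labels, threshold)
    return per_class, macro, micro


if __name__ == "__main__":
    toy_scores = {"shell": [0.9, 0.2], "diffuse": [0.8, 0.7, 0.6, 0.1]}
    toy_labels = {"shell": [1, 1], "diffuse": [1, 1, 1, 0]}
    print(macro_micro_recall(toy_scores, toy_labels, threshold=0.5))
```

When the classes differ greatly in size, as they do here between the shell and diffuse classes, the micro-average is dominated by the more populous classes whereas the macro-average is not.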
Although the performance is generally excellent across all classes, some small variations can be seen. In particular, shell features were recovered at a lower rate than other classes, with the network obtaining a median TPR of $0.970 \pm 0.008$ for a fiducial contamination of 20 per cent (i.e. at an FPR of 0.2). In contrast, diffuse features were recovered well with a median TPR of $0.994^{+0.002}_{-0.006}$ for the same degree of contamination. Table 4 provides the median TPR values for each class and the micro- and macro-averages at the reference 20 per cent contamination.

DISCUSSION
Our results are the first demonstration of using deep learning to classify different types of tidal features. This presents an excellent opportunity for forthcoming imaging surveys such as LSST (Ivezić et al. 2019) and Euclid (Laureijs et al. 2011) to investigate the occurrences of the different species of tidal feature and thus, hopefully, shed light on the formation mechanisms and provide substantial data for population studies. As our primary focus for this work was to demonstrate the application of CNNs to classifying different categories of tidal features, we leave investigations of the properties and populations of these features to subsequent works. Although we observed a very good performance for our network from the ROC curves, we note that there are issues and some improvements needed, which are discussed below.
One of the first issues we noted was that the output predictions did not always reach a maximum of 1. If the network's output is considered as a probability that a feature is present, then the network is not confident that a feature is there. We expect the output values to match the labels as best as possible, so they should be 1 where the label is 1 (i.e., the binary indicator for that feature being present). However, this aspect of the performance is not captured in the ROC curves seen in Figure 7. The ROC curves were determined by selecting values of the predictive threshold between 0 and the maximum output such that the ROC curve starts at (0, 0) and ends at (1, 1). To gain further insight into the performance of the network, we provide the distribution of output values in Figure 8 split between those with each type of feature (blue) and those without (red). Also shown in Figure 8 is the median and 68.2 per cent confidence interval of the maximum output prediction from the network. As mentioned, this does not always reach the expected value of 1 and varies depending on the class, with streams having the lowest maximum output of $0.552^{+0.045}_{-0.067}$ and shells the highest at $0.989^{+0.006}_{-0.019}$.
Table 4. Median performance metrics and 68.2 per cent confidence intervals. The median was determined by combining the predictions for each fold and averaging over ten trial runs of the network. The metrics are shown for each category of tidal feature, as well as the micro- (equal galaxy weighted average) and macro-averages (equal class weighted average) and the binary classification (see Section 5.2.1). The TPR was evaluated at a fiducial FPR of 0.2. The accuracy, precision, recall, and $F_1$ score metrics were evaluated at an optimal threshold, which was selected to maximise the $F_1$ score. The optimal threshold for the binary classification was taken to be 0.5 rather than the maximising value.

Figure 8. The histogram is split based on whether the truth label for that class is positive (red) or negative (blue). Each value shows the median fraction of positive and negative galaxies with predictions in that bin, and the error bar shows $\pm1\sigma$. The black dashed line and shaded region show the median $\pm1\sigma$ maximum output prediction across all ten trials.

Variations in class performance
The performance of our network varied across the features being detected; additionally, this variation was different when considering different metrics. For example, looking at the AUC scores in Figure 7(a), it would appear that the stream class is marginally poorer in terms of performance compared to the others. However, from the TPR values at a fiducial FPR of 0.2 in Table 4, it would appear that the shell class had a slightly poorer performance than the others. In both AUC and TPR(FPR = 0.2) terms, the diffuse class appears to be the best-performing; however, the significance is $< 3\sigma$.
In Figure 9 we present the accuracy (a), precision (b), recall (c), and $F_1$ (d) metrics for each class evaluated at a range of predictive thresholds in the range [0, 1] and averaged across the ten trials, along with the 68.2 per cent confidence intervals. Shown too are the micro- and macro-averages of these metrics. We note that results with a threshold beyond the maximum output predictions (shown in Figure 8) are meaningless, and hence the metrics are truncated at the maximum output for their respective class. We estimated an optimal threshold that maximised the $F_1$ score and show each of the metrics evaluated at the respective optimal threshold as points on the Figure with the associated error. These values are also provided in Table 4. We again observed that the diffuse class appeared to be the best-performing in terms of precision, recall, and $F_1$. Interestingly, on average across all three of these metrics, the network performance correlates with the number of galaxies in that class. Indeed, the diffuse class was the best-performing and had the most galaxies (517), and similarly the shell class was the worst-performing and had the fewest (67).
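Selecting the threshold that maximises the $F_1$ score can be sketched as a simple grid search that stops at the maximum output of the network, since thresholds beyond it produce no positive predictions. The grid spacing, function names, and toy inputs are assumptions made for illustration.

```python
def f1_score(scores, labels, threshold):
    """F1 = 2TP / (2TP + FP + FN), treating scores above the threshold as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def optimal_threshold(scores, labels, n_grid=101):
    """Grid-search thresholds in [0, 1], truncated at the maximum output."""
    max_output = max(scores)
    best_threshold, best_f1 = 0.0, -1.0
    for i in range(n_grid):
        threshold = i / (n_grid - 1)
        if threshold >= max_output:
            break  # no prediction exceeds this threshold, so stop here
        f1 = f1_score(scores, labels, threshold)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    toy_labels = [1, 1, 0, 0]
    print(optimal_threshold(toy_scores, toy_labels))
```

When several thresholds give the same $F_1$ score, this sketch keeps the lowest of them; other tie-breaking rules are equally valid.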
Conversely, the performance appeared to be roughly anticorrelated with the sample size when considering instead the accuracy metric. This likely resulted from the high fraction of galaxies without tidal features in the testing sample, most of which were then classified correctly. Although the network was trained on a 1:1 ratio, galaxies without tidal features outnumbered those with tidal features at a rate of around 10:1 in the testing set. We observed that $71.5^{+6.0}_{-6.3}$, $82.8^{+4.2}_{-5.5}$, $66.7^{+8.8}_{-11.6}$, and $60.5^{+3.3}_{-7.8}$ per cent of the false positives for the arm, stream, shell, and diffuse classes, respectively, had some other type of tidal feature. Together with the high accuracy scores, this indicates that the network was confident at a binary classification (i.e. distinguishing where some kind of tidal feature is present or not; this is explored further in Section 5.2.1) but sometimes struggled to determine the type of feature, leading to the low precision scores (at low thresholds, see Figure 9(b)).
We investigated whether the performance of the network had any dependence on the angular size or apparent brightness of the galaxies. We observed that the distribution of false positives was slightly skewed towards larger angular sizes than the distribution of galaxies without tidal features. This small offset could have various origins, such as a bias in the network or in the visual inspection process that created the labels. On the other hand, the effect could reflect a genuine feature, with tidally-disturbed galaxies in our sample being somewhat larger than the overall population. Further exploration of this offset is beyond the scope of the current work.
Figure 9. The metrics were determined by combining the predictions for every galaxy across all five folds and splitting these into positive and negative based on the predictive threshold. These were then compared to the ground truth labels assigned to the galaxy for each of the given classes. The figure shows the median of ten attempts with the shaded area providing the 68.2 per cent confidence interval. Filled circles provide the value of each metric at the optimal threshold, as provided in Table 4.

Furthermore, we consider that the visual inspection labels may have introduced some noise. We often found it difficult during the visual inspection process to separate instances of the two classes as the defining features of the classes are very similar, often with the inspector having to make some estimation of the likely origin of the feature. In particular, inspectors frequently disagreed on whether a feature was a stream or arm. This suggests that these classes could be combined or the visual inspection process improved to ensure a clearer distinction between the classes, such as more explicit instructions on the differences. We suspect that the network may have been susceptible to this uncertainty in the actual label. We found $25.3^{+3.9}_{-2.5}$ per cent of the false arm positives had streams, and similarly $27.0^{+1.3}_{-2.1}$ per cent of the stream false positives had arms. Although these are significant, to put that in context, over fifty per cent of the false positives for the arm and stream classes included diffuse features. We investigated the impact of improving the classification scheme to reduce label noise, and although limited in scope, there was no significant improvement made given the time cost for an entire reclassification of the whole sample (see Appendix B). Finally, we used our network to generate predictions for the uncertain galaxies we excluded from the training sets (see Section 3.3). As expected, these galaxies generated a broad distribution of predictions, both predicted to have and not have tidal features, reinforcing our strategy of excluding them from training.

Comparison to the Literature
In this Section, we compare our results to those from the literature. First, we default back to comparing the overall binary classification of our sample to match the work already undertaken on other samples. We then compare our multi-label results to these binary classifications of the literature. Finally, we identify whether streams in the DECaLS area identified by Martínez-Delgado et al. (2023) were identified by our network.

Binary Results
To begin the comparison of our work to that of W19, DS23, and DBL24, we performed a binary classification; that is, we considered only whether a tidal feature was present on the image, regardless of what that tidal feature was. We constructed a binary training set by taking the five folds created earlier (see Section 4.3) and reassigning a positive label to galaxies with any kind of tidal feature. Similarly, those galaxies without tidal features were given a negative label. As with the multi-label scenario, the training sets had a roughly equal number of galaxies with and without tidal features, whereas the testing sets reflected a more realistic number of each.
As before, we preprocessed the data by applying random augmentations and resizing the images to be 256 × 256. We trained each fold with ten different trial runs of the network for a maximum of 300 and a minimum of 40 epochs, with the median being $226.5^{+44.1}_{-38.1}$. The network was trained using the training portion of the fold, and the network's performance was evaluated on the unseen testing set.
Figure 10 shows the median ROC curve for the binary classification; again, the shaded area shows the 68.2 per cent confidence interval. It is evident from the Figure that our classifier has an outstanding performance, reaching an AUC score of $99.6 \pm 0.1$ per cent. ROC curves from W19, DS23, and DBL24 are also shown in the Figure. We also include the accuracy, precision, recall, and $F_1$ metrics in Figure 9 and Table 4. The optimal threshold for the binary classification was taken to be 0.5 rather than the value that maximises the $F_1$ score. Considering the performances in these metrics, the network is clearly excellent at distinguishing galaxies with and without tidal features.
However, we note that our binary sample is likely to be a highly idealised scenario. The sample has already been split according to a binary classification schema during the sample selection process, where we selected only galaxies where the W22 classifier indicated that there may be a potential tidal feature. Thus, we expect images where the network may struggle to predict whether a tidal feature is present or not to have been excluded from the binary classification set and that the performance on unbiased data could potentially be worse. Our multi-label set represents a new classification schema, so we expect this problem to be a lesser issue as those images have not been selected as good or bad at identifying specific categories of tidal features.

Multi-Label Comparison
Figure 11 presents the median TPR values for each category of tidal feature, the macro- and micro-averages, and the binary classification (from Section 5.2.1) for an FPR of 0.2, as well as the equivalent for W19, DS23, and DBL24. We note that the W19 values were reported at an FPR of 0.22 and the DS23 values were estimated by reading the appropriate figure. We see in Figure 11 that, in terms of the TPR at an FPR of 0.2, the performance of our networks at identifying the tidal feature classes was comparable to the network trained by DBL24.

Figure 10. Median ROC curve for the binary classification with comparative literature results. The binary classification considers only positive and negative incidences of any tidal feature, unlike the multi-label classification, which aims to identify the type of tidal feature. Shaded regions indicate the 68.2 per cent confidence interval; again, the median was determined over ten network trials. We include both the single and best-performing ensemble classifiers from W19, both shallow and original samples from DS23, and the best-performing classifier and our determination of the median of other classifiers from DBL24.
It is important to recall, however, that the detection of tidal features depends heavily on the limiting surface brightness magnitude of the survey (e.g. Johnston et al. 2008; Vera-Casanova et al. 2022). Thus, we expect that the tidal features are more clearly visible and delineated in deeper data, and as such, comparing network performances on varying depth data should take this into account. Indeed, the expectation that networks trained using deeper data will have better performance is consistent with the result of our comparison when considering the limiting surface brightness ($\sim 28.8$, this work; $\sim 27.1$, W19; $26 - 35$, DS23; $\sim 29.82$, DBL24; all in mag arcsec$^{-2}$).

Stellar Stream Legacy Survey
Martínez-Delgado et al. (2023) introduced the Stellar Stream Legacy Survey to identify tidal features around galaxies in the local universe. They used data from the DESI Legacy Imaging Survey DR8, which includes DECaLS, as well as the Beijing-Arizona Sky Survey and Mayall z-band Legacy Survey (BASS and MzLS, respectively; Dey et al. 2019). They re-reduced the raw Legacy Surveys data using a bespoke sky subtraction algorithm (see Martínez-Delgado et al. 2023, for details) and then visually inspected a sample of 689 galaxies. From this, they selected 24 galaxies with tidal disturbances (explicitly chosen to cover a range of morphologies, surface brightnesses, and distances) and used noisechisel (Akhlaghi & Ichikawa 2015) to confirm the detections. It is important to note that Martínez-Delgado et al. (2023) focus on identifying any type of tidal disturbance that originates from the accretion of a low mass dwarf galaxy and that they do not attempt to separate their sample into different tidal feature classes.
We investigated the overlap between the 24 galaxies they identified and our sample. Of the 24 galaxies in their sample, we find only 18 of these were imaged in DECaLS DR5. Furthermore, 4 of these galaxies were not imaged with SDSS and, as such, were excluded from our sample due to the W22 selection criteria. None of the 14 remaining galaxies identified by Martínez-Delgado et al. (2023) were included in the galaxies we visually inspected because they did not have sufficiently high tidal disturbance scores from W22. As a reminder, our analysis focused on systems that had a major disturbance prediction from the GZD automated classifier that was greater than 0.4, whereas these 14 systems had predictions somewhere between 0.08 and 0.4. The reason why they did not score higher is beyond the scope of this work but could plausibly be due to genuine uncertainty in the volunteer classification, the size of the thumbnails presented to volunteers (e.g., too small to include the tidal feature, which in several cases lies at a large distance from the main galaxy), or inaccuracies in the classifier.
Although none of these 14 galaxies were inspected by us, they had high enough prediction scores to be excluded from our assumed undisturbed sample. Thus, we could at least use our network to generate predictions for each of these galaxies to discover if our classifier agrees that they are tidally disturbed. Encouragingly, all but two of the galaxies were identified by our network to be disturbed in some manner. This is quite a remarkable success given that our network operates on the standard imaging pipeline output without any advanced or bespoke processing to enhance the appearance of low surface brightness features beyond that of W22. We have provided a full breakdown of the predictions for each of the 14 galaxies in Table C1 in Appendix C. In most cases, the feature class predicted by our network matches reasonably well with the visual appearance of these galaxies. For the two cases where our network predicted no tidal features yet one is visible, we note that these particular features are very faint and of fairly low significance according to Table 2 of Martínez-Delgado et al. (2023). The ability of our network to recover most of the Martínez-Delgado et al. (2023) galaxies as tidally disturbed without the need for advanced image processing is very promising.

Grad-CAM Analysis
CNNs are prone to shortcut learning, where they learn spurious connections between parts of the data and the desired output (see, e.g. Geirhos et al. 2020). This is an important issue, particularly for the generalisation of the network to unseen data, and verifying that it has not affected the network should be a crucial part of any model development process. It is, therefore, vital to establish whether or not the network does indeed use the correct information to produce a prediction. In the context of this work, we must verify that, for a given image, the network uses the galaxy and surrounding tidal features to arrive at the prediction for that class.
We verified that the network was classifying the images as intended by performing a Gradient-weighted Class Activation Mapping (Grad-CAM; Selvaraju et al. 2017) analysis. In general, the analysis determines the gradient of a class score with respect to the feature map (the latent space representation) in the final convolutional layer. Then, it averages over this to assign an importance value to a particular neuron. The final heat map showing the regions of importance is generated by taking a linear combination of the importance values.
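The steps described above can be illustrated with a minimal NumPy sketch. It assumes that the feature maps of the final convolutional layer and the gradients of the class score with respect to them have already been extracted from the network (in practice these come from the automatic differentiation of the deep learning framework in use). The function name `grad_cam`, the array shapes, and the toy inputs are illustrative assumptions and are not taken from the code used in this work; the final rectification step follows the original formulation of Selvaraju et al. (2017).

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Return a normalised Grad-CAM heat map.

    feature_maps: activations of the final convolutional layer, shape (H, W, K).
    gradients: d(class score)/d(feature_maps), same shape (H, W, K).
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Average the gradients over the spatial axes: one importance value per channel.
    weights = gradients.mean(axis=(0, 1))  # shape (K,)
    # Linear combination of the feature maps, weighted by their importance values.
    heat_map = np.tensordot(feature_maps, weights, axes=([2], [0]))  # shape (H, W)
    # ReLU, as in Selvaraju et al. (2017): keep only positive evidence for the class.
    heat_map = np.maximum(heat_map, 0.0)
    peak = heat_map.max()
    return heat_map / peak if peak > 0 else heat_map


# Toy example (hypothetical values): two 2x2 feature maps.
maps = np.zeros((2, 2, 2))
maps[0, 0, 0] = 1.0   # channel 0 fires in the top-left corner
maps[1, 1, 1] = 1.0   # channel 1 fires in the bottom-right corner
grads = np.zeros((2, 2, 2))
grads[..., 0] = 2.0   # the class score increases with channel 0
grads[..., 1] = -1.0  # the class score decreases with channel 1
print(grad_cam(maps, grads))
```

In this toy case only the top-left pixel is highlighted, because the channel that fires in the bottom-right corner lowers the class score and is removed by the rectification. For display, the low-resolution heat map would normally be upsampled to the size of the input image and overlaid on it.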
We selected the two best-performing models to analyse out of the ten attempts and five folds. These were the model with the lowest final validation loss and the model with the highest AUC score. Figure 12 provides some examples of the original images of the galaxies (leftmost column) and a version of the image scaled using Equation 1 (middle-left) alongside the Grad-CAM heat maps for the two models tested (two right-hand columns). Red regions on the heat map indicate areas of high importance, yellow and green regions lesser importance, and blue regions little importance.
Figure 12 demonstrates, reassuringly, that the model identified the disturbed regions in the manner intended in almost all cases. In the case of shell galaxies, there is some evidence that the network also identifies background galaxies and stars as regions of importance. This effect might be attributed to the outer regions of these objects appearing fuzzy and faint due to limited resolution and seeing, thus resembling a diffuse feature. Masking background sources and bright stars (which we have not done in the present analysis) may remove this confusion, but it would come at the cost of significantly increasing the preprocessing required, which would add time and effort. Indeed, masking large samples would need to be done in an automated fashion, and there is a risk that such a procedure may perturb the appearance of the tidal features themselves. Still, we emphasise the need to strike a balance between masking unimportant background sources and maintaining the faint structure of the feature. Overall, we are very satisfied that the network appears to identify tidal features in the same manner that an expert would.

CONCLUSIONS
Tidal features are a crucial probe of the hierarchical formation of galaxies, and these features contain a wealth of information that could advance our understanding of galaxy evolution. Forthcoming wide-field surveys, such as LSST and Euclid, will provide ample data to study these features at much greater limiting surface brightness depths than before. However, the volume of data from these surveys will vastly outpace the current practice of classifying tidal features using visual inspection.
In this paper, we presented our attempts to counter this problem and demonstrate that it is possible to apply machine learning to classify faint tidal features. We used images of galaxies from the Dark Energy Camera Legacy Survey (DECaLS) to test if a Convolutional Neural Network (CNN) could classify different categories of tidal features. The images were compiled by Walmsley et al. (2022) and used to train an algorithm to reproduce volunteers' responses to questions designed to extract various morphological characteristics of the galaxy. We imposed several selection criteria using these predictions to generate a sample of 1,928 galaxies likely to have tidal features.
We visually inspected each galaxy in the sample to identify what tidal features were present, placing them into the non-exclusive categories of arm, stream, shell, diffuse, and uncertain. Of the 1,928 galaxies, 316 showed evidence of an arm feature, 283 a stream, 67 a shell, and 517 a diffuse feature. Furthermore, 219 galaxies showed more than one distinct type of feature. We note that a significant proportion of the galaxies could not reliably be said to have a tidal feature (labelled uncertain) due to the potential for confusion with intrinsically irregular galaxies; we excluded these from any further training set. Overall, our sample represents one of the largest samples of galaxies with faint tidal features constructed to date.
Using this labelled set, we trained a CNN to predict what tidal features were present around each galaxy. In general, the performance was very good but depended on the feature in question, with the network appearing to perform best for diffuse features overall. At a level of 20 per cent contamination, our classifier retrieved a median $98.7\pm0.3$, $99.1\pm0.5$, $97.0\pm0.8$, and $99.4^{+0.2}_{-0.6}$ per cent of the true arm, stream, shell, and diffuse debris features respectively. These results are comparable to or better than other works in the literature that use machine learning to simply identify if any tidal feature is present (i.e. binary detection, Walmsley et al. 2019; Domínguez Sánchez et al. 2023; Desmons et al. 2024). Furthermore, we investigated if our network was able to recover galaxies with tidal features in the overlap between our galaxy sample and that of the Stellar Stream Legacy Survey (Martínez-Delgado et al. 2023). All but two of the galaxies in the overlap were identified by our network to be disturbed in some manner without the need for any advanced or bespoke processing.
We then used a Grad-CAM analysis to verify that the network was indeed classifying the images as expected and that there were no spurious connections in the data. Overall, our results provide a compelling demonstration of the potential of CNNs to classify different categories of tidal features around galaxies, although the problem remains a difficult one and this work is by no means a definitive solution. With deeper surveys and further improvements to the classification process, we aim to build up a large sample of galaxies with tidal features and use them to quantitatively study the galaxy assembly process.

ACKNOWLEDGEMENTS
This is a pre-copyedited, author-produced PDF of an article accepted for publication in Monthly Notices of the Royal Astronomical Society following peer review. The version of record is available online at: https://academic.oup.com/mnras/advance-article/doi/10.1093/mnras/stae2169/7760393. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.
AJG acknowledges receipt of an STFC PhD studentship. AMNF is grateful for support from the UK STFC via grant ST/Y001281/1.
We thank Alice Desmons, Helena Domínguez Sánchez, Mike Walmsley, and Sarah Brough for their insightful and supportive discussions on the topic and for providing their data for our comparisons. Furthermore, we thank Sohan Seth for useful input on this work during its early stages.
This project uses data from the Dark Energy Camera Legacy Survey. The Legacy Surveys consist of three individual and complementary projects: the Dark Energy Camera Legacy Survey (DECaLS; Proposal ID #2014B-0404; PIs: David Schlegel and Arjun Dey), the Beijing-Arizona Sky Survey (BASS; NOAO Prop. ID #2015A-0801; PIs: Zhou Xu and Xiaohui Fan), and the Mayall z-band Legacy Survey (MzLS; Prop. ID #2016A-0453; PI: Arjun Dey). DECaLS, BASS and MzLS together include data obtained, respectively, at the Blanco telescope, Cerro Tololo Inter-American Observatory, NSF's NOIRLab; the Bok telescope, Steward Observatory, University of Arizona; and the Mayall telescope, Kitt Peak National Observatory, NOIRLab. Pipeline processing and analyses of the data were supported by NOIRLab and the Lawrence Berkeley National Laboratory (LBNL). The Legacy Surveys project is honored to be permitted to conduct astronomical research on Iolkam Du'ag (Kitt Peak), a mountain with particular significance to the Tohono O'odham Nation.
NOIRLab is operated by the Association of Universities for Research in Astronomy (AURA) under a cooperative agreement with the National Science Foundation. LBNL is managed by the Regents of the University of California under contract to the U.S. Department of Energy.
This project used data obtained with the Dark Energy Camera (DECam), which was constructed by the Dark Energy Survey (DES) collaboration. Funding for the DES Projects has been provided by the U.S. Department of Energy, the U.S.

was reserved as validation data, which was used to test the network's performance at every epoch. The loss metric records how well output predictions from the network match the ground truth label. We used the binary cross-entropy loss commonly used for non-exclusive multi-label classifications. The loss

APPENDIX B: IMPROVING CLASSIFICATION
During the visual inspection process, reaching a complete consensus on the type of tidal feature was often challenging due to different interpretations of the instructions and characteristics that defined each feature or due to inconsistencies in individual inspectors. We found considerable noise when considering the classifications for a particular galaxy; all three authors agreed on every feature in only 151 galaxies (15.8 per cent of the sample with tidal features). As the network was trained to recreate our labels, which we assumed to be the ground truth, we hypothesised that if our classifications could be improved and the noise reduced, the network performance might improve. To that end, we investigated the impact of reclassifying some galaxies, with more precise instructions and enhanced images, on the consensus between authors and how this, in turn, impacted CNN performance.

B1 Reinspection
We improved the classification process by updating and clarifying the instructions in our decision tree. We included an initial question asking if any feature was present; if the inspector answered yes, they were then asked to identify which features. The choice of features was the same as in the original inspection; however, we re-specified the characteristics of each feature and allowed the inspector to select whether they were confident or less confident that the feature was present.
We randomly selected 50 galaxies from our training sample to test the new classification decision tree. As before, each galaxy was given a label based on the responses from the three authors. Of the 50 galaxies, 36 per cent had no change to their label, 44 per cent had a minor change$^{7}$, 12 per cent had a significant change$^{8}$, and 4 per cent had a critical change where the label was completely different from the original.
We found overall that the reclassification led to no significant improvement in the agreement between the authors, with the same number of galaxies retaining or losing labels as those that gained labels. However, we did find an impact on the confidence of the labels. We based the confidence of the new label on the confidence options provided, i.e. confident or less confident that the feature was present. The confidence of the old label was difficult to establish but was based on the agreement between authors and whether the galaxy received votes for uncertain. Overall, fewer labels gained confidence than lost it.

B2 Retraining
We tested the impact that the reclassifications had on the performance of the CNN. For the original and new classifications of the 50 galaxies, we trained five trial runs of the network and compared the median ROC curves for these two sets. For this experiment, rather than generate folds, we randomly selected 20 per cent of the data to be testing data at the point of loading into the network. The network's performance was then evaluated on this testing data set and averaged over the five trials. As this is such a small number of galaxies, we expect the performance to be relatively poor, so we also randomly sampled 20 groups of 50 galaxies, using the original classifications, to have a baseline comparison to the network performance at that level. Figure B1 shows the median ROC curves for the original classifications, the new classifications, and the random sampling with the respective 68.2 per cent confidence interval shaded. We see evidence of a slight deterioration in the ROC curve when moving from the original classifications to the new classifications. However, the median AUC scores are within 1σ of each other and there is no improvement over the randomly sampled baseline, with both the new and original AUC scores lying well within 1σ of the baseline. Instead of improving classifications, we suggest that other strategies, such as masking background sources, further hyper-parameter optimisation, deeper data, or transfer learning with pre-trained networks, may be more effective at improving the performance of the network. Given the satisfactory performance of the network trained on the entire sample of original classifications (Section 4.4) and the time commitment required to reclassify all 956 galaxies, we decided not to continue with reclassifying all of the galaxies.

$^{7}$ One additional or one fewer feature.
$^{8}$ More than one additional or fewer features, or an exchange of features.
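The comparison of AUC scores across repeated trials can be sketched as follows. This is an illustrative example rather than the analysis code used in this work: it assumes that the AUC is computed from the rank statistic (the probability that a randomly chosen positive galaxy scores higher than a randomly chosen negative one, which equals the area under the ROC curve) and that the 68.2 per cent interval is taken from the 15.9th and 84.1th percentiles of the trial values. The function names and the toy inputs are hypothetical.

```python
import numpy as np


def auc_score(labels, scores):
    """Area under the ROC curve from the pairwise rank statistic."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    positives, negatives = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    diff = positives[:, None] - negatives[None, :]
    # Ties count as half a correct ordering.
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def median_with_interval(values):
    """Median and the lower/upper half-widths of the central 68.2 per cent interval."""
    low, median, high = np.percentile(values, [15.9, 50.0, 84.1])
    return median, median - low, high - median


# Toy example (hypothetical scores): one positive is ranked below one negative.
print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))

# Hypothetical AUC values from five trial runs of one configuration.
trial_aucs = [0.70, 0.74, 0.72, 0.68, 0.76]
print(median_with_interval(trial_aucs))
```

Two configurations can then be compared by checking whether the difference between their median AUC values is smaller than the corresponding half-width of the interval.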

APPENDIX C: PREDICTIONS OF THE STELLAR STREAM LEGACY SURVEY GALAXIES
Table C1 presents the median predictions across each of our classes of tidal features for the galaxies introduced by Martínez-Delgado et al. (2023). The predictions were generated by providing the network with the images of the galaxies and then averaging over the outputs across all of the models trained (see Section 4.4 for further details). Also included are the labels assigned by the network according to the optimal predictive threshold for each category introduced in Table 4.
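The procedure described above, combining the outputs of all trained models and then applying a per-class threshold, can be sketched as below. This is a minimal illustration under stated assumptions: the median is used as the average, the array of model outputs and the threshold values are hypothetical placeholders (the actual thresholds are those listed in Table 4), and the function name `ensemble_labels` is not taken from the code used in this work.

```python
import numpy as np

CLASSES = ("arm", "stream", "shell", "diffuse")


def ensemble_labels(model_outputs, thresholds):
    """Median prediction per class across models, and the classes above threshold.

    model_outputs: array of shape (n_models, n_classes) for a single galaxy.
    thresholds: one predictive threshold per class (hypothetical values here).
    """
    median_prediction = np.median(np.asarray(model_outputs, dtype=float), axis=0)
    labels = [
        name
        for name, prediction, threshold in zip(CLASSES, median_prediction, thresholds)
        if prediction >= threshold
    ]
    return median_prediction, labels


# Hypothetical outputs from three models for one galaxy.
outputs = [
    [0.9, 0.2, 0.1, 0.6],
    [0.7, 0.4, 0.0, 0.5],
    [0.8, 0.3, 0.2, 0.1],
]
prediction, labels = ensemble_labels(outputs, thresholds=[0.4, 0.4, 0.4, 0.4])
print(prediction, labels)
```

Because the classes are non-exclusive, a galaxy can receive several labels at once, or none if every median prediction falls below its threshold.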

Figure 1. Histogram of the 3σ limiting surface brightness magnitudes for DECaLS images in the g (top), r (middle), and z (bottom) bands. The median (solid line) and 68.2 per cent confidence interval (dashed lines) were estimated by determining the limit in ∼2,000 FITS cutout images of galaxies. For each image, we followed the method set out in Román et al. (2020) and used randomly sampled 10 × 10 arcsec² boxes.
(a) Merger question for DR1&2 campaigns. (b) Merger question for DR5 campaigns.

Figure 2. Galaxy Zoo DECaLS decision tree merger question presented to volunteers. The question aimed to establish if galaxies were merging or disturbed in some manner and changed between the DR1&2 (a) and DR5 campaigns (b). Both figures were adapted from Walmsley et al. (2022, W22).

Figure 3. The galaxy J094036.39+033436.9 as an example of the images presented to the inspector during the visual inspection process. The top left shows the image produced by W22, the top right shows the logarithmically scaled image, and the bottom row shows the novel pixel scaling algorithm introduced in this work and described by Equation 1 with the parameters  = max(pixel values),  = min(pixel values),  = 4, and  = 1 (left) or  = 0.5 (right).

Figure 4. Examples of each of the four categories of tidal features that we visually searched for in the DECaLS sample: arm, stream, shell, and diffuse. Each inspector would identify which features were present on the image, selecting all that were visible. Left-hand images were created by W22 and were used to train the network described in Section 4.1; right-hand images show the same galaxy using the Equation 1 pixel scaling with the parameters  = max(pixel values),  = med(background) − stdev(background),  = 4, and  = 0.5, where the background was estimated using sigma clipping with a 3σ clipping limit.
Overlap in classifications of feature categories.

Figure 6. Distributions of galaxy sample properties, including absolute and apparent r-band magnitudes, redshift, and half-light size. The distributions are split and normalised based on whether the galaxy had a tidal feature (red) or not (blue). The dashed line provides the median value. All properties are derived from values in the NASA Sloan Atlas (NSA) assuming the Planck Collaboration et al. (2020) cosmology.

Figure 7. Median Receiver Operating Characteristic (ROC) curves for each of the categories of tidal features (a) and the micro- and macro-averages (b). The median was calculated by combining the predictions for the testing sets across all five folds and averaging over ten independent trial runs of the network, and the shaded regions indicate the 68.2 per cent confidence interval. The ROC curve indicates how the fraction of correctly identified galaxies (TPR) and the fraction of incorrectly identified galaxies (FPR) change with respect to the predictive threshold, that being the value above which the output of the classifier would be taken to indicate a prediction that the feature was present. The macro-average averages each of the classes' ROC curves, and the micro-average treats each label independently.

Figure 8. Histograms of the output predictions combined across all five folds. The histogram is split based on whether the truth label for that class is positive (red) or negative (blue). Each value shows the median fraction of positive and negative galaxies with predictions in that bin, and the error bar shows ±1σ. The black dashed line and shaded region show the median ±1σ maximum output prediction across all ten trials.

Figure 9. The accuracy (a), precision (b), recall (c), and F1 (d) metrics across various predictive thresholds. The metrics were determined by combining the predictions for every galaxy across all five folds and splitting these into positive and negative based on the predictive threshold. These were then compared to the ground truth labels assigned to the galaxy for each of the given classes. The figure shows the median of ten attempts with the shaded area providing the 68.2 per cent confidence interval. Filled circles provide the value of each metric at the optimal threshold, as provided in Table 4.
Figure 10. Median ROC curve for the binary classification with comparative literature results. The binary classification considers only positive and negative incidences of any tidal feature, unlike the multi-label classification, which aims to identify the type of tidal feature. Shaded regions indicate the 68.2 per cent confidence interval; again, the median was determined over ten network trials. We include both the single and best-performing ensemble classifiers from W19, both shallow and original samples from DS23, and the best-performing classifier and our determination of the median of other classifiers from DBL24.

Figure 11. Median TPR at a specified FPR of 0.2 for each class, micro- and macro-average, and a binary classification identifying any tidal feature. The median was calculated by combining the outputs of the five folds and then averaging over ten trial runs of the network, and the error bar represents the 68.2 per cent confidence interval. Additionally shown are the corresponding values from the binary classifiers of DBL24, DS23 for both the network trained with (grey square) and without (black circle) shallower data, and W19 for both a single CNN (grey square) and the average of two ensembles of CNNs (black circle). See the text for details.

Figure A1. Training (red) and validation (blue) loss metrics evaluated at every epoch during training. Each panel shows the average across the five folds in that attempt. The black dashed line provides the median number of epochs the network was trained for, with the shaded region providing the 68.2 per cent confidence interval.

Figure A1 provides the training and validation losses across all epochs during training. Each panel shows the average across all of the five folds in that attempt, with the red line showing the training loss and the blue line the validation loss. The black dashed lines and shaded regions show the median epoch where training ended plus the 68.2 per cent confidence interval. In all cases, the validation loss is less than the training loss, but neither appears to diverge from the other, indicating that the network has not suffered from overfitting.

Figure B1. ROC curves from retraining the network on the new classifications. Each of the models in this plot was trained using ∼30 galaxies with the rest left for testing, the same ratios as when the classifier was trained using the full sample. The green curve provides the median of 5 trial runs of training the classifier on the original classifications of the reclassified galaxies, and the blue provides the same using the new classifications. Finally, the yellow curve is the median of classifiers trained using 20 randomly sampled sets of the full original classifications of galaxies to serve as a baseline for improved performance. Shaded regions provide the 68.2 per cent confidence interval.

Table 2. The architecture of the convolutional neural network. The architecture sets out the sequential (top to bottom) order of the layers of different operations, the number of nodes in each layer, the activation functions used, and the number of trainable parameters in that layer. Bracketed numbers show the number of nodes or parameters for the case of the binary network instead of the multi-label classification problem (see Section 5.2.1). See the text for details of the operations performed in each layer.
National Science Foundation, the Ministry of Science and Education of Spain, the Science and Technology Facilities Council of the United Kingdom, the Higher Education Funding Council for England, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the Kavli Institute of Cosmological Physics at the University of Chicago, Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundacao Carlos Chagas Filho de Amparo a Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Cientifico e Tecnologico and the Ministerio da Ciencia, Tecnologia e Inovacao, the Deutsche Forschungsgemeinschaft and the Collaborating Institutions in the Dark Energy Survey. The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energeticas, Medioambientales y Tecnologicas-Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenossische Technische Hochschule (ETH) Zurich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciencies de l'Espai (IEEC/CSIC), the Institut de Fisica d'Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig Maximilians Universitat Munchen and the associated Excellence Cluster Universe, the University of Michigan, NSF's NOIRLab, the University of Nottingham, the Ohio State University, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory, Stanford University, the University of Sussex, and Texas A&M University. BASS is a key project of the Telescope Access Program (TAP), which has been funded by the National Astronomical Observatories of China, the Chinese Academy of Sciences (the Strategic Priority Research Program "The Emergence of Cosmological Structures" Grant # XDB09000000), and the Special Fund for Astronomy from the Ministry of Finance. The BASS is also supported by the External Cooperation Program of Chinese Academy of Sciences (Grant # 114A11KYSB20160057), and Chinese National Natural Science Foundation (Grant # 12120101003, # 11433005). The Legacy Survey team makes use of data products from the Near-Earth Object Wide-field Infrared Survey Explorer (NEOWISE), which is a project of the Jet Propulsion Laboratory/California Institute of Technology. NEOWISE is funded by the National Aeronautics and Space Administration. The Legacy Surveys imaging of the DESI footprint is supported by the Director, Office of Science, Office of High Energy Physics of the U.S. Department of Energy under Contract No. DE-AC02-05CH1123, by the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility under the same contract; and by the U.S. National Science Foundation, Division of Astronomical Sciences under Contract No. AST-0950945 to NOAO.

During the training of a network, it is possible to record its performance and test for overfitting. To do this, a portion of the training data was reserved as validation data, which was used to test the network's performance at every epoch. The loss metric records how well output predictions from the network match the ground truth label. We used the binary cross-entropy loss commonly used for non-exclusive multi-label classifications. The loss (e.g. Murphy 2012) compares the truth label to the output prediction. At every epoch, the training and validation sets were used to evaluate the loss metric; if these begin to diverge, this can indicate that the network is overfitting to the training data.
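For reference, the binary cross-entropy loss for a non-exclusive multi-label problem can be written out in a few lines. The sketch below uses the standard textbook form of the loss, averaged over galaxies and classes; the clipping constant `eps`, the function name, and the toy labels and predictions are illustrative assumptions and do not reproduce the training code used in this work.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all galaxies and classes.

    y_true: truth labels (0 or 1), shape (n_galaxies, n_classes).
    y_pred: network outputs in [0, 1], same shape.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip the predictions so that log(0) is never evaluated.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    # Each class is treated as an independent yes/no question:
    # -[y log(p) + (1 - y) log(1 - p)]
    per_label = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_label.mean()


# Hypothetical truth labels for two galaxies in the order (arm, stream, shell, diffuse).
truth = [[1, 0, 0, 1], [0, 0, 0, 0]]
confident = [[0.9, 0.1, 0.1, 0.8], [0.1, 0.1, 0.1, 0.1]]
uninformative = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
print(binary_cross_entropy(truth, confident))
print(binary_cross_entropy(truth, uninformative))
```

Evaluating this quantity on the training and validation sets at every epoch gives the two curves that are compared to check for overfitting: a validation loss that rises while the training loss keeps falling is the divergence referred to above.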