FIRST radio galaxy data set containing curated labels of classes FRI, FRII, compact and bent

Automated classification of astronomical sources is often challenging due to the scarcity of labelled training data. We present a data set with a total of 2158 data items that contains radio galaxy images with their corresponding morphological labels taken from various catalogues [1,2]. The data set is curated by removing duplicates and ambiguous morphological labels and by reconciling different meta data formats. The image data was acquired by the VLA FIRST (Faint Images of the Radio Sky at Twenty-Centimeters) survey [3]. The morphological labels are collected and the catalogue-specific classification definitions are converted into a 4-class classification scheme: FRI, FRII, Compact and Bent sources. FRI and FRII correspond to the two classes of the widely used Fanaroff-Riley classification [4]. We consider two more classes: compact sources and bent-tail galaxies. For duplicates with different morphological labels, the galaxy is regarded as ambiguously labeled and both coordinates are removed. For the remaining list of coordinates, the radio galaxy images are collected from the virtual observatory SkyView (https://skyview.gsfc.nasa.gov/current/cgi/query.pl). The gray value images are provided in a size of 300 × 300 pixels, and all pixels with a value below three times the local RMS of the noise are set to this threshold value. The data set is useful for the development of robust machine learning models that automate the classification of radio galaxy images.


© 2023 The Author(s

Value of the Data
• The data set provides an easy-to-access, curated and combined collection based on various catalogues. It can be used to develop supervised deep learning models that classify radio galaxies into the categories FRI, FRII, Compact and Bent.
• The data set benefits computer scientists in the field of astronomy and astrophysics who are developing supervised, self-supervised or unsupervised deep learning models for automatic classification, object detection or data generation. Further, the data set can be used to validate and evaluate unsupervised models.
• In combination with a labeled LOFAR data set, the data set can be used to develop a model that generalizes to data from another telescope operating in a different wavelength range.
• The easily accessible and curated data set is suitable for educational purposes in applying machine learning methods to astronomical data.

Objective
The data set was created to train supervised deep learning models on radio galaxy data, since the previously available data set [12] is limited to 1256 data entries. The extraction of data from various catalogues turned out to be challenging because the meta data is not consistent. In some catalogues, identical radio galaxy sources have different coordinates due to different rounding schemes. Further, identical radio galaxy sources can have different classification labels in different catalogues. In this data set, data items can easily be filtered by class label, catalogue or coordinate range. Researchers should be able to build on this data set without having to repeat this work.

Data Description
We combined different catalogues which characterise radio galaxy sources from the FIRST survey [3] to create a data set of radio galaxy images with morphological labels. The labeling is typically done by experts who consider the radio images and the corresponding optical counterparts. In this work, we group radio sources into 4 classes (FRI, FRII, Compact and Bent) as done in [13,14]. FRI and FRII are defined by Fanaroff and Riley in [4]. The Compact class consists of unresolved point sources. The Bent class consists of sources for which the angle between the jets differs significantly from 180 degrees. It contains two subtypes: narrow-angle tail (NAT) and wide-angle tail (WAT) galaxies, depending on the angle between the jets. Fig. 1, adapted from [1], shows a few examples of each class. The created data set has a total size of 5.1 MB.

firstgalaxydata/galaxy_data.zip
Unzipping galaxy_data.zip yields the galaxy_data folder, which contains the data set in its folder structure. The number of radio galaxy sources per class and split is given in Table 1.

firstgalaxydata/galaxy_data_h5.zip
The combined data set collected from the FIRST catalogues is summarized in the HDF5 file galaxy_data_h5.h5 with a group named "data_$(i)" for every data entry, with i = 1, ..., n and n the total number of data entries. Each group has the following data sets
• "Img": two-dimensional uint8 array with shape (300, 300).
The data set "Img" has the following attributes

firstgalaxydata/firstgalaxydata.py
This Python class loads galaxy_data_h5.h5 with several filtering options and provides a data.Dataset class for the PyTorch framework.
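For readers who want to inspect the HDF5 file without the firstgalaxydata.py class, the layout described above can also be traversed directly. The following is a minimal sketch, not the authors' loader: it assumes the h5py package is installed, assumes that the group names expand to "data_1", "data_2", and so on, and uses the hypothetical helper names entry_index, iter_images and load_all.

```python
import re


def entry_index(name):
    """Return i for a group name of the form 'data_<i>', else None."""
    match = re.fullmatch(r"data_(\d+)", name)
    return int(match.group(1)) if match else None


def iter_images(h5file):
    """Yield (i, image) for every 'data_<i>' group, in numeric order.

    `h5file` is an open h5py.File or any mapping with the same layout.
    """
    indexed = sorted(
        (entry_index(name), name)
        for name in h5file
        if entry_index(name) is not None
    )
    for i, name in indexed:
        # Indexing with an empty tuple reads the whole "Img" data set.
        yield i, h5file[name]["Img"][()]


def load_all(path="galaxy_data_h5.h5"):
    """Read every image into a dict; the default path is an assumption."""
    import h5py  # imported lazily so the helpers above work without it

    with h5py.File(path, "r") as h5file:
        return dict(iter_images(h5file))
```

Sorting by the numeric index rather than by name avoids the lexicographic order "data_1", "data_10", "data_2" that a plain sort of the group names would produce.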

firstgalaxydata/Example_firstgalaxydata.py
This file shows example code on how to use the firstgalaxydata.py class.

requirements.txt
The requirements.txt lists the packages necessary to use the firstgalaxydata.py class with Python.

meta/FRICat_Capetti_2017_relabeled.csv
The FRICat_Capetti_2017_relabeled.csv contains the relabeled sources from the FRICat Capetti catalogue as shown in Table 2.

meta/galaxy_data_removed.csv
The galaxy_data_removed.csv file contains a list of sources that are not included in the data set. Each of these sources is already represented in the data set by an entry with slightly different coordinates, within the deviation of ±0.015° in right ascension (RA) and declination (DEC). Sources with different coordinates within this deviation are regarded as duplicates.

meta/galaxy_data_different_labels.csv
The galaxy_data_different_labels.csv contains a list of pairs of coordinates which are within the deviation of ±0.015° in right ascension (RA) and declination (DEC) but whose label information differs between the original catalogues. These source coordinates are entirely excluded from the data set.
Fig. 1. Examples of the four classes.

Experimental Design, Materials and Methods
The morphological labels are collected and assigned to the corresponding class from the following catalogues. For mapping the labels to the radio galaxy images, the equatorial coordinates (J2000) with right ascension (RA) and declination (DEC) are used. For the FRI class, we used the catalogue of [5,6] by collecting images from data tables CoNFIG-1 to CoNFIG-4 with label "I". Additionally, images from the catalogue of [10] were collected with label "0", and the images from the catalogue of [7] were selected. For the FRII class, we used the catalogue of [5,6] by collecting images from data tables CoNFIG-1 to CoNFIG-4 with labels "II" and "IIc". The catalogue of [10], with images labeled "5" and "6", provided further images of FRII along with the images of the catalogue of [8]. For the Compact class, we used the catalogue of [5,6] by collecting images from data tables CoNFIG-1 to CoNFIG-4 with labels "C", "C*" and "S*". Further, the catalogue of [9] was used for Compact. For the Bent class, we used the catalogue of [5,6] by collecting images from data tables CoNFIG-1 to CoNFIG-4 with label "Iw". Further, we selected bent-type sources with labels "WAT" and "NAT" from Table 1 of [11] only. From catalogue [10] we collected images with labels "1" and "2" for the Bent class.
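The label conversion described above amounts to a lookup table. The following is a minimal sketch of that table, assuming hypothetical catalogue keys ("config" for the CoNFIG tables of [5,6], and "ref7" to "ref11" for the catalogues cited as [7] to [11]) and a hypothetical function name to_class; the label strings and target classes are taken from the text.

```python
# Catalogues whose entries carry per-source labels (keys are hypothetical).
LABEL_MAP = {
    "config": {  # CoNFIG-1 to CoNFIG-4, refs [5,6]
        "I": "FRI",
        "II": "FRII",
        "IIc": "FRII",
        "C": "Compact",
        "C*": "Compact",
        "S*": "Compact",
        "Iw": "Bent",
    },
    "ref10": {"0": "FRI", "5": "FRII", "6": "FRII", "1": "Bent", "2": "Bent"},
    "ref11": {"WAT": "Bent", "NAT": "Bent"},  # Table 1 of [11] only
}

# Catalogues that contribute to a single class as a whole.
# Note: nine sources of [7] were later re-labeled manually to Bent.
WHOLE_CATALOGUE_CLASS = {"ref7": "FRI", "ref8": "FRII", "ref9": "Compact"}


def to_class(catalogue, label=None):
    """Map a catalogue-specific label to FRI, FRII, Compact or Bent.

    Returns None for labels that are not collected into the data set.
    """
    if catalogue in WHOLE_CATALOGUE_CLASS:
        return WHOLE_CATALOGUE_CLASS[catalogue]
    return LABEL_MAP.get(catalogue, {}).get(label)
```

Labels that do not appear in the table map to None, which mirrors the fact that only the listed labels are collected from each catalogue.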
We identified 300 source coordinates within a deviation of ±0.015° in right ascension (RA) and declination (DEC) from different catalogues. 227 of the 300 source coordinates had the same label information in different catalogues. Here, the radio galaxy image is added only once to the data set and the listed 227 source coordinates are regarded as duplicates. 73 of the 300 source coordinates had different label information in different catalogues. Here, the radio galaxy images are ambiguously labeled and excluded from the data set entirely. From the FRICAT catalogue [7], 9 sources were manually re-labeled from FRI to Bent since these sources are NAT galaxies. The coordinates of the re-labeled galaxies are listed in Table 2.
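The duplicate handling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: RA and DEC differences are compared directly against the ±0.015° deviation, RA wrap-around at 0°/360° is ignored, the first coordinate seen is kept as the representative, and the names Entry and curate are hypothetical.

```python
from dataclasses import dataclass, field

TOLERANCE_DEG = 0.015  # deviation in RA and DEC stated in the text


@dataclass
class Entry:
    ra: float
    dec: float
    labels: set = field(default_factory=set)


def curate(sources, tol=TOLERANCE_DEG):
    """Split (ra, dec, label) tuples into kept and ambiguous sources.

    Sources within `tol` degrees in both RA and DEC are treated as the
    same galaxy. Matching sources with one common label are kept once;
    matching sources with differing labels are excluded entirely.
    """
    entries = []
    for ra, dec, label in sources:
        for entry in entries:
            if abs(entry.ra - ra) <= tol and abs(entry.dec - dec) <= tol:
                entry.labels.add(label)  # duplicate of an existing entry
                break
        else:
            entries.append(Entry(ra, dec, {label}))

    kept = [
        (e.ra, e.dec, next(iter(e.labels)))
        for e in entries
        if len(e.labels) == 1
    ]
    ambiguous = [
        (e.ra, e.dec, sorted(e.labels))
        for e in entries
        if len(e.labels) > 1
    ]
    return kept, ambiguous
```

Collecting all labels per position before deciding, instead of deleting an entry at the first conflict, ensures that a third catalogue entry at the same position does not re-introduce a source that was already found to be ambiguous.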

Preprocessing
We downloaded the images of the FIRST survey from the virtual observatory SkyView (https://skyview.gsfc.nasa.gov) using the equatorial coordinates (J2000). The original image size is 300 × 300 pixels. We adopted the preprocessing and the choice of preprocessing parameters from [15] and [13]. First, all NaN values are set to zero. Second, all pixel values below three times the local RMS noise are set to the value of this threshold with the help of the functions sigma_clipped_stats and preprocess_clip_normalize. The value of sigma equal to 3 ensures that