Published July 14, 2022 | Version v3
Journal article Open

Datasets for Data-Centric Classification and Clustering

Description

This record contains the technical descriptions of the datasets used in the paper "A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering" (https://arxiv.org/abs/2106.16209). The source code is available at https://github.com/Emprime/dc3 . For each dataset, we provide a summary taken from the original work and a technical description:

The Plankton dataset was introduced in "Fuzzy Overclustering: Semi-supervised classification of fuzzy labels with overclustering and inverse cross-entropy" (https://doi.org/10.3390/s21196661). The dataset contains 10 plankton classes and has multiple labels per image, which were contributed by citizen scientists. In contrast to the previous work, we include fuzzy images in the training and validation sets and do not enforce a class balance, which results in a slightly different data split. Moreover, we preprocessed the data by recentering the images and removing artifacts such as scale bars.

The Turkey dataset was used in "Learn to train: Improving training data for a neural network to detect pecking injuries in turkeys" and "Keypoint Detection for Injury Identification during Turkey Husbandry Using Neural Networks". The dataset contains cropped images of potential injuries, each annotated independently by three experts as either not injured or injured.

The Mice Bone dataset is based on the raw data available at https://doi.org/10.5281/zenodo.3355936 . The raw data are 3D scans of collagen fibers in mice bones. The three proposed classes are similar collagen fiber orientations, dissimilar collagen fiber orientations, and regions that are not relevant due to noise or background. We used the given segmentations to cut out image regions, each consisting mainly of one class, from the original 2D image slices.

The CIFAR-10H dataset (https://github.com/jcpeterson/cifar-10h) provides multiple annotations for the test set of CIFAR-10.

Technical description

Each folder represents one dataset. The subfolders train, val and unlabeled represent the data splits Training, Validation and Unlabeled, respectively. The ground-truth label of each image is given by the name of the folder that contains it. Each image is one datapoint.
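The folder layout described above can be indexed with a few lines of Python. This is a minimal sketch, assuming the layout is <dataset>/<split>/<class>/<image>; the function name index_dataset and the tuple format are hypothetical and not part of the published code.

```python
from pathlib import Path

# Split names as given in the technical description.
SPLITS = ("train", "val", "unlabeled")


def index_dataset(root):
    """Return (image_path, split, label) triples for one dataset folder.

    Assumes the layout <root>/<split>/<class>/<image>, where the class
    folder name is the provided ground-truth label.
    """
    root = Path(root)
    records = []
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # tolerate a missing split instead of failing
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            for image_path in sorted(p for p in class_dir.iterdir() if p.is_file()):
                records.append((image_path, split, class_dir.name))
    return records
```

Calling index_dataset on one of the dataset folders from the archive would then yield one triple per image, which can be fed into any data loader.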

The filenames for the Plankton data are the original Ecotaxa IDs (https://ecotaxa.obs-vlfr.fr/). The filenames for the Turkey data are random numbers; class 0 means not injured and class 1 means injured. The filenames for the Mice Bone dataset are <ORIGINAL SEGMENTATION>#<SCAN ID>#<SLICE NUMBER>#<COUNTER>.png. The filenames for the CIFAR-10H dataset are randomly generated counters.
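The Mice Bone filenames consist of four fields separated by "#", so they can be split back into their parts. The following is a minimal sketch; the names MiceBoneName and parse_mice_bone_filename are hypothetical, and all fields are kept as strings because the record does not state their types.

```python
from pathlib import Path
from typing import NamedTuple


class MiceBoneName(NamedTuple):
    segmentation: str  # original segmentation
    scan_id: str       # scan ID
    slice_number: str  # slice number within the scan
    counter: str       # running counter


def parse_mice_bone_filename(filename):
    """Split a Mice Bone filename into its four '#'-separated fields."""
    parts = Path(filename).stem.split("#")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '#'-separated fields, got {len(parts)}: {filename!r}")
    return MiceBoneName(*parts)
```

For a made-up filename such as "similar#scan01#42#7.png", the slice number would be available as parse_mice_bone_filename(...).slice_number.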

Each dataset has an additional dataset_import.json and an annotations.json file. The first file contains essentially the same information as the folder structure explained above (image path, class names, data split and ground-truth class). The second file contains the raw annotations per file. These annotations can be used to approximate the underlying ground-truth distribution. The provided ground-truth label was randomly selected from the approximation of the underlying ground truth based on this file.
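As an illustration of how raw annotations can approximate a ground-truth distribution, the sketch below uses relative annotation frequencies and then draws one label at random. This is an assumption: the record does not specify the exact approximation, and the schema of annotations.json is not described here, so the functions take a plain list of labels for one image. The function names are hypothetical.

```python
import random
from collections import Counter


def label_distribution(annotations):
    """Estimate a class distribution from the raw annotations of one image.

    Assumption: the distribution is approximated by relative frequencies.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one annotation is required")
    return {label: n / total for label, n in counts.items()}


def sample_label(distribution, rng):
    """Draw one label at random according to the estimated distribution."""
    labels = list(distribution)
    weights = [distribution[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]
```

With three annotations of which two say "injured", the estimated probability of "injured" is 2/3, and sample_label(dist, random.Random(0)) returns one of the annotated labels.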

License Information

CIFAR-10H is already published at https://github.com/jcpeterson/cifar-10h under Creative Commons BY-NC-SA 4.0 license.
Full license information at https://creativecommons.org/licenses/by-nc-sa/4.0/
The CIFAR-10 image and original label data can be found at: https://www.cs.toronto.edu/~kriz/cifar.html
The data was reformatted for this paper and is republished under Creative Commons BY-NC-SA 4.0 license.

All other datasets (Plankton, Turkey, Mice Bone) are adapted works of previous publications. See the descriptions above or the citations below for the original works.
They are republished under the Creative Commons BY-SA 4.0 license. Full license information at https://creativecommons.org/licenses/by/4.0/

Citation

Be aware that you need to reference this and previous works if you want to use this data.
Please cite as:

@Article{schmarje2022dc3,
AUTHOR = {Schmarje, Lars and Santarossa, Monty and Schröder, Simon-Martin and Zelenka, Claudius and Kiko, Rainer and Stracke, Jenny and Volkmann, Nina and Koch, Reinhard},
TITLE = {A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering},
JOURNAL = {Proceedings of the European Conference on Computer Vision (ECCV)},
YEAR = {2022},
}

Original data in:

@Article{Schmarje2021foc,
AUTHOR = {Schmarje, Lars and Brünger, Johannes and Santarossa, Monty and Schröder, Simon-Martin and Kiko, Rainer and Koch, Reinhard},
TITLE = {Fuzzy Overclustering: Semi-Supervised Classification of Fuzzy Labels with Overclustering and Inverse Cross-Entropy},
JOURNAL = {Sensors},
VOLUME = {21},
YEAR = {2021},
NUMBER = {19},
ARTICLE-NUMBER = {6661},
URL = {https://www.mdpi.com/1424-8220/21/19/6661; https://doi.org/10.5281/zenodo.5550919},
ISSN = {1424-8220},
DOI = {10.3390/s21196661}
}

@article{peterson2019cifar10h,
author = {Peterson, Joshua and Battleday, Ruairidh and Griffiths, Thomas and Russakovsky, Olga},
doi = {10.1109/ICCV.2019.00971},
eprint = {1908.07086},
isbn = {9781728148038},
issn = {15505499},
journal = {Proceedings of the IEEE International Conference on Computer Vision},
pages = {9616--9625},
title = {{Human uncertainty makes classification more robust}},
volume = {2019-Octob},
year = {2019}
}

@article{volkmann2021turkey,
author = {Volkmann, Nina and Br{\"{u}}nger, Johannes and Stracke, Jenny and Zelenka, Claudius and Koch, Reinhard and Kemper, Nicole and Spindler, Birgit},
doi = {10.3390/ani11092655},
journal = {Animals 2021},
pages = {1--13},
title = {{Learn to train: Improving training data for a neural network to detect pecking injuries in turkeys}},
volume = {11},
year = {2021}
}

@article{schmarje2019,
author = {Schmarje, Lars and Zelenka, Claudius and Geisen, Ulf and Gl{\"{u}}er, Claus-C. and Koch, Reinhard},
doi = {10.1007/978-3-030-33676-9_26},
eprint = {1907.12868},
isbn = {9783030336752},
issn = {23318422},
journal = {DAGM German Conference of Pattern Recognition},
pages = {374--386},
publisher = {Springer},
title = {{2D and 3D Segmentation of uncertain local collagen fiber orientations in SHG microscopy}},
volume = {11824 LNCS},
year = {2019}
}

Files

DC3-Datasets.zip (375.6 MB)
md5:06762087068ca8363246a79a1535d5fe
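
The md5 checksum listed for DC3-Datasets.zip can be used to verify a download. The following is a minimal sketch using only the Python standard library; the function name md5_of_file and the local path in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, md5_of_file("DC3-Datasets.zip") should equal "06762087068ca8363246a79a1535d5fe" if the archive is intact.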