2D geometric shapes dataset – for machine learning and pattern recognition

In this paper, we present a data article that describes a dataset of nine 2D geometric shapes, and each shape is drawn randomly on a 200 × 200 RGB image. During the generation of this dataset, the perimeter and the position of each shape are selected randomly and independently for each image. The rotation angle of each shape is chosen randomly for each image within an interval between -180° and 180°, as well as the background colour of each image and the filling colour of each shape are selected randomly and independently. The published dataset is composed of 9 classes of data, and each class represent a type of geometric shape (Triangle, Square, Pentagon, Hexagon, Heptagon, Octagon, Nonagon, Circle and Star). Each class is composed of 10k generated images. This paper also includes a GitHub URL to the generator source code used for the generation, which can be reused to generate any desired size of data. The proposed dataset aims to provide a perfectly clean dataset, for classification as well as clustering purposes. The fact that this dataset is generated synthetically provides the ability to use it to study the behaviour of machine learning models independently of the nature of the dataset or the possible noise or data leak that can be found in any other datasets. Moreover, the choice of a 2D geometrical shape dataset provides the ability to understand as well to have good knowledge of the number of patterns stored inside each data class.


a b s t r a c t
In this paper, we present a data article that describes a dataset of nine 2D geometric shapes, and each shape is drawn randomly on a 200 × 200 RGB image. During the generation of this dataset, the perimeter and the position of each shape are selected randomly and independently for each image. The rotation angle of each shape is chosen randomly for each image within an interval between -180 °and 180 °, as well as the background colour of each image and the filling colour of each shape are selected randomly and independently. The published dataset is composed of 9 classes of data, and each class represent a type of geometric shape (Triangle, Square, Pentagon, Hexagon, Heptagon, Octagon, Nonagon, Circle and Star). Each class is composed of 10k generated images. This paper also includes a GitHub URL to the generator source code used for the generation, which can be reused to generate any desired size of data. The proposed dataset aims to provide a perfectly clean dataset, for classification as well as clustering purposes. The fact that this dataset is generated synthetically provides the ability to use it to study the behaviour of machine learning models independently of the nature of the dataset or the possible noise or data leak that can be found in any other datasets. Moreover, the choice of a 2D geometrical shape dataset provides the ability to understand as well to have good knowledge of the number of patterns stored inside each data class. The geometric shapes were generated by using the turtle library and the Python programming language.

Data format
Raw Parameters for data collection Each geometric shape was drawn on a 200 × 200 pixel image, and each shape has a random background colour, random filling colour, a random position inside the image, an arbitrary rotation angle between -180 °and 180 °, and a random perimeter under the condition that the drawn shape must be fully visible inside the containing image.

Description of data collection
Each geometric shape was drawn 10 0 0 0 times under the conditions described above, and each image was saved into a PNG image.

Data source location
Institution Value of the data • The simulated dataset can be used to train machine learning models to perform geometric shape recognition tasks. • This dataset could be useful for researchers studying the overfitting of machine learning models [2] since the classification of the images inside this dataset can only be performed with a model that focuses on pattern detection [1] instead of memorization. • This dataset could be useful for researchers to benchmark and test any regularization techniques that can be applied to machine learning models or any other approach that can help to avoid the overfitting. • This dataset consists of a set of images containing random drawn geometrical shapes, which provides a clean dataset from any data leak that would influence the benchmark of any machine learning model. Examples:

Data description
The background colour of each image, the filling colour of each shape, the rotation angle of each shape, the position of each shape inside the image, as well the perimeter of each shape is generated randomly and independently.

Experimental design, materials, and methods
The generation program was developed using the Python programming language and the Turtle library to draw geometric shapes inside each image as well we have used the NumPy library to generate all the random parameters that we have used during the generation. The generation program receives as an input the number of desired samples per each shape as well as the path of the storage directory where the generated images will be stored. The goal of the tool is to generate a set of labelled images that contains 2D geometric shapes.
Each image is generated with the following parameters: • A fixed image size of 200 × 200 pixel • A random background colour: the background colour is generated randomly by using the function numpy. random.randInt(0, 255, 3). This function generates three random integers from the discrete uniform distribution in the half-open interval of 0 and 255. Each number represent respectively, the RGB values that construct the background colour. We ignore the possibility that the background colour may be different or close to the filling colour as we consider it as a normal task for a machine learning model, to be able to identify the drawn shape with a filling colour close to the background colour. • A random filling colour of each shape: the filling colour of each shape is generated by using the same setting described in the generation of the background colour. Note that the generation of the filling colour is independently from the generation of the background colour (we ignore the very low probability that the background colour would be the same as the filling colour). • A random rotation angle between -180 °and 180 °• A random shape perimeter • A random shape position inside the containing image The perimeter and the position of each shape inside each image are selected as the following.
• First, we select the radius of the circumscribed circle of the drawn shape randomly, and this selection is performed within the interval of 10 and 75 (pixels), which ensures that the drawn shape may have a random perimeter between the interval of 10 and 75. Also, it gives that the smallest possible shape may be circumscribed by a circle with a radius = 10 pixel, and the largest possible shape may be circumscribed by a circle with a radius = 75 pixel. In Table 1 we show the minimal possible perimeter of each shape the maximal possible perimeter for each shape, the minimal possible area of each shape and the maximal possible area of each shape based on the parameters described above. • Then, in the function of the selected radius value, we choose the position of the centre of the circumscribed circle of the drawn shape inside the image randomly, the position is selected under the condition that all the drawn shape must be fully visible inside the image. • And finally, we draw the shape in the function of the centre and position of its circumscribed circle. Note that the circumscribed circle of the shape is used only to draw the shapes, and it is not visible in the generated images ( Fig. 2 ).
All the parameters described above are generated for each sample independently and identically. All the random values are selected by using the Numpy function numpy.random.randInt , which generates a defined number of random values from the uniform discrete distribution in the half-open interval specified in the function call arguments.

A use case example with convolutional neural networks
This dataset could be used to train a convolutional neural network to perform geometric shape recognition and classification. To do so, this dataset must be divided into two separate sets of data a training dataset and a testing dataset, the selection of samples that composes each collection of data is made randomly: • A training dataset which consists of 40% of samples of the original dataset • A testing dataset which consists of 60% of samples of the original dataset The architecture of the convolutional neural network is defined as the following: 7 convolutional layers and one fully connected layer, which represents the output layer of the neural network. Each convolutional layers is illustrated with a parametrized number of filters a batch normalization layer [3] an activation function, and a max-pooling layer follows each convolutional layer. Fig. 1 shows the convolutional neural network architecture that can be used on this dataset.