High frequency accuracy and loss data of random neural networks trained on image datasets

Neural Networks (NNs) are increasingly used across scientific domains to extract knowledge from experimental or computational data. An NN is composed of natural or artificial neurons that serve as simple processing units and are interconnected into a model architecture; it acquires knowledge from the environment through a learning process and stores this knowledge in its connections. The learning process is conducted by training. During NN training, the learning process can be tracked by periodically validating the NN and calculating its fitness. The resulting sequence of fitness values (i.e., validation accuracy or validation loss) is called the NN learning curve. The development of tools for NN design requires knowledge of diverse NNs and their complete learning curves. Generally, only final fully-trained fitness values for highly accurate NNs are made available to the community, hampering efforts to develop tools for NN design and leaving unaddressed aspects such as explaining the generation of an NN and reproducing its learning process. Our dataset fills this gap by fully recording the structure, metadata, and complete learning curves for a wide variety of random NNs throughout their training. Our dataset captures the lifespan of 6000 NNs throughout generation, training, and validation stages. It consists of a suite of 6000 tables, each table representing the lifespan of one NN. We generate each NN with randomized parameter values and train it for 40 epochs on one of three diverse image datasets (i.e., CIFAR-100, FashionMNIST, SVHN). We calculate and record each NN’s fitness with high frequency—every half epoch—to capture the evolution of the training and validation process. As a result, for each NN, we record the generated parameter values describing the structure of that NN, the image dataset on which the NN trained, and all loss and accuracy values for the NN every half epoch. 
We put our dataset to the service of researchers studying NN performance and its evolution throughout training and validation. Statistical methods can be applied to our dataset to analyze the shape of learning curves in diverse NNs, and the relationship between an NN’s structure and its fitness. Additionally, the structural data and metadata that we record enable the reconstruction and reproducibility of the associated NN.


Value of the Data

• The ubiquity of NNs has led to significant investment in tools for NN design [1,2]. Development of such tools requires knowledge about diverse NNs and their learning curves (i.e., fitness throughout training) [3]. Existing NN repositories store only highly accurate NNs, together with their final fitness values, and do not include the full NN learning curves [4,5]. Our dataset fills this gap by recording complete learning curves for a wide variety of random NNs.
• Our data is relevant for researchers developing tools for NN design. Such tools include neural architecture search [6-8] and methods for NN fitness prediction and training termination [9-11]. Learning curve data is essential to the development of methods for NN fitness modeling and prediction [3,12].
• Researchers can use our dataset to study the evolution of NN fitness during training and to identify relationships between an NN's structure and its fitness on a given image dataset. For example, a researcher can analyze specific columns from each NN table to study the relationship between particular design elements of the NNs (e.g., learning rate; batch size; number, order, and type of layers) and the learning curves.
• Parametric modeling of learning curves is increasingly used to model and predict fitness in machine learning applications [3]. Statistical methods can be applied to our dataset to analyze the shape of the learning curves. This enables researchers to identify families of functions that model such curves well and to make informed choices about which modeling functions to employ in parametric modeling methods [12].
• Our data can advance effective searches for accurate NNs, which have a far-reaching impact on many fields. Accurate NNs can be used to extract structural information from raw microscopy data [13], detect I/O interference in batch jobs [14], predict performance of business processes [15], predict soil moisture or maize yield [16], detect rare transitions in molecular dynamics simulations [17,18], analyze cancer pathology data [19], and map protein sequences to folds [20].

Data Description
We define a taxonomy of the random NNs that we generated and trained to build our dataset. Fig. 1 depicts the structure of our NNs. Each NN is composed of two sections, Feature Extraction and Classification. The Feature Extraction section consists of convolutional and non-linear layers; we alternate these layers such that each convolutional layer is followed by at least one and at most three non-linear layers before the next convolutional layer is applied. The Classification section consists of fully connected layers, with possible dropout layers in between.
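This taxonomy can be sketched as a small data structure. This is an illustration only: the class and field names below are ours, not identifiers from the dataset.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConvBlock:
    """One convolutional layer plus the 1-3 non-linear layers that follow it."""
    filters: int
    nonlinear: List[str]  # e.g. ["ReLU"], ["ReLU", "dropout"], ["pooling"]

    def __post_init__(self):
        if not 1 <= len(self.nonlinear) <= 3:
            raise ValueError("each convolution is followed by 1-3 non-linear layers")

@dataclass
class RandomNN:
    """Two-section structure from Fig. 1: Feature Extraction, then Classification."""
    feature_extraction: List[ConvBlock]  # 1-10 convolutional blocks
    fc_filters: List[int]                # one entry per fully connected layer
    fc_dropout: List[bool]               # dropout after each FC layer except the last
```

The `__post_init__` check encodes the alternation constraint from the taxonomy: a convolution may never be followed directly by another convolution, nor by more than three non-linear layers.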
Our data depict the lifespan of 6000 NNs throughout generation, training, and validation stages, across 40 epochs of training, with fitness values captured every half epoch. The NNs are randomly generated using our taxonomy. The dataset consists of 6000 tables, each with 28 columns and 81 rows, together with a Python script that demonstrates how to load the data into a Pandas DataFrame and how to calculate and save metrics of interest, such as the mean accuracy or the learning rate of an NN. The dataset is publicly available in the Harvard Dataverse repository: https://doi.org/10.7910/DVN/ZXTCGF.

The data format is tabular: the information is organized in .txt files. Each .txt file contains a single table capturing the lifespan of one NN. Each table contains 81 rows and 28 columns. The first row stores the column names, and the remaining 80 rows correspond to every half epoch throughout the lifespan of the NN, beginning at epoch 0.5 and ending at epoch 40. The columns correspond to the fitness data and the metadata that we track throughout the lifespan of the NNs. The first four columns contain training and validation data of the NN; these values change throughout the lifespan of the NN, and hence these columns populate all rows. The remaining columns contain metadata describing the generation of the NN and its structure; these values do not change throughout the lifespan of the NN and thus are only recorded in the second row. From left to right, the columns of each NN table are described below.

[...] the value is 0.

17. layer_types: The type of non-linear layers following each convolution, reported in a hyphen-separated list of integers. Each integer corresponds to one convolutional layer, and its value encodes the block of non-linear layers following that convolutional layer. The integers are recorded consecutively, beginning with the integer corresponding to the first convolutional layer. Table 1 depicts the block of non-linear layers encoded by each integer value.
For example, layer_types = 1-5-2 means the first convolution is followed by a ReLU layer, the second convolution is followed by a ReLU layer and then a dropout layer, and the third convolution is followed by a pooling layer.

[...] If no dropout layers are added in the Classification section, the value is 0.

27. FC_dropout_layers: A hyphen-separated list of integers denoting which fully connected layers are followed by dropout layers. The integers represent boolean values: 0 for False, 1 for True. The integers correspond to consecutive fully connected layers, beginning with the first one; a value of 0 means the current fully connected layer is not followed by a dropout layer, and a value of 1 means it is.

28. FCFilters: The number of filters of each fully connected layer, reported in a hyphen-separated list of integers. Each integer is the number of filters for one fully connected layer. The numbers are recorded consecutively, beginning with the first fully connected layer.
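The hyphen-separated list columns can be decoded in a few lines of Python. In this sketch, only the integer codes confirmed by the layer_types example above are mapped; the complete code-to-block mapping is given in Table 1 of the paper.

```python
# Codes for blocks of non-linear layers, as illustrated by the layer_types
# example in the text (1-5-2). The full mapping appears in Table 1.
LAYER_BLOCKS = {
    1: ["ReLU"],
    2: ["pooling"],
    5: ["ReLU", "dropout"],
}

def decode_layer_types(value: str):
    """Map each convolutional layer to its block of non-linear layers."""
    return [LAYER_BLOCKS[int(code)] for code in value.split("-")]

def decode_fc_dropout(value: str):
    """Boolean flags: which fully connected layers are followed by dropout."""
    return [code == "1" for code in value.split("-")]

def decode_fc_filters(value: str):
    """Number of filters of each fully connected layer."""
    return [int(code) for code in value.split("-")]

print(decode_layer_types("1-5-2"))
# [['ReLU'], ['ReLU', 'dropout'], ['pooling']]
```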
Our dataset amounts to 109.4 MB of data distributed in 6000 tabular files. Because of the significant size of the dataset, we do not include the full dataset in the text of this paper. The full dataset can be downloaded from our public Harvard Dataverse repository; a link is included in the Specifications Table under "Data Accessibility." Table 2 gives an example of the first 3 rows of one of these 6000 NN tabular .txt files: the first row contains the column names; the second row contains the training and validation data at epoch 0.5 as well as the metadata describing the generation of the NN and its structure; the third row contains the training and validation data at epoch 1.0. The remaining 78 rows contain training and validation data for each consecutive half epoch up through 40 epochs; we do not include these rows in the paper because of space constraints. The full table can be found in our dataset. The NN represented in this table has unique ID "2021_02_15_12_18_09_100387" (row 2, column 5 of Table 2) and is trained on CIFAR-100 (row 2, column 9 of Table 2). The path to this table in our dataset is "CIFAR-100_models/2021_02_15_12_18_09_100387.txt".
Our dataset includes the Python script DataLoader.py. This script shows how to load the tabular .txt files into a Pandas DataFrame, isolate columns of interest, perform computations (e.g., calculating max, min, or mean values of the accuracy or loss of an NN over its lifespan), aggregate computations for all NNs into a single DataFrame, and save the aggregate calculated metrics in a .csv file.
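A dependency-free sketch of the kind of processing DataLoader.py performs is shown below. The shipped script uses Pandas; here we assume whitespace-delimited columns, and the column names and values in the sample are illustrative only, not taken from the dataset.

```python
def load_table(text: str):
    """Parse one NN table: the first row holds column names, the rest hold data."""
    lines = [line.split() for line in text.strip().splitlines()]
    return lines[0], lines[1:]  # header, data rows

def column(header, rows, name):
    """Extract one column by name; metadata columns are populated only in row 1."""
    i = header.index(name)
    return [row[i] for row in rows if i < len(row)]

# Illustrative two-row table (real tables have 28 columns and 80 data rows).
sample = """epoch val_accuracy
0.5 0.10
1.0 0.14
"""
header, rows = load_table(sample)
accs = [float(v) for v in column(header, rows, "val_accuracy")]
best = max(accs)  # maximum validation accuracy over the recorded half epochs
```

The `column` helper skips rows shorter than the requested index, which mirrors the table layout described above: metadata columns hold a value only in the first data row.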

Neural Network Generation
For each of our three image datasets (i.e., CIFAR-100, FashionMNIST, and SVHN) we generate 2000 NNs with random parameter values and train them on that image dataset, for a total of 6000 NNs described in our dataset. We generate each NN according to the structure in Fig. 1 with uniformly randomized parameter values from the intervals defined in Table 3.
Zooming into Table 3 , we generate three different sets of parameters (i.e., Feature Extraction Parameters, Classification Parameters, and Training Parameters).

• Feature Extraction Parameters
On each of our three image datasets, we generate 2000 NNs, 200 each with x convolutional layers, for 1 ≤ x ≤ 10. This ensures that the number of convolutional layers of the NNs is uniformly distributed between 1 and 10. For each convolutional layer, we randomize kernel, stride, and padding values, as well as the number of filters. Often, NNs are structured so that the number of filters for the convolutional layers increases with each layer. We generate some NNs whose convolutional filters increase sequentially, but we do not restrict our data to only NNs with this property. We achieve this by generating a random boolean for each NN that determines whether or not to increase the number of filters in each sequential convolutional layer. Fig. 2 shows the process to randomize the number of filters for the convolutional layers of each NN, depending on the value of the boolean and the position of the layer in the NN. If the number of filters is not required to increase, then the number of filters for the last convolution C_c is chosen uniformly in the range [number of classes, 400], and the number of filters for each convolution C_i, i < c, is chosen uniformly [...]. As shown in Fig. 1, each convolutional layer is followed by non-linear layers randomized from the following types: ReLU, dropout, or pooling. We randomize kernel, stride, and padding values for each pooling layer, and we choose a dropout rate to use for all dropout layers in the Feature Extraction section.
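The filter-randomization step can be sketched as follows. This is a simplified reading of Fig. 2, not a reproduction of it: the per-layer sampling ranges in both branches are our assumptions, since only the last layer's range in the non-increasing branch is stated in the text.

```python
import random

def randomize_filters(num_convs: int, num_classes: int, max_filters: int = 400):
    """Sketch of the per-layer filter randomization depicted in Fig. 2."""
    # Random boolean: must filter counts increase from layer to layer?
    increasing = random.random() < 0.5
    # The last convolution's count is uniform in [number of classes, 400], as in
    # the text; drawing the earlier layers from the same range is our assumption.
    counts = [random.randint(num_classes, max_filters) for _ in range(num_convs)]
    if increasing:
        # Assumption: sorting enforces the sequentially increasing property.
        counts.sort()
    return counts
```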

• Classification Parameters
We randomize the number of fully connected layers, and we generate a random boolean that determines whether or not to allow any dropout layers between pairs of fully connected layers. If the boolean is True, we randomize the dropout rate to use for all dropout layers in the Classification section. Then, for each fully connected layer except the final one, we generate a random boolean to decide whether or not to add a dropout layer after this fully connected layer. If the boolean is False, the dropout rate is 0, and we do not add any dropout layers in the Classification section.
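These choices can be sketched in a few lines. The sampling interval for the dropout rate below is our assumption; the interval actually used is listed in Table 3.

```python
import random

def randomize_classification_dropout(num_fc_layers: int):
    """Sketch of the Classification-section dropout choices described above."""
    allow_dropout = random.choice([True, False])
    if not allow_dropout:
        # Dropout rate 0 and no dropout layers anywhere in this section.
        return 0.0, [False] * (num_fc_layers - 1)
    # One shared dropout rate for all dropout layers in this section
    # (assumed interval; see Table 3 for the interval actually used).
    rate = random.uniform(0.0, 1.0)
    # For every fully connected layer except the final one, flip a coin.
    flags = [random.choice([True, False]) for _ in range(num_fc_layers - 1)]
    return rate, flags
```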

• Training Parameters
We randomize the learning rate, momentum, dampening, and weight decay to use for training the NN. As depicted in Fig. 3a, for the training parameters momentum, dampening, and weight decay, we generate a random boolean to determine whether or not to activate that parameter. If the parameter is not activated, we set its value to 0. Finally, we randomize the batch size to use for training; the procedure is given in Fig. 3b. We randomize the batch size uniformly between 25 and 250 and truncate the training image dataset so that its size is divisible by the batch size. Because we validate every half epoch, we also need the number of samples in the truncated dataset to be divisible by twice the batch size. If this divisibility condition is not met, we re-randomize the batch size until it is.
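The batch-size procedure can be sketched directly from this description (a minimal rejection-sampling sketch, not the exact code used to build the dataset):

```python
import random

def randomize_batch_size(num_train_samples: int):
    """Re-sample the batch size until the truncated set divides by twice it."""
    while True:
        batch_size = random.randint(25, 250)
        # Truncate the training set to a multiple of the batch size.
        truncated = (num_train_samples // batch_size) * batch_size
        # Validating every half epoch needs an even number of batches per epoch.
        if truncated % (2 * batch_size) == 0:
            return batch_size, truncated
```

The divisibility condition is equivalent to requiring an even number of batches per epoch, so that the half-epoch boundary falls exactly between two batches.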

Neural Network Training and Validation
As noted earlier, each of our generated NNs is trained on one of three image datasets: CIFAR-100, FashionMNIST (F-MNIST), and SVHN. Each of these image datasets comes partitioned into training and testing sets; we use the training sets for training the NNs and the testing sets for validating them. The CIFAR-100 dataset contains 50,000 images for training and 10,000 images for testing. The F-MNIST dataset contains 60,000 images for training and 10,000 images for testing. The SVHN dataset contains 73,257 images for training and 26,032 images for testing.
Each generated NN trains on the training set of one of these three datasets for a total of 40 epochs. Every half epoch throughout training, the training loss is recorded and training is paused in order to validate the network on the testing set and record the NN's validation accuracy and validation loss. After validation, training resumes for the next half epoch. All neural networks are trained using stochastic gradient descent. The loss criterion used is cross entropy loss.
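The schedule above can be sketched framework-agnostically; `train_half_epoch` and `validate` are stand-ins for the actual SGD training and cross-entropy validation steps, which the sketch does not implement.

```python
def run_lifespan(train_half_epoch, validate, epochs: int = 40):
    """Alternate half an epoch of training with one validation pass."""
    records = []
    for half_step in range(1, 2 * epochs + 1):  # 80 half epochs in total
        epoch = half_step / 2.0
        train_loss = train_half_epoch()          # train on half the batches
        val_accuracy, val_loss = validate()      # pause training and validate
        records.append((epoch, train_loss, val_accuracy, val_loss))
    return records  # one row per half epoch, matching the 80 data rows per table

# Stub usage with constant stand-in fitness values:
records = run_lifespan(lambda: 1.0, lambda: (0.5, 1.2))
```

The returned records run from epoch 0.5 to epoch 40.0 in steps of 0.5, mirroring the 80 data rows of each table in the dataset.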

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.