An Optimized Convolutional Neural Network Architecture Based on Evolutionary Ensemble Learning

: Convolutional Neural Networks (CNNs) models succeed in vast domains.CNNs are availablein a variety of topologies and sizes. The challenge in this area is to develop the optimal CNN architecture for a particular issue in order to achieve high results by using minimal computational resources to train the architecture. Our proposed framework to automated design is aimed at resolving this problem. The proposed framework is focused on a genetic algorithm that develops a population of CNN models in order to find the architecture that is the best fit. In comparison to the co-authored work, our proposed framework is concerned with creating lightweight architectures with a limited number of parameters while retaining a high degree of validity accuracy utilizing an ensemble learning technique. This architecture is intended to operate on low-resource machines, rendering it ideal for implementation in a number of environments. Four common benchmark image datasets are used to test the proposed framework, and it is compared to peer competitors’ work utilizing a range of parameters, including accuracy, the number of model parameters used, the number of GPUs used, and the number of GPU days needed to complete the method. Our experimental findings demonstrated a significant advantage in terms of GPU days, accuracy, and the number of parameters in the discovered model. results indicate that the number of parameters in several models, especially DSC-based CNN models, is less than one million trainable parameters, except for ImageNet.

of CNN models. To address this issue, a research path is taken that focuses on the automation of CNN design through artificial intelligence techniques. The approaches to automated CNN programming are classified into a variety of methodologies, the majority of which are focused on Evolutionary Algorithms (EAs), such as Particle Swarm Optimization (PSO) or Reinforcement Learning (RL) [4]. For RL-based methods, such as those described in [5][6][7][8][9][10], these methods rely heavily on recurrent networks to serve as the controller for model generation. This approach consumes a significant number of computational power. For instance, in Neural Architecture Search (NAS) [5], the process requires 800 Graphical Processing Units (GPUs) over a three-week span. The successor researchers attempted to reduce computation costs by efficiency enhancements, as described in [6][7][8][9][10], but the RL methods still need a massive computational overhead, ranging from 400 GPUs for 4 days to more than 30 GPUs for 3 days.
For the PSO-based methods described in [11][12][13], the researchers must discretize the naive PSO, as architecture design is a discrete optimization problem and PSO is continuous in nature [14]. Additionally, the binary PSO inherits certain limitations, one of which is its poor convergence rate [15]. As a consequence, the PSO must be changed to account for the fact that it demands additional work overhead in addition to its high computational expense, which is particularly visible for broad image datasets [16]. EA-based methods are paradigmatic to natural selection [17]. Parallelism is enabled by EAs, which assists in preventing local optima. Genetic Algorithms (GAs) are the most often used EA technique [18]. GAs have been applied in a broad variety of domains and have demonstrated their ability to solve a variety of optimization problems, especially multi-objective problems [19]. In the CNN automated design domain, GAbased methods such as those described in [20] achieved approximately the same efficiency as RL-based methods thus using significantly less resources and time.
When developing lightweight architectures, the designer must take into account the number of parameters, as in MobileNet [21] and SqueezeNet [22]. MobileNet and SqueezeNet have used a mix of basic filters to reduce the architecture's trainable parameters, thus speeding up training [23]. The issue is that since accuracy is proportional to the number of parameters in the model, the designer must prioritize model size while maintaining accuracy, which is deemed a difficult task [24]. Additionally, manually constructing these models is a difficult challenge due to the designer's need to handle the various forms of layers and their parameters.

Novelty and Contribution
To create our proposed paradigm, we considered the following hypotheses: (1) manually designing CNNs is a time-consuming process that results in suboptimal topologies; (2) using lightweight CNN building blocks will result in a reduction in the number of parameters in discovered models; (3) produced structures would outperform manually designed structures. We consider introducing an automated GA-driven paradigm based on these working hypotheses. The proposed structure is novel in that it completely automates the process of defining lightweight CNN architectures that better match the defined domain dataset in terms of validation accuracy and parameter count.
The achievements estimated to make the study effective and have an effect on the science domain are grouped into three major categories in this proposed work: (1) Architecture Search Method: Introducing an encoding method for variable-length multi-level chromosomes that denotes the CNN model topology that can be used by a specified evolutionary algorithm approach; (2) Architecture Building Elements: Dealing with lightweight building blocks to customize the proposed framework in order to generate CNN topologies with a small number of parameters with predefined parameters; and (3) Ensemble Learning: Build a tailored ensemble at the completion of our system to improve the performance of a committee of the best generated CNN architectures obtained via the proposed framework's search mechanism.

Related Work
This section would discuss similar work on automated CNN design methods that use GA. In [20] the authors addressed a technique named Genetic CNN "GeNet" for optimizing CNN architectures by the use of GA. It is based on graph evolution and operates by connecting various convolutional layer nodes. Since the chromosomes are set in duration in this encoding process, the number of CNN nodes and stages must be predefined, preventing the exploration of several different CNN structures. Additionally, since the GeNet method only encodes layer connections, it does not accept other hyperparameters such as the number of generated feature maps, kernel size, dropout rates, or layer pooling.
In [25], a system dubbed "EDEN" was proposed that utilizes GA with two genes chromosomes reflecting the learning rate and a CNN structure. The learning rate gene encodes the value used to train each produced structure. The structure gene specifies the order of the CNN layers and the sort of operations performed by each layer. However, this method of encoding chromosomes with a fixed size results in shallow CNN-generated topologies; additionally, this method does not support skip connections or layer pooling. In [26], the authors addressed a completely automated approach-based GA named "AE-CNN" that evolves CNN topologies using blocks from ResNet and DenseNet. The developers asserted that their algorithm does not need any predefined expert experience to operate efficiently. This strategy needs several GPU days to complete. The authors of [27] suggested a methodology dubbed "CGP-CNN" that employs Cartesian Genetic Programming (CGP) for CNN topology generation. This strategy considers six distinct layers. Due to the predefined matrix dimension, CGP-CNN can only explore a finite number of CNN constructs. According to their experiments, the cost of CGP-CNN calculation is extremely high due to the time-consuming nature of the CNN fitness assessment method.
In addition to the majority voting ensemble, the authors in [28] utilized pre-trained CNN models in the initial population. However, since the produced models are based on basic building elements, the performance accuracy is poor. Recent work in [4] developed a completely automated design algorithm dubbed "CNN-GA" for generating a chromosome based on real numbers. The primary disadvantage is that the chromosome number is predetermined. Additionally, they neglected to account for the completely linked layers that would be encoded inside the created chromosome. We assume that the following gaps exist between the above associated approaches: (1) The majority of similar methods are impractical since they need a large amount of computational power and time to operate; (2) The design of lightweight CNN topologies was not contemplated in these approaches; and (3) The majority of methods did not address the possibility of integrating the generated models within an ensemble structure.

The Proposed Framework
The proposed framework's overall workflow is depicted in Fig. 1, and it is divided into three major phases: (1) it begins with the generation of an initial random CNN population for use in the GA search process; (2) it uses the GA-based search algorithm to navigate the solution space; and (3) it uses the customized stack ensemble technique to improve the overall output validation accuracy from the GA search process. In the following sub-sections, we describe each part in depth to demonstrate how it works.

Encoding Method
The proposed framework employs an encoding technique to construct the GA chromosomes that describe the created CNN architectures. Unlike the GeNet method, which is based on a onedimensional fixed binary encoded chromosome, we suggested a variable multi-level chromosome that encodes CNN parameters using real strings. The following are the benefits of using a multilevel encoding scheme for our chromosomes: (1) It supports a variety of data types; (2) The chromosomes may be expanded in terms of layers inside each block. This chromosome represents a set of blocks; the suggested encoding procedure arbitrarily initializes the number of blocks. Each block is composed of many layers that are randomly initialized. The layer is composed of CNN components.
These components are stored in the components list L c , which allows the encoding algorithm to choose one or more elements from which to build the layer inside the block. The convolutional module (F), the ReLU activation function (R) [29], batch normalization (B) [30], and dropout units (D) [31] are the layer components. The convolutional module can be Normal Convolution (NC), Depth-Wise Separable Convolution (DSC), or squeeze net fire. The use of various types of convolution, particularly the second and third, enables us to meet the requirement for a lightweight CNN architecture. The method specifies the output (Out) for each block in the produced chromosome. At the end of the chromosome, there is a final block that indicates the existence or absence of fully connected layers in the created CNN model.

Architecture Search Framework
The central component of our proposed system is the GA-based search approach [32]. The proposed framework begins by initializing some of GA's primary parameters. A random population of initial CNN architectures is produced using the proposed technique of encoding the CNN chromosomes as nested layers inside sequential blocks. The proposed structure procedure is based on a central iteration loop that regulates evolution through generation. The learning rate is known to be the most critical hyperparameters to tune while training deep CNNs. The Cyclic Learning Rate (CLR) [33] is used in our case since it practically eliminates the need to find the optimal values and schedule for global learning rates experimentally. Early Stopping (ES) is another technique that is used during the training phase [34] and it is a well-known strategy for reducing overfitting during training. This technique significantly reduces training time, as we are training a large number of different created architectures in our case.
The trained CNN is then validated using the given validation dataset in the second stage. The third operation is an assessment process that involves computing the CNN chromosome's fitness, which in this case is the validation accuracy. The system then selects a predefined number of fittest validated CNN chromosomes and saves them in a list that includes the GA Elitism chromosomes [35]. It mitigates genetic drift by ensuring that the right chromosomes pass on their characteristics across generations. This technique enables the GA to rapidly converge [36]. The GA selection procedure is based on the "Roulette Wheel" strategy of selection [37]. A two-dimensional array is constructed that includes the index of each chromosome, its fitness value, and its probability of selection value. Various operations are performed on the list during the framework process, such as inserting or deleting individuals based on their fitness, as the chromosomes may be substituted by the fittest component of each generation. Additionally, the list is iteratively sorted after each generation. The individual's fitness function f i is determined as follows: where TP "True Positives" is the class instances number that is recognized correctly, TN "True Negatives" is the class instances number that is recognized correctly which do not belong to the class, "False Positives" FP is the class instances number that the instances were mistakenly assigned to the class, and "False Negatives" FN is the class instances number that the instances were not recognized within the class instances.
To produce new offspring, the mechanism employs crossover and mutation at predefined rates that are initialized at the algorithm's outset. The framework must expand its search space to include different regions in order to increase the consistency of solutions, prevent premature convergence, and maintain chromosome diversity; the framework then enters a loop to verify chromosome similarity. The following two Equations describe the relation of two CNN chromosomes c i and c j : These similarity equations are influenced by the one used in [38], but as previously mentioned, the suggested encoding scheme relies on blocks to build the CNN chromosome, and each block has its own layers. Thus, we must first evaluate the number of blocks n B in the c i and c j to ensure that the blocks are identical in size, and then the number of layers n L inside each block to determine the size of the specified block in one chromosome B i k compared to the corresponding block in the other chromosome B j k . If they are of similar scale, the system verifies that they share common components such as the ReLU activation function, batch normalization, and dropout units. The suggested framework prevents duplications by repairing the population of CNN entities by mutation.
The new generation is created by combining the existing generation's descendants with the elite list. Prior to progressing to the next generation, the framework saves the CNN individual with the highest validity accuracy in a list called "Top Global" T G list, which includes the Top-1 CNN architectures from each generation. The key loop iterates until the predefined limit number of generations has been reached. Each CNN in the TG list has a retraining step that optimizes the weights of these CNN architectures over a predefined number of epochs. The proposed framework's final step makes use of the stacked ensemble, in which each CNN individual with its qualified weights in the global list is combined into the stack ensemble model. To obtain the overall prediction accuracy, the ensemble model is learned and validated.

The Customized Stack Ensemble
Finally, we add a customized stacking ensemble to the suggested structure [39]. The stacking ensemble approach blends several first-level classifiers by feeding their outputs to a higher-level second-level classifier (meta-classifier). The meta-level classifier is regarded as the ensemble committee's master classifier. This committee is made up of a variety of base classifiers. Each member of the committee receives unique training in order to obtain varying degrees of classification accuracy. Thus, the best CNN architectures saved in the T G list are used as the base classifiers, while the meta-classifier is chosen to be a completely linked neural network that concatenates the output classification weights from each trained CNN base model's final layer. To obtain new validation accuracy, the meta-classifier is trained and tested on the benchmark dataset.

Architecture Building Elements
The lightweight CNN models rely heavily on convolutional modules, which have a small number of trainable parameters in comparison to the standard convolutional module. As a result, we proposed that the layers within the created CNN chromosome blocks be constructed using a modified squeeze fire module. The updated fire module is constructed using Depth-wise Separable Convolution (DSC fire Module) rather than Squeeze Net's initial fire module, which is constructed using natural convolution (NC fire Module). As opposed to the initial NC fire module, this module reduces the amount of parameters by 68.34 percent [40]. The number of parameters (P FIRE−DSC ) is calculated according to: where I denotes the number of layer input channels, O S denotes the number of channels of squeeze layer output, O E denotes the number of channels of the expand layer output, and S K is the size of the kernel. The skip connection is a structural feature [41]. It functions as shortcuts through the layers, allowing the system to bypass one or more layer. The following explanations justify the usage of skip connections in our work: (1) Skip connections mitigate the effect of vanishing gradients and allow the training of very deep models; (2) They simplify the model during the early stages of training, accelerating the learning process by reusing activations from previous layers [42]; (3) As in [43], skip connections resolve the singularities problem by breaking the neural network nodes permutation symmetries.

The Datasets
MNIST [44], CIFAR-10 [45], CIFAR-100 [45], and ImageNet [46] were used as benchmark datasets in this study. They are often used datasets by researchers to evaluate various machine learning and image recognition techniques. The significant feature of these datasets is that the item in the sample image often occupies a variety of positions and areas and is not consistent across images. Additionally, they require limited formatting and preprocessing steps.

Experimental Setup
In this subsection, we will demonstrate how to set up experiments, which is a critical part of reproducible research. In our case, the experiment configuration consists primarily of GA parameter settings (as shown in Tab. 1) and CNN training parameter settings (as shown in Tab. 2). To define the maximum number of generations and population size, we must strike a compromise between obtaining the optimal solution and minimizing the time required by GA to perform the search. We found that when the GA reaches 20 generations, there is no improvement; although this calculation does not often guarantee convergence, it is considered a reasonable trade-off for reducing the search time for the method. As in [47,48], the crossover and mutation frequencies are set at 0.9 and 0.03 respectively. For training parameters, we chose to train each produced CNN for 50 epochs with 128 sample batch sizes using the optimizer "Adam stochastic optimization" algorithm [49]. The CLR system is used, with a base learning rate of 0.001 and a maximum learning rate of 0.006. During training, the cutout data augmentation technique [50] is used to prevent overfitting by randomly erasing neighboring pixels in the images to be applied as changed data samples to the dataset. After the GA method is complete, the best architectures are retrained to optimize their weights for 500 epochs.

Experiments Environment
The proposed architecture is implemented in Python 3 and trained using the Keras framework with a TensorFlow backend. The computer machine used in the experiment has an Intel core-i7-8700K 3.7 GHz CPU and 16 GB of RAM, and all the produced CNN models are trained and validated on a single GPU with the model form "NVIDIA GeForce GTX 1080."

Experiments Results
This subsection would discuss the experimental results obtained for the proposed framework in order to evaluate its success under various configurations of usable architectural building components.

Result Analysis
We investigated the impact of four different configurations on the created CNN architectures in terms of performance validation accuracy and parameter number. Normal convolution (NC), depth-wise separable convolution (DSC), NC fire module, and DSC fire module are the four configurations where each configuration is designed to be the primary convolution module for the framework's CNN model generation. We replicated the experiments ten times on each design on the four datasets chosen to determine the degree of outcome uncertainty. Tab. 3 provides a statistical evaluation of the validation accuracy for the NC fire module configuration for the ImageNet dataset using these ten runs of twenty generations in terms of mean, median, standard deviation, minimum, and maximum. Fig. 2 illustrates the plot of ten runs of this configuration through twenty GA generations in terms of maximal validity accuracy. The plot depicts the search evolution mechanism for feasible validation accuracies in different runs of the proposed framework, which continues to converge over generations. The results in Tabs. 4 and 5 show the performance of the proposed model concerning validity accuracy (Val. Acc.) for the fewest CNN model trainable parameters (Param. #) for the four configurations (see Tab. 4) along with its GPU days needed and stack ensemble overall accuracy (See Tab. 5). As seen in Tab. 4, when we ran our experiments using the four configurations as the architecture building blocks inside the created CNN chromosomes, we discovered that using the NC fire module or DSC fire module results in an increase in model accuracy and a decrease in parameter number. As a general observation, the configuration that performs the highest in terms of validation accuracy is often the produced CNN models based on the NC Fire Module, while the configuration with the fewest parameters is the generated CNN models based on DSC convolution. Meanwhile, we see that the DSC fire module configuration is superior in these two respects because it enables us to achieve an acceptable level of validation precision with a small number of model trainable parameters, which has a direct effect on the time required for the proposed framework's GA search operation. The results indicate that the number of parameters in several models, especially DSC-based CNN models, is less than one million trainable parameters, except for ImageNet.   We observed that, on average, the fire modules deepen the produced models without significantly increasing the number of parameters. As shown in Tab. 5, when applied to different datasets, the customized stack ensemble technique improves overall validation accuracy by 0.4 percent to 1.7 percent in the case of the CIFAR-10 dataset and by 0.3% to 1.6% in the case of the CIFAR-100 dataset. In MNIST, the average validity performance improves by 0.09 to 0.18% following stack ensemble. When compared to other datasets, the increase in accuracy due to the stack ensemble process is small in the case of MNIST. This is because the top CNN models trained on MNIST reached nearly slope accuracy prior to entering the stack ensemble process. From a time perspective, the system completed the process in a range of 4.28 to 6.3 GPU days for the three datasets, which is deemed low in comparison to similar work.

Comparison with Related Work
To verify our proposed methodology, we compared it to other related work approaches that make use of the same benchmark datasets using the validation accuracy metric. Additionally, since we concentrated on lightweight and resource-constrained models, we compared the model trainable parameters (Param. #), the GPU days, and the amount (no.) of GPUs used, as seen in Tabs. 6-9. In Tab. 6, we test the proposed framework on the MNIST dataset against EDEN [25]. The suggested framework achieved 1.182% more than their validation accuracy; however, they have a smaller number of parameters for their model and fewer GPU days. As seen in Tab. 7, our validation accuracy beats the best-related peer competitor CNN-GA by 0.02%. While this is a tiny amount, we achieve this precision by reducing the number of parameters by 41% in a CNN model of 1.9 million parameters. As seen in Tab. 8, we achieved a 0.6% increase over CNN-GA by reducing the number of parameters by 5%. In the case of ImageNet in Tab. 9, the ideal configuration improves validation accuracy by 13.2% as opposed to the automated solution EAT-Net and reduces the number of parameters by 1.98%. Meanwhile, in the hand-crafted domain, the proposed model outperforms EfficientNet by 1.13% in accuracy and significantly reduces the number of parameters.

Conclusions and Future Work
In this article, we suggested a method for finding lightweight CNN models that is built on a genetic algorithm. To create and reflect CNN models, the proposed architecture employs a novel encoding process. This encoding method is used by the framework search process to describe the created CNN models as solutions in order to construct the solution search space. Validation of the system is performed using a variety of image benchmark datasets. It outperformed competitors in terms of validation precision, GPU days, and model parameter count. Additionally, a stack ensemble approach was adapted for our challenge, and the experiments demonstrate that it outperformed the single best-generated model. Future research would concentrate on reducing the amount of time spent on the search method by proposing and implementing increasingly sophisticated search algorithms. These search algorithms will incorporate multiple-objective fitness requirements in order to handle various facets of the CNN architecture. Another potential enhancement is the addition of new layer elements such as Long-Short Time Memory units or some other kind of layer element capable of accommodating non-sequential models in order to provide a more flexible architecture capable of handling any configuration topology.

Funding Statement:
The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.