1 Introduction

In recent years, supervised learning with multilayer perceptron neural networks (MLPNN) has received considerable attention in the research community. Various approaches have been proposed to solve this problem, including classical learning [1] and optimization methods [2], and their effectiveness has been widely discussed.

Developing an effective neural network is a complex and essential task in machine learning and artificial intelligence [3]. It requires selecting the network’s architecture and training configuration so as to achieve the best possible performance. These decisions are governed by hyperparameters: parameters that define the network’s behavior but cannot be learned directly from the training data [4]. Hyperparameters include the number of hidden layers, learning rate, batch size, regularization parameters, activation functions, and more [5]. Properly tuning them is vital for neural network performance [6]; their impact is well documented in the literature [3, 7, 8], and the chosen values significantly affect the network’s accuracy, generalization ability, and training speed [9]. To address the challenge of hyperparameter optimization, several approaches have been proposed, including grid search [10], random search [11], Bayesian optimization [12], and genetic algorithms [13]. While some of these approaches have proven effective in certain cases, they can be computationally expensive and may not be suitable for optimizing real-valued hyperparameters and large-scale neural networks [11, 14, 15].

Owing to its robustness on real-world optimization problems, the real-coded genetic algorithm (RCGA) has established itself as a highly effective and widely used evolutionary algorithm (EA). Recent literature reports numerous successful applications of RCGAs across diverse fields [16,17,18,19,20], underscoring their prominence in contemporary optimization practice. Inspired by natural selection and the survival of the fittest, RCGAs operate as population-based stochastic search schemes that apply a structured sequence of selection, crossover, and mutation operators to explore and optimize solutions. Over the past few decades, significant progress has been made in improving the efficiency of RCGAs, and earlier developments can be categorized by the mechanisms and techniques they employ. Chuang and colleagues [21] conducted a comprehensive survey of advances in crossover operators [22,23,24,25,26,27,28]. They found that two predominant schemes, namely line segment connection and distribution analysis of parents, have been widely employed in the development of new crossover operators. However, these crossover operators face challenges on highly demanding optimization problems. Specifically, they may struggle to generate viable offspring in regions of ambiguity, especially when the population is small relative to the search space or when the initial population is not uniformly distributed across the permissible domain [21].
Furthermore, the designed crossover operators may encounter difficulties in locating the global optimum, particularly when the true solution is in close proximity to or situated along the boundaries of the feasible search space [23]. To address the challenges associated with crossover operators, particularly in cases where optimization problems are subject to stringent conditions, this study introduces a novel approach that employs the Simulated Binary Crossover (SBX) operator [29]. SBX is designed to facilitate efficient exploration of the search space, even in the presence of complex constraints [30]. Unlike conventional crossover operators that rely on fixed or random step sizes for recombination, SBX takes an adaptive approach. It adjusts the step size dynamically based on the relative fitness information obtained from the objective function and the prevailing constraint violation status [31]. This adaptability allows SBX to effectively explore the continuous search space for real-valued parameters, enhancing its ability to efficiently address complex optimization tasks. In addition to the utilization of the Simulated Binary Crossover (SBX) operator, this paper also leverages the benefits of the polynomial mutation operator [32] to further enhance the optimization process. Polynomial mutation serves as a valuable complement to the crossover operator by introducing additional diversity and exploration capability in the population [33]. This mutation operator operates by perturbing individual gene values, thus enabling the algorithm to explore uncharted regions of the solution space.

The application of these genetic operators is tailored to the optimization of the MLPRGA+5 model, which serves as a comprehensive approach to configuring Multilayer Perceptron neural networks. By optimizing both the network’s architecture and the associated hyperparameters, which encompass variables like learning rates, activation functions, and regularization hyperparameters, this algorithm efficiently explores complex optimization landscapes. This multi-faceted approach enhances the algorithm’s capacity to fine-tune hyperparameters and optimize neural network architectures effectively, resulting in superior performance when applied to tasks such as classification.

To evaluate the proposed approach, the study uses four datasets from the UCI repository of machine learning databases [34]: "Iris", "Hypothyroid", "Breast Cancer", and "Wine". The MLPRGA+5 approach is assessed through numerical and manual analysis, which indicates strong performance in both complexity reduction and accuracy enhancement. The main objectives of this paper are as follows:

  • Provide a new optimization model for MLP hyperparameter tuning.

  • Solve the proposed model using a real-coded genetic algorithm.

  • Compare the proposed approach with existing MLP-based models.

The paper is structured as follows. After establishing the topic in Sect. 1, Sect. 2 reviews existing work on hyperparameter optimization in neural networks and provides context on the methods used for training MLPNNs. Section 3 presents a new optimization model for MLP hyperparameter tuning, with a detailed description of the model, including the rationale for the objective function and the proposed constraints. Section 4 is dedicated to the Real-Coded Genetic Algorithm (RCGA) solver for MLP, adapted to fit the MLPRGA+5 model proposed in Sect. 3. Finally, Sect. 5 outlines the experiments and results before concluding in Sect. 6.

2 Related Works

Tuning the weights and parameters of a Multilayer Perceptron (MLP) model is a critical machine learning task, as it directly impacts the model’s performance. In recent years, a variety of optimization techniques have been proposed for optimizing the architecture and hyperparameters of MLPs, including gradient descent, random search, and Bayesian optimization. Genetic algorithms (GA), however, have attracted particular interest because of their effectiveness in handling high-dimensional and non-linear search spaces, and their use in optimizing MLP designs and hyperparameters has produced encouraging results in terms of model accuracy and generalization. In this section, we review recent results in MLP model optimization using genetic algorithms and discuss their advantages and limitations.

The aim of MLPNN training is to find the weights that optimize an objective function, which makes training an optimization problem. Once the optimization model has been formulated, a number of alternative algorithms can be applied. Traditional methods such as back-propagation (BP) have long been used for MLPNN training [35]. However, more elaborate optimization methods such as genetic algorithms (GA) have also been shown to be effective for optimizing MLP parameters and hyperparameters. The advantage of GA is that it searches a large space in parallel for the optimum solution, reducing the risk of getting stuck at a local minimum. The use of genetic algorithms to improve the weights of MLP models has been the subject of numerous studies [36]. Wang et al. [37] use GA to optimize the initial weights and biases of a BP neural network in order to predict the bending force in the hot strip. Jasmeen Gill et al. [38] combine BP and GA for weather forecasting, using daily weather data in their experiments. Saremi et al. [39] improved the weights of an MLP model for speech emotion identification using a hybrid of GA and BP, achieving greater accuracy than with BP alone. MLP weight optimization has also been carried out with other hybrid methods, such as GA with simulated annealing (GA-SA) and GA with particle swarm optimization (PSO), both of which performed better than either constituent algorithm alone.

Optimizing the architecture and hyperparameters of MLP models can also be seen as an optimization problem [3]. Several studies have optimized the performance of MLP neural networks by changing various architecture parameters and hyperparameters. Architecture parameters, such as the number of hidden layers, the number of neurons per hidden layer, and the type of activation function, describe the structure of the MLP. Sam Ansari et al. [40] utilized a genetic algorithm to find the best parameter values for a multi-layer perceptron using different chromosomal coding techniques, while Jenny Domashova et al. [41] used it to identify the best architecture for precisely classifying situations. Sreedharan et al. [42] combine an MLP neural network with a GA to predict financial distress in businesses. The GA optimizes hyperparameters such as the number of hidden layers, neuron counts, and activation functions, improving the MLP’s performance and supporting stakeholders in risk assessment decisions. Daviran et al. [43] focus on landslide susceptibility prediction with Artificial Neural Networks (ANN), using a genetic optimization algorithm to fine-tune the hyperparameters of these models and thereby improve the accuracy of the susceptibility predictions. Kumar et al. [44] optimize the hyperparameters of Deep Neural Networks (DNN) through a two-step genetic approach, in which Genetic Algorithms (GA) explore hyperparameter combinations and Genetic Programming (GP) evolves neural architectures. Mohan et al. [45] introduce an automated superlearner that combines multiple machine learning models for improved predictive accuracy; a GA-based tuning procedure searches for the best hyperparameters of each model within the superlearner. Abdollahi et al. [46] predict diabetes using a hybrid stacked ensemble whose hyperparameters are optimized by GAs, showing the effectiveness of combining ensemble methods with GA-based hyperparameter tuning for medical diagnosis tasks. Hong Wang et al. [47] improved the architecture of an artificial neural network for computing slope stability safety factors. Tayebi et al. [48] analyze the performance of metaheuristic algorithms in optimizing hyperparameters for fraud transaction detection models, examining how effectively various metaheuristic techniques improve the accuracy of fraud detection systems. Arukonda et al. [49] introduce an ensemble approach for disease diagnosis that emphasizes diversity among models, using GAs to enhance diversity and optimize the ensemble’s performance. To assess indoor environmental conditions in real time, Miguel Martinez-Comesa et al. [50] recommended employing optimized MLP neural networks and used the multiobjective genetic algorithm NSGA-II to determine the architecture with the lowest error and complexity. Furthermore, by adding a decision variable to each layer, researchers have created a mathematical model to optimize hidden layers [2, 51].

The MLP model’s learning parameters, such as learning rate, momentum, activation function, and regularization parameters, are referred to as hyperparameters. Many conventional methods, including grid search and random search, have been applied to optimize them. More sophisticated optimization methods, such as genetic algorithms, PSO, and Bayesian optimization, have also been used and have improved generalization and increased accuracy.

Li et al. [52], for example, introduced a hybrid GA and PSO algorithm to optimize the hyperparameters of MLP models for financial time series forecasting, outperforming grid search and random search. Wang et al. [53] use Bayesian optimization to optimize the hyperparameters of MLP models for breast cancer diagnosis, achieving greater accuracy than grid search and random search. Additionally, Zhang et al. [54] used an adaptive GA to optimize the hyperparameters of MLP models for multi-label classification tasks, outperforming various state-of-the-art optimization strategies. Particle swarm optimization (PSO) [55], differential evolution (DE) [56], and simulated annealing (SA) [57] are further optimization methods that have been applied to MLP models.

3 A New Optimization Model for MLP Hyperparameter Tuning

The hyperparameter tuning task for MLPs amounts to solving an optimization problem. When constructing an MLP model, the weight parameters are initialized and iteratively optimized using various optimization methods until the objective function reaches a minimum value or the accuracy reaches a maximum value [3]. Similarly, hyperparameter optimization methods focus on optimizing the model’s architecture by identifying the most suitable hyperparameter configurations. In this section, we introduce the core principles of mathematical optimization and hyperparameter optimization, along with the proposed approach.

3.1 Hyperparameter Tuning in MLPs

In order to develop machine learning models that use hyperparameter optimization, it is necessary to explore the space of possible hyperparameter values to identify the parameters that make the model work best. By employing optimization techniques, we may successfully navigate this search space and find the hyperparameters that lead to the model operating at its most effective level. The four main components of hyperparameter optimization are an estimator with an objective function, a search space, an optimization method, and an evaluation function. These components work together to determine the best set of hyperparameters for the model.
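As a rough illustration of how the four components fit together, the sketch below uses random search as the optimization method. The search space, the names `SEARCH_SPACE`, `toy_objective`, and `random_search`, and the quadratic stand-in objective are all assumptions made for illustration; in practice the objective would train the estimator and return a validation score.

```python
import random

# Hypothetical search space: hyperparameter name -> (lower bound, upper bound).
SEARCH_SPACE = {"learning_rate": (1e-4, 1e-1), "l2": (0.0, 1e-2)}

def toy_objective(config):
    """Stand-in for 'train the estimator and return its validation loss'.
    The quadratic bowl is illustrative only, not a real training run."""
    return (config["learning_rate"] - 0.01) ** 2 + (config["l2"] - 0.001) ** 2

def random_search(objective, space, n_trials=200, seed=0):
    """Optimization method: sample configurations uniformly, keep the best."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(config)  # evaluation function
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(toy_objective, SEARCH_SPACE)
```

Any other optimization method, such as a genetic algorithm, could replace `random_search` without changing the search space or the objective.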

In mathematics, the term "optimization" refers to the process of selecting the best option among a set of feasible alternatives in order to maximize or minimize an objective function [58]. The unconstrained optimization problem is given by [59]:

$$\begin{aligned} \min _{X \in {\mathbb {R}}} f(X) \end{aligned}$$
(1)

In contrast, a constrained optimization problem can be expressed as [59]:

$$\begin{aligned} \left\{ \begin{array}{l} \min f(X) \\ \text { Subject to: } \\ g_i(X) \le 0, \quad i=1, \ldots , m \\ h_j(X)=0, \quad j=1, \ldots , p \end{array}\right. \end{aligned}$$
(2)

where f(X) is the objective function of the problem, \(\textrm{X}\) is the domain of X, and \(g_i(X) \le 0, i=1, \ldots , m\), and \(h_j(X)=0, j=1, \ldots , p\), are the inequality and equality constraint functions, respectively. The feasible domain \(\textrm{D}\) of \(\textrm{X}\) is as follows:

$$\begin{aligned} \textrm{D}=\left\{ X \in \textrm{X} \mid g_i(X) \le 0, \; h_j(X)=0\right\} \end{aligned}$$
(3)

The objective of a hyperparameter optimization task is to find [60]:

$$\begin{aligned} X^*=\arg \min _{X \in {\mathbb {R}}} f(X) \end{aligned}$$
(4)

The best prediction model \(f^*\) can be determined by using [61]:

$$\begin{aligned} f^*=\arg \min _{f \in F} \frac{1}{n} \sum _{i=1}^n L\left( f\left( \textrm{x}_{\textrm{i}}\right) , \textrm{y}_{\textrm{i}}\right) \end{aligned}$$
(5)

where \(\textrm{L}\) is the cost function value for each sample, \(\textrm{n}\) is the number of training data points, \(\textrm{x}_{\textrm{i}}\) is the feature vector of the i-th instance, and \(\textrm{y}_{\textrm{i}}\) is the associated output. Supervised learning algorithms use many different loss functions, such as the squared Euclidean distance, cross-entropy, information gain, etc. [61].
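Eq. (5) can be illustrated with a small sketch that selects, from a finite family of candidate models, the one with the lowest average loss. The candidate models, the data, and the function names are hypothetical and chosen only to make the arg min concrete; the squared loss is one of the options mentioned above.

```python
def squared_loss(prediction, target):
    """L(f(x_i), y_i) for a single sample."""
    return (prediction - target) ** 2

def empirical_risk(model, xs, ys, loss=squared_loss):
    """Average loss (1/n) * sum_i L(f(x_i), y_i), the quantity minimized in Eq. (5)."""
    return sum(loss(model(x), y) for x, y in zip(xs, ys)) / len(xs)

def best_model(candidates, xs, ys, loss=squared_loss):
    """arg min over a finite candidate family F, returned by name."""
    return min(candidates, key=lambda name: empirical_risk(candidates[name], xs, ys, loss))

# Toy data generated by y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
candidates = {
    "constant": lambda x: 4.0,
    "identity": lambda x: x,
    "affine": lambda x: 2.0 * x + 1.0,
}
chosen = best_model(candidates, xs, ys)
```

Here the family F is a small dictionary; for an MLP, F would be the set of networks reachable by varying the weights and hyperparameters.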

3.2 Proposed Approach

Machine learning models aim to optimize performance by finding the best values for parameters that affect the output. One key type of parameter is weight parameters, which are initially set and then adjusted using an optimization technique until the model achieves a high level of accuracy or reaches a critical point in its objective function.

In order to create an effective machine-learning model, it is important to strike a balance between accuracy and complexity. This can be achieved by using hyperparameter optimization to find the optimal values for the model’s hyperparameters. By optimizing hyperparameters, we can ensure that the model is both accurate and not overly complex, resulting in a model that performs well on both training and testing data. This work introduces a novel mathematical programming approach for optimizing neural networks, which can be seen as a mixed-constraint non-linear programming model. This model possesses a unique capability - it allows us to introduce new variables into the objective function. These variables serve as a mechanism for optimizing hyperparameters within a Multilayer Perceptron (MLP). Notably, these variables also act as a pruning technique. Their role is to systematically identify and eliminate any hidden layers within the MLP that may not be contributing effectively to the model’s performance. Additionally, they facilitate adjustments to the model’s hyperparameters, enabling fine-tuning.

The application of this approach holds significant promise. By eliminating unnecessary hidden layers and optimizing hyperparameters, we effectively reduce the model’s complexity. This streamlined structure, in turn, leads to a notable reduction in the number of parameters (or "weight loss"), resulting in a more efficient and computationally lightweight model. The approach introduces certain constraints that facilitate communication between different levels of the neural network, aiding in the optimization process. These constraints are an integral part of the model, serving to guide the optimization towards achieving excellent results.

This paper presents a new optimization model in order to:

  • Optimize the architecture parameters of the neural network.

  • Combine hyperparameter tuning with architecture optimization.

The following subsections describe the proposed model.

3.2.1 Notation

To model the problem of neural hyperparameter optimization, we introduce the following notation:

  • \(\textrm{X} \quad \): Set of inputs, used as the explanatory variables of the model.

  • N: Number of hidden layers

  • \(\textrm{L} \quad \): Total number of hyperparameters

  • \(\textrm{n}_0\): Number of neurons in the input layer

  • \(n_i\): Number of neurons in hidden layer i, for \(i=1 \ldots N\).

  • \(\textrm{n}_{o}\): Number of neurons in the output layer.

  • \(\textrm{n}_{\textrm{opt}}\): Optimal number of hidden layers

  • \(h_i \quad \): Output of hidden layer i, for \(i=1 \ldots N\).

  • f: Activation function

  • O: Optimizer

  • \(\mathcal {\alpha } \quad \): Learning rate

  • \(\mathcal {\lambda } \quad \): Regularisation parameter

  • Y: The predicted output of the model

  • d: The desired output

  • F: Transfer function of ANN

  • W: The weights of the network with:

    $$\begin{aligned} W=\left( w_{k, j}^i\right) _{\begin{array}{c} 0 \le i \le N \\ 1 \le k \le n_i \\ 1 \le j \le n_{i+1} \end{array}} \quad , \quad w_{k, j}^{i} \in {\mathbb {R}} \end{aligned}$$
  • H: Denotes the search space of hyperparameters (configuration space). \({{H}_{i}}=\left\{ \begin{array}{l} \textrm{n}_{\textrm{i}} \text { if } 1 \leqslant i \leqslant N \\ \{f, O, \alpha , \lambda \} \text { if } N+1 \leqslant i \leqslant L \end{array}\right. \)

  • \(\textrm{c}_{\textrm{n}_{\textrm{i}}} \quad \): Binary variable for \(i=1 \ldots N\), defined as: \(\textrm{c}_{\textrm{n}_{\textrm{i}}}=\left\{ \begin{array}{ll} 1 &{} \text { if the number of randomly generated neurons } \textrm{n}_{\textrm{i}} \text { associated with layer } \textrm{i} \text { is greater than } 0 \\ 0 &{} \text { otherwise } \end{array}\right. \)

A multilayer perceptron performs a transformation of the input variables:

$$\begin{aligned} \textrm{Y}=\textrm{F}(\textrm{X}; \textrm{W}; \textrm{H})=\left( \textrm{y}_1, \textrm{y}_2, \ldots , \textrm{y}_{\textrm{n}_{\textrm{N}+1}}\right) \end{aligned}$$
(6)

where \(\textrm{H}=\left( \textrm{H}_\textrm{1}, \ldots , \textrm{H}_{\textrm{L}}\right) \in {\mathbb {R}}_{+}^{L}\) and \(\textrm{C}=\left( \textrm{c}_{\textrm{n}_1}, \ldots , \textrm{c}_{\textrm{n}_{\textrm{N}-1}}\right) \in \{0,1\}^{N-1}\). The output of the first hidden layer, \(h_1\), is computed as follows and illustrated in Fig. 1:

$$\begin{aligned} h_1=\left( \begin{array}{c} h_1^1 \\ \vdots \\ h_1^j \\ \vdots \\ h_1^{n_1} \end{array}\right) =\left( \begin{array}{c} f\left( \sum _{k=1}^{n_0} w_{k, 1}^0 x_k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_0} w_{k, j}^0 x_k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_0} w_{k, n_1}^0 x_k\right) \end{array}\right) \end{aligned}$$
(7)

where \(\left( \textrm{x}_1, \textrm{x}_2, \ldots , \textrm{x}_{\textrm{n}_0}\right) \) are the inputs of the neural network.

Fig. 1 Output of the first hidden layer

The output of hidden layer \(h_i\) is computed as follows and illustrated in Fig. 2:

$$\begin{aligned} h_i=\left( \begin{array}{c} h_i^1 \\ \vdots \\ h_i^j \\ \vdots \\ h_i^{n_i} \end{array}\right) =h_{i-1}\left( 1-\prod _{k=1}^i c_{n_k}\right) +\prod _{k=1}^i c_{n_k} \cdot \left( \begin{array}{c} f\left( \sum _{k=1}^{n_{i-1}} w_{k, 1}^{i-1} h_{i-1}^k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_{i-1}} w_{k, j}^{i-1} h_{i-1}^k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_{i-1}} w_{k, n_i}^{i-1} h_{i-1}^k\right) \end{array}\right) \end{aligned}$$
(8)

where \(\textrm{i}=2 \ldots \textrm{N}-1\). The output of the last hidden layer, \(h_N\), is computed as follows and illustrated in Fig. 3:

$$\begin{aligned} h_N= & {} \left( \begin{array}{c} h_N^1 \\ \vdots \\ h_N^j \\ \vdots \\ h_N^{n_N} \end{array}\right) =h_{N-1}\left( 1-\prod _{k=1}^N c_{n_k}\right) +\prod _{k=1}^N c_{n_k} \cdot \left( \begin{array}{c} f\left( \sum _{k=1}^{n_{N-1}} w_{k, 1}^{N-1} h_{N-1}^k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_{N-1}} w_{k, j}^{N-1} h_{N-1}^k\right) \\ \vdots \\ f\left( \sum _{k=1}^{n_{N-1}} w_{k, n_N}^{N-1} h_{N-1}^k\right) \end{array}\right) \end{aligned}$$
(9)
$$\begin{aligned} y_j= & {} f\left( \sum _{k=1}^{n_N} w_{k, j}^N h_N^k\right) \quad j=1 \ldots n_{N+1} \end{aligned}$$
(10)

where \(y_j\) is the j-th output of the neural network.

Fig. 2 Output of the ith hidden layer

Fig. 3 Output of the ANN
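The layer recursion in Eqs. (7) to (10) can be sketched in a few lines of plain Python. This is a minimal sketch under one added assumption: all hidden layers have the same width, so that the gated sum of \(h_{i-1}\) and the newly computed activations is well defined. The helper names `layer` and `forward`, the tanh activation, and the random weights are illustrative and not taken from the paper.

```python
import math
import random

def layer(h_prev, W, f):
    """Apply f to sum_k W[k][j] * h_prev[k] for every output unit j."""
    return [f(sum(W[k][j] * h_prev[k] for k in range(len(h_prev))))
            for j in range(len(W[0]))]

def forward(x, weights, c, f=math.tanh):
    """weights[i] maps layer i to layer i+1; c[i-1] is the flag of hidden layer i."""
    n_hidden = len(weights) - 1
    h = layer(x, weights[0], f)                  # Eq. (7): first hidden layer
    for i in range(2, n_hidden + 1):             # Eqs. (8)-(9): layers 2..N
        gate = math.prod(c[:i])                  # 1 only if layers 1..i are all active
        candidate = layer(h, weights[i - 1], f)
        h = [(1 - gate) * old + gate * new for old, new in zip(h, candidate)]
    return layer(h, weights[n_hidden], f)        # Eq. (10): network output

rng = random.Random(0)
sizes = [3, 4, 4, 4, 2]  # n_0, three equal-width hidden layers, output layer
weights = [[[rng.uniform(-1, 1) for _ in range(b)] for _ in range(a)]
           for a, b in zip(sizes, sizes[1:])]
x = [0.5, -1.0, 2.0]
y_full = forward(x, weights, c=[1, 1, 1])    # all hidden layers active
y_pruned = forward(x, weights, c=[1, 0, 0])  # layers 2 and 3 are bypassed
```

Because the gate is a running product of the flags, switching off layer i also bypasses every later hidden layer, which is how the binary variables act as a pruning mechanism in this formulation.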

3.2.2 Objective Function

The objective function utilized in the proposed model is the mean squared error (MSE), which is a widely used error function for training neural networks. The MSE measures the average squared difference between the predicted output and the actual output. Here it is computed as the sum of squared differences over all J training examples, scaled by \(1/(2J)\):

$$\begin{aligned} Loss=\frac{1}{2J} \sum _{j=1}^J\left( F_j(X; W;H)-d_j\right) ^2 \end{aligned}$$
(11)

In the formula:

  • \(\textrm{J}\): represents the size of the example set, indicating the number of training examples.

  • \(F_j(X; W; H)\): denotes the predicted output based on the input X, the model weights W, and the hyperparameters H.

  • \(d_j\): represents the actual output for the corresponding training example.

  • \(\sum \): denotes the sum over all training examples \(j=1\) to J.

By calculating the squared differences between the predicted and actual outputs for each training example, summing them up, and scaling by \(1/(2J)\), we obtain the loss. This error function quantifies how well the model approximates the desired output for the given training examples.

The utilization of the MSE as the objective function allows the model to optimize its parameters and hyperparameters to minimize the discrepancy between the predicted and actual outputs. This optimization process aims to enhance the model’s performance and improve its ability to generalize to unseen data.
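For concreteness, Eq. (11) can be written as a short function. The name `half_mse` and the scalar-output setting are assumptions made for this sketch; the factor 1/2 is the one appearing in Eq. (11), a convention that commonly simplifies the gradient of the squared term.

```python
def half_mse(predictions, targets):
    """Loss = 1/(2J) * sum_j (F_j - d_j)^2, following Eq. (11)."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equally long")
    J = len(predictions)
    return sum((p - d) ** 2 for p, d in zip(predictions, targets)) / (2 * J)

# Two training examples: one predicted exactly, one off by 2.
loss = half_mse([1.0, 2.0], [1.0, 4.0])  # (0 + 4) / (2 * 2) = 1.0
```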

3.2.3 Constraints and Proposed Model

To prevent the complete removal of the hidden layers and ensure that at least one hidden layer remains, the first binary variable is fixed to one. This constraint is stated as follows:

$$\begin{aligned} \textrm{c}_{\textrm{n}_1}=1 \end{aligned}$$
(12)

Setting the lower and upper boundaries of the hyperparameters’ range of values is crucial to guaranteeing that the algorithm produces valid results. This limits the range of values the algorithm can consider and stops it from proposing solutions outside the acceptable range. We avoid unimportant portions of the search space by setting the lower bound to the smallest admissible value and the upper bound to the largest.

  • To define the range of real values for the size of a hidden layer, we determine the minimum and maximum values it can take. Specifically, we assign the interval \(\left[ \textrm{n}_{\textrm{i} \min }, \textrm{n}_{\textrm{i} \max }\right] \) to hidden layer i.

    $$\begin{aligned} \textrm{n}_{\textrm{i} \min } \le \textrm{n}_{\textrm{i}} \le \textrm{n}_{\textrm{i} \max } \quad \forall \textrm{i}=2 \ldots \textrm{N} \end{aligned}$$
    (13)
  • For each remaining hyperparameter, also denoted \(\textrm{n}_{\textrm{i}}\) with \(\textrm{i}=\textrm{N}+1 \ldots \textrm{L}\), we assign a valid range \(\left[ 0, \textrm{n}_{\textrm{i} \max }\right] \) that defines the values that can be optimized.

    $$\begin{aligned} 0 \le \textrm{n}_{\textrm{i}} \le \textrm{n}_{\textrm{i} \max } \quad \forall \textrm{i}=\textrm{N}+1 \ldots \textrm{L} \end{aligned}$$
    (14)

The concept of optimizing neural architecture can be represented by the following model:

$$\begin{aligned} \left\{ \begin{array}{l} \begin{aligned} &{} {\text {Min}} \frac{1}{2J} \sum _{j=1}^J\left( F_j(X; W;H)-d_j\right) ^2 \\ &{} \text { Subject to }: \\ &{} \textrm{H}=\left( \textrm{H}_{\textrm{1}}, \ldots , \textrm{H}_{\textrm{L}}\right) \in {\mathbb {R}}^{L} \\ &{} \textrm{C}=\left( \textrm{c}_{\textrm{n}_1}, \ldots , \textrm{c}_{\textrm{n}_{\textrm{N}-1}}\right) \in \{0,1\}^{N-1}\\ &{} \textrm{c}_{\textrm{n}_1}=1 \text { Or } \sum _{i=2}^N \textrm{c}_{\textrm{n}_{\textrm{i}}} \ge 1 \\ &{} \textrm{n}_{\textrm{i} \min } \le \textrm{n}_{\textrm{i}} \le \textrm{n}_{\textrm{i} \max } \quad \forall \textrm{i}=2 \ldots \textrm{N} \\ &{} 0 \le \textrm{n}_{\textrm{i}} \le \textrm{n}_{\textrm{i} \max } \quad \forall \textrm{i}=\textrm{N}+1 \ldots \textrm{L} \\ &{} \textrm{W}=\left( \textrm{w}_{\textrm{k}, \textrm{j}}^{\textrm{i}}\right) \quad \begin{array}{c} 1 \le \textrm{k} \le \textrm{n}_{\textrm{i}} \\ 1 \le \textrm{j} \le \textrm{n}_{\textrm{i}+1} \end{array} \quad \text { Where } \quad \textrm{w}_{\textrm{kj}}^{\textrm{i}} \in {\mathbb {R}} \\ \end{aligned} \end{array}\right. \end{aligned}$$
(15)

The optimal number of hidden layers is then given by:

$$\begin{aligned} c_{\text {opt }}=\sum _{i=1}^{N-1} c_{n_i} \end{aligned}$$
(16)
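The structural constraints and the layer count can be checked with a small helper. This is a sketch under stated assumptions: layer sizes are passed as a list `n` with one entry per hidden layer, the function names are hypothetical, and only Eqs. (12), (13), and (16) are covered; the hyperparameter bounds of Eq. (14) would be checked in the same way.

```python
def layer_flags(n):
    """c_{n_i} = 1 if hidden layer i has at least one neuron, else 0 (Sect. 3.2.1)."""
    return [1 if n_i > 0 else 0 for n_i in n]

def is_feasible(n, n_min, n_max):
    """Check Eq. (12) for the first layer and the bounds of Eq. (13) for layers 2..N."""
    if layer_flags(n)[0] != 1:  # Eq. (12): the first hidden layer must exist
        return False
    return all(lo <= n_i <= hi
               for n_i, lo, hi in zip(n[1:], n_min[1:], n_max[1:]))

def count_active_layers(c):
    """Sum of the binary flags, as in Eq. (16); pass the flags the sum should cover."""
    return sum(c)

n = [8, 0, 5]  # hypothetical layer sizes: the second hidden layer is pruned
feasible = is_feasible(n, n_min=[1, 0, 0], n_max=[16, 16, 16])
active = count_active_layers(layer_flags(n))
```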

During the training phase, we employ a randomized and sequential data input strategy. This approach ensures that each data point enters the network without any predetermined order or bias, contributing to a comprehensive learning experience. However, the true strength of our approach lies in addressing two critical aspects: the determination of the optimal number of hidden layers and the configuration of weights. Deciding on the right number of hidden layers significantly influences the network’s ability to model complex data relationships. Simultaneously, weight configurations govern the flow of information within the network, making them pivotal in fine-tuning network responses. Additionally, the choice of an appropriate training period is essential, as it dictates how long the neural network learns from data. The training duration should align with the network’s complexity, enabling weights to adapt and optimize their responses according to the problem’s intricacies. Our approach, therefore, encompasses both architectural considerations and temporal dynamics, with a focus on optimizing the neural network’s potential.

Over time, various methods and strategies have been proposed to tackle the optimization of hyperparameters in Multilayer Perceptron (MLP) training. In this context, we advocate for the utilization of Real-Parameter Genetic Algorithms (RGAs) to directly manage real-valued variables, as opposed to an initial conversion to binary strings, within the described model.

4 Real Coded GA for Proposed Approach: MLPRGA+5

Genetic algorithms (GAs) are search algorithms inspired by the principles of natural genetic populations and are used to find solutions to a wide range of problems. Real-coded genetic algorithms (RCGAs) are a type of GA that uses real-valued vectors to represent individuals in the population. In optimization problems involving continuous-domain variables, it is more intuitive to represent genes directly as real numbers. This choice aligns the solutions closely with their natural formulation, effectively eliminating the distinction between the genotype (coding) and the phenotype (search space). Real coding was initially introduced in specific applications, such as the work of Lucasius et al. [62] on chemometric problems, Davis [63] on the use of meta-operators for finding optimal parameters of standard genetic algorithms, and [64] on numerical optimization in continuous domains. For a comprehensive overview of real-coded genetic algorithms, see [65,66,67].

As mentioned earlier, real coding proves to be the most suitable encoding method for optimization problems involving parameters in continuous domains. Given our specific focus on optimizing hyperparameters for MLP neural networks, the utilization of real coding and its associated genetic operators becomes a logical choice. In our study, we have not employed binary (ordinary) genetic algorithms for this purpose. One of the primary advantages of using real-value encoding over binary encoding is the enhanced precision it offers. Binary coding, when applied to real-valued numbers, can lead to a loss of precision, particularly when a limited number of bits is used to represent each value [68].

Furthermore, real-value encoding results in notably shorter chromosome strings. This efficiency is particularly significant when optimizing hyperparameters for MLP networks. Our aim is to find an optimal set of hyperparameters that enables the MLP neural network to perform with high accuracy and minimal complexity. In this context, a chromosome or genotype encompasses all the hyperparameters, with each gene representing an individual hyperparameter value.
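To make this encoding concrete, the following minimal sketch decodes a five-gene real-valued chromosome into a hyperparameter dictionary. It is illustrative only and is not the implementation used in the experiments: the gene order, the assumption that every gene lies in [0, 1], the bounds `n_min`, `n_max`, and `alpha_max`, and the key name `n_hidden` are all hypothetical. The option lists follow the values accepted by scikit-learn's `MLPClassifier`.

```python
# Hypothetical gene layout for illustration; the actual gene locations
# are those listed in Table 2 of the paper.
ACTIVATIONS = ["identity", "logistic", "tanh", "relu"]
SOLVERS = ["lbfgs", "sgd", "adam"]
LEARNING_RATES = ["constant", "invscaling", "adaptive"]


def pick(options, gene):
    """Map a gene in [0, 1] onto one entry of a list of categorical options."""
    index = min(int(gene * len(options)), len(options) - 1)
    return options[index]


def decode(chromosome, n_min=1, n_max=20, alpha_max=2.0):
    """Turn five real-valued genes in [0, 1] into a hyperparameter dict."""
    g_size, g_act, g_solver, g_alpha, g_lr = chromosome
    return {
        "n_hidden": n_min + round(g_size * (n_max - n_min)),  # structure gene
        "activation": pick(ACTIVATIONS, g_act),
        "solver": pick(SOLVERS, g_solver),
        "alpha": g_alpha * alpha_max,  # regularization strength
        "learning_rate": pick(LEARNING_RATES, g_lr),
    }


params = decode([0.4, 0.99, 0.0, 0.25, 1.0])
print(params)
```

Categorical hyperparameters are handled here by slicing the gene's interval into equal bins, which keeps the whole chromosome real-valued so that the same crossover and mutation operators apply to every gene.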

Below, we present the chromosome representations along with the developed selection, crossover, and mutation operators tailored to this encoding.

4.1 MLPRGA+5: Representation and Initialisation

An individual, or chromosome, is a vector of genes represented by floating-point values, and each gene represents one decision variable. Each chromosome is a randomly generated candidate solution drawn from a predetermined bounded solution space (the feasible region). The MLPRGA+5 optimization process involves two key sets of decision variables: the model structure, which is a major determinant of performance and can take on values between \(\textrm{n}_{\textrm{i} \min }\) and \(\textrm{n}_{\textrm{i} \max }\), and the model hyperparameters, which also influence performance and can range between \(n_{j \min }\) and \(n_{j \max }\). A typical chromosome structure for optimizing model accuracy and cost under constraints is illustrated in Fig. 4.

Fig. 4 Chromosome structure

The initial population is generated at random, with the weights given random values in [0, 1]. It contains all the individuals that will be subjected to the genetic operators and evaluated by the fitness function. After the initial population is created, each individual is evaluated and assigned a fitness value based on the fitness function.
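A minimal sketch of random initialisation within per-gene bounds is given below. The bounds, the population size of 30 (taken from the experiments reported later), and the fixed seed are assumptions made for the example.

```python
import random


def init_population(pop_size, lower, upper, seed=None):
    """Create pop_size chromosomes, each gene drawn uniformly within its bounds."""
    rng = random.Random(seed)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(pop_size)
    ]


# Hypothetical bounds: five genes, each restricted to [0, 1].
lower = [0.0] * 5
upper = [1.0] * 5
population = init_population(30, lower, upper, seed=0)
print(len(population), len(population[0]))
```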

4.2 MLPRGA+5: Fitness and Selection

A fitness function produces a statistic that measures how well an individual is adapted to its environment. This statistic steers the search toward the characteristics that make an individual better adapted, that is, better at performing the task. In our study, fitness is given by the accuracy rate on the classification task, where accuracy is the proportion of all instances that are correctly classified. The fitness calculation method we propose is as follows:

$$\begin{aligned}{} & {} \textrm{F}(\textrm{i})=\text { Fitness }(\textrm{i})=\text { Accuracy }(\textrm{i}) \end{aligned}$$
(17)
$$\begin{aligned}{} & {} \text { Accuracy }=\frac{N_T}{N_d} \end{aligned}$$
(18)

Where \(\left( N_T\right) \) and \(\left( N_d\right) \) stand for, respectively, the number of examples that were successfully classified and the total number of examples in a dataset.
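Equations (17) and (18) amount to a simple count, sketched below. The label lists are made-up values used only to illustrate the calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label (Eq. 18)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))  # N_T
    return n_correct / len(y_true)                           # N_T / N_d


# Three of the four illustrative predictions are correct.
fitness = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
print(fitness)
```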

In this study, we employ the two-individual tournament selection technique, as shown in Fig. 5. Two individuals are selected at random from the population, and their fitness values are compared to determine which has the higher value. This procedure is repeated for a second random pick of two individuals, and a second winner is chosen. The two winners are then passed to the crossover operator. We also use an elitist approach, which ensures that the best individuals are always passed on to the next generation unchanged.

Fig. 5 Tournament selection
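The selection scheme described above can be sketched as follows. This is a hedged illustration: drawing the two contestants without replacement, breaking ties in favour of the first contestant, and keeping a single elite individual are assumptions, since the text does not specify them.

```python
import random


def tournament(population, fitnesses, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    i, j = rng.sample(range(len(population)), 2)
    return population[i] if fitnesses[i] >= fitnesses[j] else population[j]


def select_parents(population, fitnesses, rng):
    """Run two independent tournaments; the two winners go to crossover."""
    first = tournament(population, fitnesses, rng)
    second = tournament(population, fitnesses, rng)
    return first, second


def elite(population, fitnesses, n_elite=1):
    """Return copies of the n_elite fittest individuals, best first."""
    order = sorted(range(len(population)), key=lambda k: fitnesses[k], reverse=True)
    return [list(population[k]) for k in order[:n_elite]]


rng = random.Random(1)
pop = [[0.1], [0.4], [0.7], [0.9]]
fit = [0.60, 0.75, 0.90, 0.85]
parent_a, parent_b = select_parents(pop, fit, rng)
best = elite(pop, fit)
print(parent_a, parent_b, best)
```

Because each tournament compares two distinct individuals, the least fit member of the population can never win, while every other member keeps a non-zero chance of being selected.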

4.3 MLPRGA+5: Reproduction Operators

Crossover is carried out using simulated binary crossover [29]. The polynomial mutation is used for mutation [32]. The following is a discussion of both of these procedures:

Simulated binary crossover: a genetic operator that creates two offspring solutions from two-parent solutions by simulating the binary crossover observed in nature. The procedure involves three steps:

Step 1: a uniform random number \(U \in (0, 1)\) is chosen.

Step 2: a spreading factor \(\beta _{\textrm{q}}\) is calculated using the following equation:

$$\begin{aligned} \beta _{\textrm{q}}= {\left\{ \begin{array}{ll}(2 \textrm{U})^{\frac{1}{\eta +1}} &{} \text { if } \textrm{U} \le 0.5, \\ \left( \frac{1}{(2(1-\textrm{U}))}\right) ^{\frac{1}{\eta +1}} &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(19)

The ordinate \(\beta _{\textrm{q}}\) is found such that the area under the probability curve from 0 to \(\beta _{\textrm{q}}\) equals the selected random number U. The probability distribution used to produce a child solution is chosen so that its search power is similar to that of single-point crossover in binary-coded GAs:

$$\begin{aligned} p(\beta )= {\left\{ \begin{array}{ll}0.5(\eta +1) \beta ^{\eta } &{} \text { if } \beta \le 1, \\ 0.5(\eta +1) \frac{1}{\beta ^{\eta +2}} &{} \text { otherwise }\end{array}\right. } \end{aligned}$$
(20)

The parameter \(\eta \) in the simulated binary crossover procedure is the distribution index. This index is a non-negative real number that influences the probability of creating near-parent solutions. When the value of \(\eta \) is large, the probability of creating solutions that are similar to the parents is higher. In contrast, when \(\eta \) has smaller values, the probability of creating solutions that are distant from the parents is higher, resulting in a greater exploration of the search space.
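The link between the spreading factor and the probability distribution can be made explicit by inverting the cumulative distribution at U. The short derivation below is a sketch that takes the standard SBX density, \(0.5(\eta +1)\beta ^{\eta }\) for \(\beta \le 1\) and \(0.5(\eta +1)\beta ^{-(\eta +2)}\) otherwise, and recovers both branches of Eq. (19):

$$\begin{aligned} \textrm{U} \le 0.5:&\quad \int _0^{\beta _{\textrm{q}}} 0.5(\eta +1) \beta ^{\eta }\, d\beta =0.5\, \beta _{\textrm{q}}^{\eta +1}=\textrm{U} \;\Rightarrow \; \beta _{\textrm{q}}=(2 \textrm{U})^{\frac{1}{\eta +1}}, \\ \textrm{U}>0.5:&\quad 0.5+\int _1^{\beta _{\textrm{q}}} 0.5(\eta +1) \beta ^{-(\eta +2)}\, d\beta =1-0.5\, \beta _{\textrm{q}}^{-(\eta +1)}=\textrm{U} \;\Rightarrow \; \beta _{\textrm{q}}=\left( \frac{1}{2(1-\textrm{U})}\right) ^{\frac{1}{\eta +1}}. \end{aligned}$$

Half of the probability mass lies on each side of \(\beta =1\), which is why the case split occurs at \(\textrm{U}=0.5\).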

Step 3: from the parent solutions \(x_i^{(1, t)}, x_i^{(2, t)}\), two child solutions, \(x_i^{(1, t+1)}\) and \(x_i^{(2, t+1)}\), can then be calculated using Equation (21).

$$\begin{aligned} \begin{aligned}&\textrm{x}_{\textrm{i}}^{(1, t+1)}=0.5\left[ \left( 1+\beta _{\textrm{q}}\right) \textrm{x}_{\textrm{i}}^{(1, t)}+\left( 1-\beta _{\textrm{q}}\right) \textrm{x}_{\textrm{i}}^{(2, t)}\right] \\&\textrm{x}_{\textrm{i}}^{(2, t+1)}=0.5\left[ \left( 1-\beta _{\textrm{q}}\right) \textrm{x}_{\textrm{i}}^{(1, t)}+\left( 1+\beta _{\textrm{q}}\right) \textrm{x}_{\textrm{i}}^{(2, t)}\right] \end{aligned} \end{aligned}$$
(21)
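The three steps of simulated binary crossover can be sketched in a few lines. This is an illustrative sketch and not the code used in the experiments: the distribution index of 15, the parent values, and the choice to draw a fresh U for every gene are assumptions, and the children are not clipped to variable bounds here.

```python
import random


def sbx_spread(u, eta):
    """Spreading factor beta_q of Eq. (19) for a uniform draw u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta + 1.0))
    return (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))


def sbx_crossover(parent1, parent2, eta, rng):
    """Build two children gene by gene from the standard SBX child formulas."""
    child1, child2 = [], []
    for x1, x2 in zip(parent1, parent2):
        beta = sbx_spread(rng.random(), eta)  # Steps 1 and 2
        # Step 3: the children are symmetric about the parents' midpoint.
        child1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        child2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return child1, child2


rng = random.Random(42)
p1, p2 = [0.2, 0.8], [0.6, 0.4]
c1, c2 = sbx_crossover(p1, p2, eta=15.0, rng=rng)
print(c1, c2)
```

A useful sanity check is that each pair of child genes has the same mean as the corresponding pair of parent genes, and that a draw of U = 0.5 gives a spreading factor of 1, so the children coincide with the parents.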
Fig. 6 Simulated binary crossover

Fig. 7 Polynomial mutation

Polynomial Mutation: The objective of the mutation operation is to reintroduce unexpected genetic material into the population in order to avoid becoming trapped in locally optimal solutions, thereby improving population diversity and exploratory potential. It operates as follows:

$$\begin{aligned} \textrm{p}^{\prime }=\left\{ \begin{array}{ll} p+\overline{\delta _L}\left( p-x_i^{(L)}\right) &{} \text { for } \textrm{U} \le 0.5 \\ p+\overline{\delta _R}\left( x_i^{(R)}-p\right) &{} \text { for } \textrm{U}>0.5 \end{array}\right. \end{aligned}$$
(22)

where \(\textrm{p}^{\prime }\) is the child solution, p is the parent solution, and \(x_i^{(L)}\) and \(x_i^{(R)}\) are the lower and upper limits of the parent component, respectively. The parameters \(\left( \overline{\delta _L}, \overline{\delta _R}\right) \) are functions of \(\textrm{U}\) and \(\eta \) and are calculated as follows:

$$\begin{aligned} \begin{array}{lll} \overline{\delta _L}=(2 U)^{1 /(1+\eta )}-1 &{} \quad \text {for } U \le 0.5 \\ \overline{\delta _R}=1-(2(1-U))^{1 /(1+\eta )}, &{}\quad \text { for } \quad U>0.5 \end{array} \end{aligned}$$
(23)

Here, \(\eta \) is the mutation distribution index and U is a uniform random number in (0, 1). Figures 6 and 7 illustrate how simulated binary crossover and polynomial mutation function, respectively.
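Equations (22) and (23) translate directly into code. The sketch below mutates a single gene. The distribution index of 20, the bounds [0, 1], and the parent value 0.3 are assumptions chosen for illustration.

```python
import random


def polynomial_mutation(p, lower, upper, eta, rng):
    """Mutate one gene p within [lower, upper] following Eqs. (22) and (23)."""
    u = rng.random()
    if u <= 0.5:
        # delta_L lies in [-1, 0]: the gene moves toward the lower limit.
        delta_l = (2.0 * u) ** (1.0 / (1.0 + eta)) - 1.0
        return p + delta_l * (p - lower)
    # delta_R lies in (0, 1): the gene moves toward the upper limit.
    delta_r = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (1.0 + eta))
    return p + delta_r * (upper - p)


rng = random.Random(3)
mutated = [polynomial_mutation(0.3, 0.0, 1.0, eta=20.0, rng=rng) for _ in range(5)]
print(mutated)
```

Because the perturbation is scaled by the distance to the nearer bound in the direction of travel, a gene that starts inside its bounds stays inside them, so no separate repair step is needed for this operator.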

In this research, we harness the power of real-coded genetic algorithms (GAs) to optimize hyperparameters within the context of neural network configuration. The choice of real-coded GAs is underpinned by their distinct advantages, which make them a compelling choice for this intricate task. Unlike binary encoding or other discretization methods, real-coded GAs offer a seamless representation of the real-valued hyperparameters inherent in neural network architecture.

In conjunction with real-coded GAs, we employ a sophisticated selection method known as two-individual tournament selection. This technique introduces competition among individuals, enhancing the evolutionary process. Two individuals are randomly selected from the population, and their fitness function values are compared to determine the one with the highest fitness. This selection procedure is repeated for a second pair of randomly chosen individuals, and a winner is again selected. The chosen victors are then passed to the crossover operator. The utilization of tournament selection adds an element of diversity and competition to the selection process, allowing the algorithm to efficiently explore the solution space and maintain genetic diversity. Additionally, we integrate an elitist approach into our selection process, ensuring that the best individuals are always preserved and carried forward unchanged, which is critical in preventing the loss of highly fit solutions.

Complementing the power of real-coded GAs and tournament selection, we employ sophisticated genetic operators such as Simulated Binary Crossover (SBX) and Polynomial Mutation to further enhance the optimization process. SBX, a powerful crossover operator, is designed to balance exploration and exploitation. By simulating binary-like crossover for real-valued variables, SBX effectively maintains diversity in the population, thereby preventing premature convergence and efficiently exploring the search space.

In addition to SBX, we utilize Polynomial Mutation to introduce random perturbations to variable values. This mutation operator adds essential diversity to the population, mitigating the risk of falling into local optima and ensuring a comprehensive exploration of the solution space. The selection of real-coded GAs, SBX, and Polynomial Mutation is underpinned by their unique qualities, making them an ideal choice for classification tasks. Their adaptability and flexibility empower the simultaneous optimization of multiple hyperparameters within a unified framework, offering a comprehensive approach to hyperparameter tuning. This adaptability is vital in addressing the intricate interdependencies between hyperparameters in neural network configurations. Furthermore, the ability to strike a balance between exploration and exploitation efficiently guides the search process through the hyperparameter space, yielding optimized configurations tailored to classification tasks. Figure 8 presents the general flowchart methodology for the proposed process.

Fig. 8 Optimization of MLP hyper-parameters using RCGA. The hyper-parameters for neural architecture and activation function are encoded as a chromosome vector to be optimized by GA. Additionally, weight solver, regularization, and learning hyper-parameters are also optimized simultaneously

5 Experiments

The results of the model proposed in Section 4 are reported in this section. The model was implemented in Google Colab using TensorFlow 2.0+ (which includes the Keras deep learning framework) and the most recent versions of scikit-learn, NumPy, and DEAP. The k-fold cross-validation technique is used in the experiments that evaluate the proposed hyperparameter optimization model. The results reported for each fold were generated using a separate test set for that fold, which was not used for classifier training. The standard RCGA used to optimize the hyperparameters of the MLP neural network is given in Algorithm 1, and the corresponding flow chart is shown in Fig. 9.

Algorithm 1 Hyperparameter Optimization using Real-Coded Genetic Algorithm

Fig. 9 The crucial steps in optimizing hyperparameters
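The overall loop can be sketched end to end as follows. This is a simplified, self-contained illustration and not a reproduction of Algorithm 1: it uses a toy fitness function in place of the cross-validated MLP accuracy, a single elite individual, genes bounded in [0, 1], and a distribution index of 15, all of which are assumptions. The population size, crossover probability, and mutation probability echo the values reported for the wine experiment.

```python
import random


def sbx(p1, p2, eta, rng):
    """Simulated binary crossover applied gene by gene."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = rng.random()
        if u <= 0.5:
            beta = (2 * u) ** (1 / (eta + 1))
        else:
            beta = (1 / (2 * (1 - u))) ** (1 / (eta + 1))
        c1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        c2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return c1, c2


def mutate(ind, lo, hi, eta, p_mut, rng):
    """Clip each gene to its bounds, then apply polynomial mutation with probability p_mut."""
    out = []
    for x in ind:
        x = min(max(x, lo), hi)
        if rng.random() < p_mut:
            u = rng.random()
            if u <= 0.5:
                x += ((2 * u) ** (1 / (1 + eta)) - 1) * (x - lo)
            else:
                x += (1 - (2 * (1 - u)) ** (1 / (1 + eta))) * (hi - x)
        out.append(min(max(x, lo), hi))
    return out


def tournament(pop, fit, rng):
    """Two-individual tournament: return the fitter of two random individuals."""
    i, j = rng.sample(range(len(pop)), 2)
    return pop[i] if fit[i] >= fit[j] else pop[j]


def run_ga(fitness, n_genes, pop_size=30, generations=20, p_cx=0.9,
           p_mut=0.3, eta=15.0, lo=0.0, hi=1.0, seed=0):
    """Generational RCGA with elitism; returns the best individual and the best-fitness history."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fit = [fitness(ind) for ind in pop]
        best = max(range(pop_size), key=lambda k: fit[k])
        history.append(fit[best])
        new_pop = [list(pop[best])]  # elitism: the best individual survives unchanged
        while len(new_pop) < pop_size:
            a, b = tournament(pop, fit, rng), tournament(pop, fit, rng)
            if rng.random() < p_cx:
                a, b = sbx(a, b, eta, rng)
            new_pop.append(mutate(a, lo, hi, eta, p_mut, rng))
            if len(new_pop) < pop_size:
                new_pop.append(mutate(b, lo, hi, eta, p_mut, rng))
        pop = new_pop
    fit = [fitness(ind) for ind in pop]
    best = max(range(pop_size), key=lambda k: fit[k])
    history.append(fit[best])
    return pop[best], history


def toy_fitness(ind):
    """Stand-in for cross-validated accuracy: highest when every gene equals 0.5."""
    return 1.0 - sum((x - 0.5) ** 2 for x in ind) / len(ind)


best_individual, history = run_ga(toy_fitness, n_genes=5)
print(history[0], history[-1])
```

In the setting of this paper, `toy_fitness` would be replaced by a function that decodes the chromosome, trains the MLP, and returns its cross-validated accuracy. Elitism guarantees that the best fitness recorded in `history` never decreases from one generation to the next.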

5.1 Data Description and Preprocessing

A series of tests was carried out using well-known benchmark datasets from the University of California Irvine (UCI) machine learning repository [69] in order to assess the performance of our proposed methodology. We specifically used four classification problems: the hypothyroid dataset, the iris dataset created by Fisher, the Wisconsin Breast Cancer dataset, and the wine dataset.

Table 1 below provides a summary of the datasets utilized in this work, including details on each dataset’s number of samples, attributes, and target classes.

Table 1 Description of the used databases

Wine: The dataset includes 178 examples with three target classes and 13 real-valued attributes. The classes correspond to the results of chemical analyses of wines produced in the same Italian region but from three different cultivars, with 59, 71, and 48 examples, respectively. The chemical analysis quantified 13 constituents found in each type of wine.

Hypothyroid: The Garvan Institute in Sydney, Australia provided the dataset for this study, which was then submitted to UC Irvine’s Discovery in Databases. The data form part of a broader collection of datasets; this study uses the “hypothyroid” dataset, which includes 3163 instances and 24 hypothyroidism-related variables. There are numerous missing values in the dataset.

Iris: The collection contains 150 iris flower samples, 50 from each of the three target classes (Iris Setosa, Iris Virginica, and Iris Versicolor). Each sample has four real-valued attributes (sepal length, sepal width, petal length, and petal width), which describe the morphological traits of each iris species.

Breast Cancer: The Wisconsin dataset consists of 699 cases, 458 of which are classified as benign and 241 as malignant. The goal is to categorize a breast tumor as benign or malignant, a two-class classification task based on several properties of the cell nuclei, and the dataset has no missing values. The first 349 patterns serve as a training set for optimizing the neural network weights, while the remaining 350 serve as a test set.

We applied data preparation procedures to the hypothyroid and breast cancer datasets to ensure the correctness and dependability of our findings. Data preprocessing includes converting raw data into a practical format as well as carrying out cleaning and formatting procedures to remove any discrepancies or flaws that could influence our results. We handled unbalanced data, where there were significantly more negative samples than positive samples, converted non-numerical labels to numerical ones, reduced noisy data, and handled missing data by filling in attribute means (as indicated in the table). To maintain uniformity across all datasets, we also standardized the data, applying the same formula to each attribute individually.

Scaled Value \(=\frac{X-{\bar{u}}}{S}\)

where X: a sample, \({{\bar{u}}}\): mean of training samples for a single feature, S: a standard deviation of training samples for a single feature.
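The scaling formula can be sketched as follows for a single feature. The numeric values are made up for illustration, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which is meant.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Mean and standard deviation of one feature, computed from training samples only."""
    return mean(train_column), pstdev(train_column)


def scale(column, mu, sigma):
    """Apply Scaled Value = (X - mean) / S to every value in a column."""
    return [(x - mu) / sigma for x in column]


# Illustrative values for one feature.
train = [4.9, 5.1, 5.8, 6.4, 7.0]
test = [5.5, 6.1]

mu, sigma = fit_scaler(train)
train_scaled = scale(train, mu, sigma)
test_scaled = scale(test, mu, sigma)  # the test data reuse the training statistics
print(train_scaled, test_scaled)
```

Reusing the training mean and standard deviation for the test samples keeps information from the held-out data out of the preprocessing step.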

5.2 Parameters Setting of MLP and RCGAs

For classification tasks in our work, we used a Multilayer Perceptron (MLP) classifier. To enhance its performance, we used evolutionary algorithms to find the design that attains the highest accuracy. The MLP classifier was trained with backpropagation. The network was initialized with random weights drawn from a uniform distribution between 0 and 1.

A fivefold cross-validation method was used to assess how well the MLP will perform on new data. This was important because the datasets contain relatively few cases, and training with 90% of them might have resulted in overfitting. Cross-validation therefore helped to reduce the risk of overfitting while allowing for a more accurate evaluation of the model’s generalization ability. Table 2 shows the default MLP classifier parameters along with the hyper-parameters to be optimized by the GA, their ranges of permitted values, and their gene locations.
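The cross-validation procedure can be sketched without any library dependencies. This is a hedged illustration: the shuffling seed, the unstratified split, and the callback `train_and_score` (a hypothetical function that would train the MLP on the training indices and return its accuracy on the held-out indices) are assumptions. The sample count of 178 matches the wine dataset.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_score_mean(train_and_score, n_samples, k=5, seed=0):
    """Average the score returned by train_and_score over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(n_samples, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(n_samples) if i not in held]
        scores.append(train_and_score(train_idx, held_out))
    return sum(scores) / k


folds = k_fold_indices(178, 5)
print([len(f) for f in folds])


def dummy_scorer(train_idx, test_idx):
    """Placeholder scorer that only checks the two index sets are disjoint."""
    return 1.0 if not set(train_idx) & set(test_idx) else 0.0


mean_score = cross_val_score_mean(dummy_scorer, 178, k=5)
print(mean_score)
```

Each sample appears in exactly one held-out fold, so every fold's score is computed on data that were not used to train the classifier for that fold.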

Table 2 Hyper-parameters to be optimized by GA, the default parameters of the MLP classifier, and the range of permitted values

The suggested parameter settings for the RCGAs when used with various datasets are presented in Table 3. To get the best results for each dataset, particular parameter settings might be necessary.

Table 3 Suggested parameter settings for the RCGAs

5.3 Results and Discussion

Table 4 shows the performance of the genetic algorithm on the wine dataset across the first six generations. The algorithm used fivefold cross-validation with a population size of 30, a simulated binary crossover probability of 0.9, and a polynomial mutation probability of 0.3. The population’s maximum and average fitness rose over these generations, and the minimum fitness also showed an overall improvement. The standard deviation of the fitness values dropped, indicating a tighter grouping around the mean. These results indicate that the genetic algorithm progressively improves population fitness and discards poor solutions.

Table 4 Performance of the genetic algorithm applied to the wine dataset

Table 5 shows the results of the genetic algorithm on the hypothyroid dataset. The algorithm used fivefold cross-validation with the same population size as for the wine dataset (30), a simulated binary crossover probability of 0.8, and a polynomial mutation probability of 0.4. The maximum and average fitness values improved substantially across the generations, showing that the algorithm found better solutions for this dataset. The standard deviation of the fitness values dropped, indicating a decline in fitness diversity within the population. These findings show that the genetic algorithm improves population fitness and converges toward the best solutions for the hypothyroid dataset.

Table 5 Performance of the genetic algorithm applied to the hypothyroid dataset

The results of the genetic algorithm on the iris dataset are shown in Table 6. With a population size of 30, a simulated binary crossover probability of 0.9, and a polynomial mutation probability of 0.5, the algorithm conducted three cross-validations. Across the generations, the population’s maximum and average fitness increased, showing that the genetic algorithm improved population fitness and converged toward the best solutions for the iris dataset.

Table 6 Performance of the genetic algorithm applied to the iris dataset

Similarly, Table 7 shows how the results of the genetic algorithm on the breast cancer dataset improved over the first 16 generations. With a population size of 30, a simulated binary crossover probability of 0.9, and a polynomial mutation probability of 0.5, the algorithm conducted three cross-validations. The population’s maximum and average fitness values grew, indicating that the genetic algorithm was successful in enhancing population fitness and converging toward the best solutions for the cancer dataset. Overall, our findings demonstrate the value of the genetic algorithm as an optimization tool for improving population fitness across a range of datasets.

Table 7 Performance of the genetic algorithm applied to the cancer dataset

Overall, the best neural network configuration differed across the four datasets. For the Wine dataset, a neural network model with six hidden layers, the relu activation function, the lbfgs solver, an alpha value of 0.30, and an adaptive learning rate produced the best results. This model had a 96.62% accuracy rate. Neural network models are thus capable of performing these classification tasks, but the best hyperparameters depend on the particular dataset.

The neural network model with the best result for the Hypothyroid dataset featured 9 hidden layers, employed the relu activation function, the lbfgs solver, had an alpha value of 1.87, and had a constant learning rate. This model had a 98.67% accuracy rate.

For the Iris dataset, the best solution was achieved with a neural network model that had 13 hidden layers with 3 output neurons, using the ’tanh’ activation function, ’lbfgs’ solver, an alpha value of 0.48, and an invscaling learning rate. This model achieved an accuracy of 98%.

Finally, for the Breast Cancer dataset, the best solution was achieved with a neural network model that had 10 hidden layers, used the ’relu’ activation function, ’lbfgs’ solver, an alpha value of 1.59, and an invscaling learning rate. This model achieved an accuracy of 96.48%.

The most accurate hyperparameters identified for each dataset are listed in Table 8, along with the MLP classifier’s related accuracies.

Table 8 Optimal configuration of hyperparameters applied to each dataset

We compare the performance of our proposed model with that of several earlier studies that used various classification algorithms and methods. Table 9 presents this comparison. Across the four datasets (wine, thyroid, iris, and breast cancer), our proposed approach achieved accuracy rates that are competitive with, and in several cases higher than, those of earlier studies. These outcomes show how well the proposed model classifies different kinds of data.

Table 9 Accuracy comparison of MLPRGA+5 with several MLP-based approaches

In summary, the suggested strategy performs better than other methods on a variety of datasets, starting with the Wine dataset:

The accuracy of the Wine BP model, which uses a backpropagation neural network with one hidden layer made up of 10 neurons, was 95.0%. This shows that the model was successful in learning from and accurately classifying the wine data. With the identical design of 1 hidden layer and 10 neurons, the GA Weights model, which uses a genetic algorithm, achieved an accuracy of 96.00%, which is just a little bit higher. This implies that using a genetic algorithm optimization strategy helped the model perform better on the Wine dataset. With the same 1 hidden layer and 10 neurons architecture, the GABP model, which combines the genetic algorithm with backpropagation approaches, also managed to reach an accuracy of 96.00%. This suggests that when compared to using backpropagation alone, the combined strategy produced better outcomes. The MLP-GA model, which included 2 hidden layers with 13 neurons and a genetic algorithm for weight optimization, had a lower accuracy of 77.45%. This outcome shows that the optimization strategy and architecture used might not have been appropriate for this particular dataset. The accuracy of the P.method model, which had hyper-parameters of 1 hidden layer and 6 neurons, was 96.00%, which is comparable to that of the GA Weights and GABP models. This shows that the selected configuration of the hyper-parameters was successful in achieving high accuracy for the Wine dataset.

The Hypothyroid MMLP model, which makes use of a multi-layer perceptron with a feature scaling configuration of 1 high layer and 10 normal layers, attained an outstanding accuracy of 98.80%. This shows that the model learned from and accurately classified the hypothyroid data. By comparison, the accuracy of the MLNN-LM models with 50 neurons and 3 and 10 fully connected layers was 92.96% and 93.19%, respectively. Adding more layers therefore did not significantly increase accuracy, indicating that the added complexity brought little benefit on this dataset. The MLP-GA model, which used genetic algorithm optimization with a range of 4 to 20 neurons, attained a strong accuracy of 95%. This shows that the weights of the model were successfully tuned by the genetic algorithm approach, leading to enhanced classification performance. The accuracy was 85.5% for the MLP-PSO model, which used particle swarm optimization with the same range of 4–20 neurons. This suggests that the genetic algorithm may have performed better than particle swarm optimization for this particular task. The P.method model had an accuracy of 98.67% with hyper-parameters set at 1 high layer and 9 normal layers. This outcome shows that the selected hyper-parameter configuration yields high accuracy on the Hypothyroid dataset.

Fig. 10 Accuracy of different methods applied to wine dataset

Fig. 11 Accuracy of different methods applied to hypothyroid dataset

Fig. 12 Accuracy of different methods applied to iris dataset

Fig. 13 Accuracy of different methods applied to breast cancer dataset

The Iris GE-BP model had a 96.6% accuracy and a 6.14% standard deviation for the Iris dataset. This shows that generally speaking, the model was successful in categorizing the many species of iris flowers. The relatively high standard deviation, however, points to some performance variation between various runs. The accuracy of the MLPGA+4 Hyperparameters model, on the other hand, was remarkable, coming in at 98.87% with a standard deviation of 11.30%. This model consistently produced good accuracy on the Iris dataset by combining Multi-Layer Perceptron with Genetic Algorithm optimization and four chosen hyperparameters. With three hidden layers and four normal layers, the MLP-GA Architecture model had an accuracy of 97.30%. This shows that a highly accurate classifier for the Iris dataset was produced using the selected architecture in combination with weight-tweaking genetic algorithm optimization.

The reported results for various methods in breast cancer classification are as follows: The ANN method’s outstanding accuracy of 99.4% was attained using a 15-neuron ANN. A 97.51% accuracy was achieved using SBS-BPPSO (10 CV). SBS-BPLM (10 CV) attained a 98.83% accuracy. The accuracy of GONN1 was 98.24%. The accuracy of GONN2 was 99.63%. The accuracy of MLP hyper-parameters utilizing 1 hidden layer and 8 neurons was 97.70%. Last but not least, an accuracy of 97.00% was attained utilizing P.method hyper-parameters with 1 hidden layer and 10 neurons.

The results show that the performance of MLPRGA+5 relative to the other approaches considered varies across datasets; Figs. 10, 11, 12, and 13 summarize the results obtained. Although it shows promise on some datasets, it might not work as well on others. Further research and experimentation are advised to improve the results on the datasets where it falls behind. This will aid in improving and optimizing the method, possibly producing better results.

6 Conclusion

This study introduces MLPRGA+5, a novel and highly effective approach to optimizing hyperparameters for Multilayer Perceptron (MLP) neural networks that leverages the Real-coded Genetic Algorithm (RCGA). Our work focuses on achieving an optimal balance between MLP accuracy and complexity. RCGA, along with specific operators like Simulated Binary Crossover (SBX) and Polynomial Mutation, offers a robust framework for navigating the intricate landscape of hyperparameters. The selection of RCGA is rooted in its ability to directly manage real-valued hyperparameters, aligning seamlessly with the continuous nature of these parameters. The synergy with SBX and Polynomial Mutation further empowers the algorithm to efficiently explore complex optimization spaces. This research extends the frontiers of hyperparameter tuning in MLP networks, emphasizing the broader utility of RCGAs in tackling intricate optimization challenges. By concurrently optimizing neural network architecture and hyperparameters, we provide a more efficient approach to classification tasks. Ultimately, our study underscores the potential of RCGAs as a valuable tool for machine learning and artificial intelligence, promoting advancements in deep learning and effective solutions to real-world problems.