KeepNMax: Keep N Maximum of Epoch-Channel Ensemble Method for Deep Learning Models

Computer vision (CV) applications are becoming a crucial factor for the growth of developed economies worldwide. The widespread use of CV applications has created a growing demand for accurate models. Therefore, the subfields of CV focus on improving existing models and developing new methods and algorithms to meet the demands of different sectors. Simultaneously, research on ensemble learning provides effective tools for increasing model accuracy. Nevertheless, there is a significant gap in research that exploits both data representation and model features. This led us to develop KeepNMax, an ensemble of image channels and epochs that keeps only the top N maximum prediction probabilities at the final step. Using KeepNMax, the ensemble error was reduced and the amount of data knowledge available to the ensemble model was increased. Nine datasets were trained. Because each dataset had three channels, the images were divided into three separate channels, and each channel was trained independently using the same model architecture. In addition, the datasets were trained without channel division using the same model architecture. After training, selected epochs were ensembled with the best epoch of the training. Furthermore, two different model architectures were used to check the model dependency of the proposed method, and remarkable results were achieved in both cases. The method is proposed for deep-learning classification models. Despite its simplicity, the proposed method improved the results of the CNN and ConvMixer models on the datasets used. Classic training, bootstrap aggregation, and random splits were used as baseline methods. For most datasets, significant improvements were obtained using KeepNMax. The success of the method is explained by the unique true prediction (UTP) scope of each model: by ensembling the models, the prediction scope is enlarged, allowing the ensemble to represent broader knowledge about the dataset than a single model.


I. INTRODUCTION
Computer vision has recently begun to grow rapidly, and deep-learning (DL) models have begun to attain better results than classic machine-learning algorithms [1]. To achieve state-of-the-art results, models are developed using various DL tools that push accuracy as high as possible while keeping prediction errors low. One of the conventional tools of DL is model ensembling. DL model ensembles have numerous combinations that can be used with other tools. Nevertheless, most ensembled models introduce prediction errors when model layers or predictions are concatenated. To address this problem, this research focused on a detailed explanation of the causes of ensemble errors and on the development of a new ensemble method that manipulates the prediction probabilities of the models and the features of the image channels. In this paper, an epoch-channel ensemble method with a fixed number of prediction probabilities is proposed for DL models. The proposed method comprises two steps. The first is the epoch ensemble studied in [2], and the other is the channels' maximum prediction probability ensemble [3]. Initially, the data were trained using a simple model consisting of convolutional and dense layers. To verify the expected behavior of the method, the datasets were also trained using ConvMixer [4]; checking the architecture dependency of the proposed method was the objective of this additional training. The proposed method yielded better results with the ConvMixer architecture. Because the training was conducted on nine different datasets with different numbers of classes, the proposed method was tested on each dataset with different training specifications, while the general skeleton of the method remained the core of the training and the obtained results. For instance, KeepNMax with the animal faces dataset initially divides the dataset into three different channels (channels 1, 2, and 3) that are used for further training.
In the next step, the original dataset was trained with the model presented in Fig. 2, and the predictions from the model were saved for further use in the ensemble. Then, channel 1 was trained using the model employed for the original dataset. The training of channel 1 was divided into five intervals, and the best epoch in each interval was selected for an ensemble at the end of training. The training process for channel 2 was identical to that for channel 1: five training intervals returned the best epochs with regard to accuracy. In the next step, the best epochs were ensembled and saved for further ensembles. Channel 3 of the dataset was trained under the same conditions as channels 1 and 2, and the epoch ensemble of channel 3 from the five training intervals was saved for the final ensemble. Throughout this training, the highest-accuracy epoch was saved to compare and select ensemble members during the implementation of the method. The abundance of knowledge in the original dataset ensures the main model's success in evaluation and its advantage over the models trained with the individual image channels. Considering this, the top three prediction probabilities of the models trained with separate image channels were added to the corresponding prediction probability classes of the main model (trained with the original dataset). This increases the prediction scope of the final ensemble by adding more insights while losing less of the main model's knowledge. Details regarding this method are presented in Section III. The results of implementing KeepNMax indicated increased classification accuracies for all datasets. The motivation for using the epoch ensemble as part of the proposed method was to share the prediction knowledge of certain epochs with the main model. This can be explained by the differences in the prediction scopes of the epochs, where the prediction scope comprises the true predictions of a model or epoch. Even though the prediction accuracies of some epochs are lower than that of the main model, in some cases they can predict images that the main model cannot. The same reasoning motivated the input image channel ensemble. For separately trained image channels, there are images that only one channel model (the model trained with that channel) predicts accurately. If the true predictions of each channel are considered as the prediction scope of that channel, the prediction scopes of different image channels contain unique true predictions that belong to only one channel. This was the main point that motivated the channel ensemble. Only the N maximum prediction probabilities of the secondary models were used in order to reduce the error that would otherwise move the prediction probability away from the ground-truth image class. By using the existing information in an optimal and effective way (KeepNMax), the results of the main model were improved. A minimal sketch of the channel-separation step appears below.
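To make the channel-separation step concrete, the following minimal sketch (file names and paths are illustrative, not from the paper) splits an RGB image into three grayscale channel images with NumPy and Pillow:

```python
import numpy as np
from PIL import Image

def split_channels(rgb_path):
    """Split one RGB image into three single-channel grayscale images,
    mirroring the channel-separation step of KeepNMax."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"))
    # Each of the three channels becomes an independent grayscale image.
    return [Image.fromarray(rgb[:, :, c]) for c in range(3)]

# Hypothetical usage: build the channel-1/2/3 datasets on disk.
# for c, img in enumerate(split_channels("cat.jpg"), start=1):
#     img.save(f"channel_{c}/cat.png")
```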

II. RELATED WORK
The high demand for DL products has led to the rapid development of CV. Previous research has significantly influenced the development of CV with DL, and studies have been performed on diverse forms of ensemble learning and their application to various fields. In [5], a DL-based multi-model ensemble was used to predict neuromuscular disorders, and in [6], cancer prediction using a DL ensemble was studied. Additionally, Covid-19 infection has been predicted by applying an ensemble to CT images. DL ensembles can also be applied to the financial market and risk assessment. In [7], researchers focused on credit risk assessment using a six-stage neural-network ensemble. In [8], default risk analysis was the major focus, and three different DL ensembles were used to evaluate the default payment date. Oil prices are also an attractive topic for DL models. In [9], research was conducted on crude oil price prediction using a DL ensemble; by employing stacked denoising autoencoders and bootstrap aggregation, the authors combined the merits of the two techniques, which are suitable for oil price prediction. A similar method was used in a previous study [10], which focused on image patches instead of image channels. The majority of the research on this topic has focused on ensembles of models. Other forms of ensembles are bootstrapping and cross-validation methods, which were studied in [11]; to predict the ensemble generalization error reliably, cross-validation and bootstrapping were applied to generate random samples from the initial dataset. A crucial application of the DL ensemble was reported in [12], where the boosting ensemble method was used to predict heart disease accurately from a high-dimensional dataset. The researchers applied feature fusion, combined data from sensors and electronic medical records, and selected relevant and important data for further training with the DL ensemble model. Modern healthcare systems are expected to be among the main actors and consumers of DL products. For diabetic retinopathy detection, Qummar et al. [13] used the insights of pretrained ResNet-50, Inception-v3, Xception, Dense121, and Dense169 models, ensembling all of them to obtain the best result in the field. The proposed ensemble model can predict all the stages of diabetic retinopathy, in contrast to other models, and it achieved better results than them. The ensemble [14] of convolutional neural networks (CNNs) and deep residual neural networks is a powerful tool for hyperspectral image classification, providing competitive results compared with other state-of-the-art models. All the aforementioned models have advantages depending on the quality of the dataset. Common research problems for which studies should find solutions are imbalanced, complex, high-dimensional, and noisy data. Ensemble learning [15] is an important tool for solving these problems. To construct models effectively and achieve better results, various models have been developed, unifying data fusion, modeling, and mining; the final goal of all these models is knowledge discovery and better predictive results. To present ensemble models clearly, the existing ensemble learning methods have been divided into four groups: supervised ensemble classification, semi-supervised ensemble classification, clustering ensembles, and semi-supervised clustering ensembles. The main idea of supervised ensemble models is to generate classification results and integrate the final result using a voting system.
Bagging, or bootstrap aggregation [16], is a form of ensemble learning in which random samples from the training set are selected and trained with models, and the final results from the trained models are averaged. Using bagging, an accuracy of 92% was achieved for the animal faces dataset. However, this method does not include all the advanced tools of the DL ensemble for achieving near-perfect accuracy.
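As the paper gives no code for this baseline, the following is a minimal sketch of bootstrap aggregation; `train_fn` is a hypothetical caller-supplied function that fits a model on the given sample and returns an object with a `predict` method:

```python
import numpy as np

def bagging_predict(train_fn, X_train, y_train, X_test, n_models=5, seed=0):
    """Bootstrap-aggregation baseline: train each model on a sample drawn
    with replacement and average the predicted class probabilities."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        model = train_fn(X_train[idx], y_train[idx])            # caller-supplied trainer
        probs.append(model.predict(X_test))
    return np.mean(probs, axis=0)                               # averaged predictions
```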
AdaBoost [17] is a method similar to the bagging ensemble. It uses bags of data from the training set, and the model employs the training data as a test set to choose bags; the next bags are selected according to the predictions of previous models, with weighted data from the training set selected for the bags. Random forest [18] can achieve a higher accuracy; however, it is prone to overfitting, as it integrates the final voting results of multiple decision trees. Two other forms of supervised ensemble classifiers are random subspace and gradient boosting, which were applied in [19] and [20] to ensemble modeling of landslide susceptibility and anti-money-laundering recognition, respectively. Semi-supervised [15] ensemble methods have exhibited advantages for limited datasets, where they use unlabeled data to expand the training set. In the case of limited data, the semi-supervised ensemble model outperformed the other methods. An important part of semi-supervised models is labeling unlabeled data using a model pretrained on the existing truly labeled data; subsequently, a mixture of pseudo-labeled and truly labeled data is used to train the model. The general form of the model has been developed by numerous researchers and applied in many fields. Multi-label learning was proposed in [21] to improve the results of drug-target interaction tasks, which had previously been addressed with binary classification models; the researchers facilitated multi-label classification with a community detection method for drug-target interactions and achieved excellent results. Lu et al. [22] proposed an improved rotation forest algorithm called the semi-supervised rotation forest algorithm (SSRoF). They used semi-supervised local discriminant analysis as a feature rotation tool. This algorithm achieved better results because it gained knowledge of the discriminative information and local structural information of small labeled and unlabeled data samples. Furthermore, many studies have been conducted to obtain high-quality unlabeled data. Studies [23], [24], [25], [26] have focused on clustering ensembles and their implementation in other research fields. A clustering ensemble studies the joint achievement of clustering models; it generates numerous clustering partitions and combines them to obtain a better consensus function. Soares et al. [27] studied labeling data in the presence of overlapping classes in dense regions and proposed a cluster-based boosting algorithm with cluster regularization applied to multiclass classification. In addition, various studies have been conducted on semi-supervised clustering algorithms. One of them is incremental semi-supervised clustering for high-dimensional data [28], which utilizes the random subspace technique and a constraint propagation approach; this method also uses newly proposed local and global cost functions to select incremental ensemble members. The study in [31] presented the advantage of a DL ensemble using Faster R-CNN, TOOD, YOLOX, and Cascade with Swin Transformers for intestinal parasitic infection detection, achieving higher performance than each method independently. Another recent study [32] […]. Most DL-based ensemble models consist of mixtures of various models and focus mainly on the final results, ignoring the time and cost of training.
Whereas other DL ensemble models study final classification methods and functions or data representations, a gap was found: the ensemble of each channel and epoch significantly affects the accuracy of the final model. Another gap in existing models and methods concerns keeping the model or ensemble of models that participates with the maximum accuracy throughout the ensemble process.

III. PROPOSED METHODOLOGY
To address the research problem found during the literature review, the KeepNMax method was proposed, which allows the model to use the insights of the images better than classic training. As there was no alternative that ensembles epoch and image-channel knowledge simultaneously while avoiding a loss of insights, image channels were used as inputs. Instead of representing images only in RGB form, it was proposed to append the gains from the image channels and epochs of each model trained on a dataset consisting of single-channel images. Although the prediction scopes of ensemble members are lost in most DL ensemble methods, this study focused on recovering the part of the knowledge that is usually lost. In addition to the epoch and channel ensembles, the top N prediction ensemble was used. The top N predictions carry a clearer signal about the data than the predictions over all classes; therefore, the top N predictions contribute significantly to the increase in the ensemble's accuracy. Fig. 1 shows the architecture of the proposed method. The first step is to separate the original image channels. The next step starts with training the original dataset using a selected model. In the experiments, two model architectures were used, as shown in Figs. 2 and 3, to check the model dependency of KeepNMax. Subsequently, each channel was trained using the model that was used for the original dataset. To select the best epochs from the training intervals (a sketch of this selection is given below), the interval lists I_x specific to each of the models were predefined:

$$I_x = \begin{cases} l, & x = 1 \\ k, & x = 2 \\ p, & x = 3 \end{cases} \tag{1}$$

In Equation (1), x identifies the channel, and l, k, and p are the lists of interval lengths that define the training intervals from which the best models are selected for the ensemble of each channel. After training each channel of the dataset, the top N prediction ensemble method is applied to the models.
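The paper does not specify how the selected epochs are combined; the sketch below assumes the best epoch of each interval is chosen by validation accuracy and the selected epochs' test predictions are averaged (one plausible combination rule). `model`, `train_ds`, `val_ds`, and `x_test` are assumed Keras/tf.data objects:

```python
import numpy as np

def epoch_ensemble(model, train_ds, val_ds, x_test, intervals):
    """Train over the interval lengths in `intervals` (e.g. [80, 20]) and
    average the test predictions of the best-validation-accuracy epoch
    of each interval."""
    kept = []
    for length in intervals:
        best_acc, best_pred = -1.0, None
        for _ in range(length):
            hist = model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=0)
            acc = hist.history["val_accuracy"][-1]
            if acc > best_acc:                      # best epoch of this interval
                best_acc = acc
                best_pred = model.predict(x_test, verbose=0)
        kept.append(best_pred)
    return np.mean(kept, axis=0)                    # ensembled predictions (P_x)
```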
Equation (2) gives the top N prediction indices $T_i$, which identify the top N classes for an input image:

$$T_i = \{\, j : P_d[j] \text{ is among the } N \text{ largest elements of } P_d \,\} \tag{2}$$

Here, $P_d$ is the array of prediction probabilities obtained at the end of training, and the subscript d indicates which dataset or channel was used during the training.
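A direct NumPy rendering of Equation (2) might look as follows (illustrative only):

```python
import numpy as np

def top_n_indices(pred_row, n=3):
    """T_i from Equation (2): indices of the N largest class probabilities
    for a single input image."""
    return np.argsort(pred_row)[-n:]

# Example: top_n_indices(np.array([0.05, 0.60, 0.10, 0.25]), n=2) -> array([3, 1])
```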
KeepNMax is described as follows:
1) Initially, new datasets were created from the original datasets by separating each channel of the RGB images. The images were divided into three channels, and each channel was saved as a grayscale image. At the end of this preprocessing, four datasets were obtained: the original dataset and the channel 1, 2, and 3 datasets.
2) The second part of the method involved training the original dataset with the selected model architecture and saving the predictions (P_0) on the test set.
In the experiments, CNN and ConvMixer models were used.
3) Next, channel 1 was trained with the model architecture used for the original dataset. In this step, the best epochs were obtained from the training intervals defined by I_x. For channel 1, I_x = l, where l represents the list of intervals. For instance, l = [80, 20] was used for the Cifar10 dataset with the CNN model, and the best epochs were selected from only two intervals: after the model was trained for 80 epochs and then for 20 more epochs. The test predictions of the chosen epochs were ensembled (P_1) and saved for further use.
4) In the next step, channel 2 was trained with the same model architecture, and the best epochs from the intervals given in k were used to obtain the prediction probabilities for the test set. The predictions (P_2) of this ensemble were saved.
5) Channel 3 was trained using the same specifications as channels 1 and 2. The intervals from p and the best epochs from these intervals were applied to the test set, and the resulting predictions (P_3) were saved.
6) Finally, the top N (for Cifar10, top 3) predictions were selected from P_1, P_2, and P_3 and added to the appropriate classes of P_0, giving P_0 += P_1(T_i) + P_2(T_i) + P_3(T_i); a sketch of this step is given at the end of this section.
As mentioned previously, two different models were used to evaluate the model dependency of the proposed method. The architecture of the first model is shown in Fig. 2. The model feeds the input into a convolutional layer with 32 filters of size 3 × 3. The next step is batch normalization (3), followed by max pooling with a 2 × 2 pooling size. The objective of batch normalization is to keep the mean of the output close to 0 and its standard deviation close to 1:

$$\hat{x}_i = \gamma \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{3}$$

Equation (3) gives the batch-normalization function, where the input is the batch and the output is the normalized form of the input; γ is a learned scaling factor, β is a learned offset, and ε is a small configurable constant. All the layers in this model use the rectified linear unit (ReLU) activation function, except for the last layer. The ReLU function (4) returns the positive part of its input,

$$\mathrm{ReLU}(x) = \max(0, x) \tag{4}$$

whereas the softmax function converts its inputs into a probability distribution. The softmax function (5) takes the vector z of the K class predictions and normalizes it into K probabilities:

$$\delta(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K \tag{5}$$

where $\delta : \mathbb{R}^K \to (0, 1)^K$, $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$, and K is greater than one.

The next layer is a convolutional layer with 64 kernels and the ReLU activation function, which outputs data to the batch-normalization layer. Subsequently, a max-pooling layer is applied to the output. The last convolutional layer uses 128 kernels of size 3 × 3, and the output is fed to the batch-normalization and pooling layers. To feed the output to the dense layers, it was flattened and input into a dense layer with 256 nodes, where the ReLU activation function was used. Another dense layer with 128 nodes was added to the model before the final classification layer. The last layer has as many nodes as there are classes in the dataset, and the softmax activation function is applied to this layer. The cross-entropy loss (6) was used to calculate the loss for updating the parameters of the model:

$$L = -\sum_{i=1}^{v} y_i \log(p_i) \tag{6}$$

In Equation (6), v represents the number of classes, y represents the label of the input, and p represents the predicted value for the input.
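The following Keras sketch reflects this description of model 1; the 3 × 3 kernel size of the 64-filter block is an assumption, as the text does not state it:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model1(input_shape=(64, 64, 3), num_classes=3):
    """Model 1 as described in the text (Fig. 2): three Conv-BN-MaxPool
    blocks, 256- and 128-node dense layers, and a softmax head."""
    return tf.keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),   # kernel size assumed
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```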
A detailed illustration of how this method works and achieves better results is presented in Fig. 4. If models with low accuracies are ensembled, the results improve under the effect of the UTP. As shown in Fig. 4, model A has the same number of true predictions as model B, but the predicted elements differ: model A can predict sunflowers and roses accurately, whereas model B cannot predict sunflowers or roses but has an advantage in predicting dandelions, which are not correctly classified by model A. When the two models are combined, their insights help to correctly classify more images than either model alone. Another factor that should be controlled in this process is the effect of each model's errors on the ensemble. To reduce the error rate, ensembling only the top N prediction probabilities was proposed, ensuring a low error rate and a high growth in the true predictions of the ensemble model. The proposed method achieved better results because it used the knowledge and prediction ability of the ensemble members, even though those members had lower accuracy than the best model in the training. Different epochs and their ensembles can correctly predict different images, and the truly predicted images of one model can differ from those of another; this gap inspired the present research. A sketch of the final top-N addition follows.
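This is a minimal sketch of the final KeepNMax step (step 6 above), assuming P_0 and the channel predictions are NumPy arrays of per-class probabilities:

```python
import numpy as np

def keep_n_max(p0, channel_preds, n=3):
    """Final KeepNMax step: add only the top-N class probabilities of each
    channel ensemble (P_1, P_2, P_3) to the corresponding classes of the
    main model's predictions P_0, then classify."""
    p0 = p0.copy()
    for p in channel_preds:                        # P_1, P_2, P_3
        for i in range(len(p)):
            t_i = np.argsort(p[i])[-n:]            # T_i, Equation (2)
            p0[i, t_i] += p[i, t_i]                # P_0 += P_x(T_i)
    return p0.argmax(axis=1)                       # predicted class per image
```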

IV. EXPERIMENTS AND RESULTS
In this section, comprehensive information is provided about the experiments on the nine datasets and the results obtained using the proposed method. In addition, the measurement of the prediction losses and gains during the ensemble process is described. Furthermore, the baseline methods, training setup, and evaluation metrics are presented.

A. EXPERIMENTAL DATASETS
Nine different datasets were used, most of which were selected from Kaggle computer vision classification datasets. To conduct objective research, datasets with different numbers of classes and images were used, which may have affected the results; using more similar datasets could have led to better results. To propose a general solution, datasets were chosen that could expose the gap addressed by this research. This choice also affects the applicability of the proposed method to other DL models and datasets.

1) ANIMAL FACES
This was the first dataset employed to train the model using the proposed method. The animal faces dataset was introduced by Choi et al. [29]. It consists of 16130 high-quality 512 × 512 images in three classes: cats, dogs, and wildlife. For the training, the images were rescaled (divided by 255) and resized to 64 × 64 pixels to reduce the training time. Images from this dataset were used to create Fig. 1 and Fig. 3.
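A minimal preprocessing sketch matching this description (and reusable, with other target sizes, for the datasets below):

```python
import tensorflow as tf

def preprocess(image, label):
    """Rescale pixel values to [0, 1] and resize to 64 x 64, as described
    for the animal faces dataset."""
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, (64, 64))
    return image, label

# Hypothetical usage with an (image, label) tf.data.Dataset:
# ds = ds.map(preprocess).batch(32)
```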

2) BEE OR WASP
The next dataset used for training was a bee or wasp dataset,1 which contains four types of images: bees, wasps, other insects, and others (non-insects). In the original split, 7942 images are used for training, 1719 for validation, and 1763 for testing. For these experiments, the data were re-split: out of 11421 images, 1500 were reserved for the test set, and the remaining data were trained with a validation split of 0.2. The images were rescaled by dividing by 255 and resized to 64 × 64 pixels.

3) CIFAR10
The third dataset used in the experiments was Cifar10,2 which contains a sufficient number of images for training with the proposed method. Cifar10 was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The dataset is well known as a labeled subset of the 80 Million Tiny Images dataset. Another advantage is that it consists of same-shape images, which makes the training easier and the training results more transparent. For training, the image size was not changed, and the input data were rescaled by dividing by 255. The obtained results were easy to present without augmentation effects or other preprocessing tools. Fig. 7 shows samples from the Cifar10 dataset.

4) FIRE
This dataset3 was created to distinguish images containing fire from those without fire. It is useful for binary classification tasks and includes 244 non-fire and 755 fire images of various sizes. The images were resized to 128 × 128 pixels and rescaled by dividing by 255.

5) A LARGE SCALE FISH DATASET
A large-scale fish dataset was introduced in [29] for segmentation and classification tasks. The dataset contains 9000 fish images in nine classes, each representing one type of sea fish. In the original dataset, the images have sizes of 2832 × 2128 and 1024 × 768 pixels. The images were resized to 224 × 224 pixels, and the dataset was rescaled by dividing each image by 255. 1000 images were used for the test set, 1600 images for the validation set, and the remaining images for training.

6) FLOWERS RECOGNITION
This dataset4 includes 4317 images of five different flowers: daisy, dandelion, rose, sunflower, and tulip. The image sizes are not fixed and are approximately 320 × 240 pixels. For training, the images were resized to 64 × 64 pixels and rescaled by dividing all the image channels by 255. 800 images from the five classes were used as the test set, and the remaining 3517 images were used for training and validation with a validation split of 0.1.

7) MALARIA CELL IMAGES DATASET
A malaria cell images dataset5 was also used to validate the method for binary classification. It contains 27558 infected and uninfected cell images. The image sizes are not fixed; therefore, the images were resized to 64 × 64 pixels and rescaled by dividing by 255. For the test set, 2000 images were selected, and the remaining 25558 images were used for training and validation with a validation split of 0.1.

8) PLANT DISEASE RECOGNITION DATASET
This dataset6 was selected because it contains images having channels with low variance; it was trained to check for dependence on the channel variance level. The dataset contains 1532 images: 1322 for training, 60 for validation, and 150 for the test set. The images have three channels and were resized to 128 × 128 pixels, and all the image channels were rescaled by dividing by 255.

4 https://www.kaggle.com/alxmamaev/flowers-recognition
5 https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria
6 https://www.kaggle.com/rashikrahmanpritom/plant-disease-recognition-dataset

9) VEHICLE DETECTION IMAGE DATASET
The vehicle detection image dataset7 consists of 17760 images. Half of the dataset shows vehicles, and the other half shows non-vehicle images. The dataset was divided into training, validation, and test sets: the test set consists of 1500 images, and the rest were used for training and validation with a validation split of 0.1.

B. BASELINE MODEL
FIGURE 14. Comparison of proposed and baseline models in terms of precision, recall, and F1-scores using the model 1 architecture (Fig. 2). C_T, Bag, R_S, and KeepNMax stand for classic training, bootstrap aggregation, random splits, and the proposed methodology, respectively.

In this study, classic training, bootstrap aggregation, and random splits8 were selected as baseline methods. These methods were selected because any other DL ensemble method can be applied on top of them, and they can work simultaneously with other DL ensemble methods. In addition, there was no other method similar to the proposed one against which the final results could be compared. For the fire dataset, the unique true prediction (UTP) scope of each model or ensemble with respect to another model or ensemble was calculated; thus, the loss and gain in true predictions resulting from the proposed method could be explained. When a model is trained using the proposed method, other ensemble models can additionally be used, as the method does not create a barrier to working with them while increasing the accuracy.
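The exact random-splits protocol is not detailed in the text; the sketch below shows one plausible reading, training one model per disjoint random partition of the training data and averaging the test predictions (`train_fn` is hypothetical, as in the bagging sketch):

```python
import numpy as np

def random_splits_predict(train_fn, X, y, X_test, n_models=3, seed=0):
    """Random-splits baseline (one plausible reading): train one model per
    disjoint random subset and average the test predictions."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_models)
    probs = [train_fn(X[idx], y[idx]).predict(X_test) for idx in parts]
    return np.mean(probs, axis=0)
```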
C. TRAINING SETUP
Python 3.6.12 and TensorFlow 2.1.0 were used to build the model architecture for the proposed method. The experiments used a 12 GB Nvidia Titan-XP GPU with CUDA 10.2 on a computer with an Intel Core i9-11900F CPU and 64 GB of RAM. For training, the weights of the model were randomly initialized, and the model was trained for a certain number of epochs using the Adam optimizer with the default learning rate of 0.001 and the sparse categorical cross-entropy loss function. The same setup was used to train all nine datasets.
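A minimal sketch of this setup; `build_model1` refers to the earlier architecture sketch, `train_ds`/`val_ds` are hypothetical data pipelines, and the epoch count is illustrative:

```python
import tensorflow as tf

# build_model1 is the sketch from Section III; train_ds and val_ds are
# hypothetical tf.data pipelines prepared as in Section IV-A.
model = build_model1(input_shape=(64, 64, 3), num_classes=3)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # default learning rate
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(train_ds, validation_data=val_ds, epochs=100)
```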

D. EVALUATION METRICS
The proposed method mainly focuses on the accuracy of the model. Previous studies have used different metrics, such as the F1 score, recall, and ROC. In this research, two metrics that meaningfully explain the method's achievements on different datasets were mainly focused on; to present comparable results, precision, recall, and F1 scores were also reported. All the datasets used were balanced, so the effects of different class weights on the final results were not evaluated. The accuracy was defined as the ratio of the number of true predictions to the total number of cases used to evaluate the model:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
TP corresponds to the truly predicted positive results, TN corresponds to the truly predicted negative results, FP corresponds to the falsely predicted positive results, and FN corresponds to the falsely predicted negative results.
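These counts yield the reported metrics directly; a small sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the counts defined above."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```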
TABLE 1. Comparison of proposed and baseline models in terms of accuracy using the model 1 architecture (Fig. 2). C_T, Bag, R_S, Ch1, Ch2, Ch3, and KeepNMax stand for classic training, bootstrap aggregation, random splits, channel 1, channel 2, channel 3, and the proposed methodology, respectively.

The second evaluation metric was the unique true prediction, UTP (11), which identifies the percentage of unique true predictions of one model with respect to another:

$$UTP(X, Y) = \frac{|X \setminus Y|}{N_{\text{test}}} \times 100\% \tag{11}$$

UTP(X, Y) finds the unique true predictions of model X with respect to model Y, where X represents the prediction scope of model X, Y represents the prediction scope of model Y, and N_test is the number of test images. These metrics explain why the proposed model achieved better results than the main model, which was trained only on the main dataset: the indices of the truly predicted images differ among the models even when the models have the same accuracy, and this allows the ensemble to achieve better results.
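A minimal sketch of the UTP metric under the reconstruction in Equation (11), where predictions and labels are NumPy arrays of class indices:

```python
import numpy as np

def utp(pred_x, pred_y, y_true):
    """UTP(X, Y) as in Equation (11): percentage of test images predicted
    correctly by model X but not by model Y."""
    x_scope = pred_x == y_true   # prediction scope of model X
    y_scope = pred_y == y_true   # prediction scope of model Y
    return 100.0 * np.mean(x_scope & ~y_scope)
```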

E. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, the experimental results and detailed discussions were presented. The validation accuracy and UTP were used as evaluation metrics in this study. These two metrics were selected to focus on explaining the work and clarifying future studies in this field. The prediction scope was calculated and obtained a unique true prediction for each member of the ensemble with respect to the main model.
For training, nine datasets with balanced classes were used. When the model was trained with the model 1 architecture (Fig. 2), only four of the nine datasets achieved a higher accuracy than classic training. Table 1 presents the evaluation results of the datasets for the model 1 architecture. For the animal faces dataset, 100% accuracy was achieved by applying the proposed method. For the fire dataset, 98.99% accuracy was achieved when KeepNMax was applied. The trained model was also successful for the malaria cell image dataset.

TABLE 2. Comparison of proposed and baseline models in terms of accuracy using the model 2 architecture (Fig. 3). C_T, Bag, R_S, Ch1, Ch2, Ch3, and KeepNMax stand for classic training, bootstrap aggregation, random splits, channel 1, channel 2, channel 3, and the proposed methodology, respectively.

TABLE 3. Unique true predictions (UTP) of models using the model 1 architecture (Fig. 2) and the fire dataset.
For the malaria cell image dataset, 98.9% accuracy was achieved. For the vehicle detection image dataset, channel 1 of model 1 achieved 100% accuracy on the test set after KeepNMax was used. To present comparable results, the precision, recall, and F1 scores of the datasets are illustrated in Fig. 14. To improve the reliability of the proposed method, another model architecture was used that employs a new approach for data representation, ConvMixer, as shown in Fig. 3. ConvMixer uses image patches as inputs and achieved excellent results; it was also used to check the model dependency of the proposed method. When ConvMixer was used as the model architecture for training, the final results differed significantly from the previous model's results. Table 2 presents the evaluation results of the datasets for the model 2 architecture (Fig. 3). With this model, the bee or wasp dataset exhibited better results, with an accuracy of up to 91.53%. For Cifar10, the large-scale fish dataset, the flowers recognition dataset, and the plant disease recognition dataset, the accuracies were 84.02%, 92.21%, 68.5%, and 59%, respectively. These five datasets exhibited better results with the proposed method than with classic training. The two model architectures yielded two different sets of prediction probabilities; thus, the final prediction scope depended on the models used in the experiments. The growth and losses after the application of KeepNMax can be explained by the enlarging and shrinking of the prediction scopes of the models, which were directly affected during KeepNMax's application. The difference between classic training and KeepNMax can be explained using the UTP metric, which indicates the unique true predictions of the models. Table 3 presents the UTP between the channels, KeepNMax, and the classic trained model. Clearly, the application of KeepNMax changed the UTP of the models. In classic training, the accuracy was higher than the channel accuracies; however, when ensembling was performed using the proposed method, insights from the images were gained to predict more images. The precise view of the UTP provides a clear understanding of the reasons for the accuracy growth of KeepNMax. With regard to the number of true predictions, channel 1 differed from channel 3 by 7%, and channel 2 differed from the classic training by 2%. When these models were ensembled, the number of true predictions of the ensemble exceeded that of the classic training. The top N maximum predictions of the channels were used to reduce the effects of errors on the ensemble model. Consequently, the prediction probabilities of the models added knowledge and reduced the error when the models were ensembled. In this study, the aim was to reduce the error and increase the probability of obtaining clear insights into the input data.

V. CONCLUSION
Our experiments proved that keeping knowledge during and after training can increase the prediction scope of the models when they are ensembled using the proposed method. Nine datasets were used, and better results were obtained using KeepNMax. The results were excellent; for some datasets, 100% accuracy was achieved on the test set. This was explained by the prediction scopes of the ensemble members in the experiments. In addition, it was shown that a large error or knowledge loss can be fixed by substituting the training model with another. In the training, KeepNMax affected the results of the datasets differently, depending on the model architecture. The epoch ensemble and image channel ensemble were used, and only the top N predictions from the classification probabilities were added to the main model. This reduced the error and increased the accuracy. However, there is room for improvement: the large computational burden of training and the small knowledge loss during ensembling are challenging issues to be resolved in further studies.