Effect of dual-convolutional neural network model fusion for Aluminum profile surface defects classification and recognition

Classifying and identifying surface defects is essential during the production and use of aluminum profiles. Recently, the dual-convolutional neural network (CNN) model fusion framework has shown promising performance for defect classification and recognition. Spurred by this trend, this paper proposes an improved dual-CNN model fusion framework to classify and identify defects in aluminum profiles. Compared with traditional dual-CNN model fusion frameworks, the proposed architecture involves an improved fusion layer, fusion strategy, and classifier block. Specifically, the suggested method extracts the feature map of the aluminum profile RGB image from the pre-trained VGG16 model's pool5 layer and the feature map of the maximum pooling layer of the suggested A4 network, which is added after the Alexnet model. Then, weighted bilinear interpolation upsamples the feature maps extracted from the maximum pooling layer of the A4 part. The added network layer and the upsampling scheme ensure equal feature map dimensions, so that the feature maps can be merged utilizing an improved wavelet transform. The fused feature map is then input into the classifier block for classification, where global average pooling is employed instead of dense layers to reduce the model's parameters and avoid overfitting. The experimental setup involves data augmentation and transfer learning to prevent overfitting due to the small-sized data sets exploited, while K-fold cross-validation is employed to evaluate the model's performance during the training process. The experimental results demonstrate that the proposed dual-CNN model fusion framework attains a classification accuracy higher than current techniques: 4.3% higher than Alexnet, 2.5% higher than VGG16, 2.9% higher than Inception v3, 2.2% higher than VGG19, 3.6% higher than Resnet50, 3% higher than Resnet101, and 0.7% and 1.2% higher than the conventional dual-CNN fusion frameworks 1 and 2, respectively, proving the effectiveness of the proposed strategy.


Introduction
The aluminum profile is a relatively common material in infrastructure construction and industrial manufacturing that is lightweight, strong, corrosion-resistant, formable, and recyclable [1]. It is extensively used in rail transit, construction facilities, automobile manufacturing, equipment manufacturing, medical equipment, and other industries [2]. During the production or use of aluminum profiles, external factors may cause defects of various sizes and shapes, seriously affecting the safety and reliability of the profiles. Therefore, detecting defects and ensuring the surface quality of aluminum profiles is important for improving the product's service life [3].
In the past, object defects were commonly detected manually. This was simple but highly repetitive, costly, and labor-intensive work, whose accuracy and stability could not be guaranteed. With the advancement of optical instruments, numerous scholars have used machine vision to realize defect recognition and improve the detection stability and recognition rate [4]. For example, Gao et al. [5] exploit thermal imaging technology to propose a low-rank tensor sparse mixture Gaussian (MoG) decomposition algorithm for natural crack detection. Their method reduces noise interference and extracts crack information to realize metal defect detection. Luo et al. [6] suggest a hybrid spatial and temporal deep learning architecture for automatic thermography defect detection that extracts internal defect information of composite materials with complex shapes and patterns. Similarly, Hu et al. [7] developed a hybrid multi-dimensional feature fusion structure involving spatial and temporal segmentation appropriate for automated thermography defect detection of composite materials. Ahmed et al. [8] use the optical pulse thermal imaging diagnosis system and propose a joint sparse low-rank matrix decomposition algorithm to separate weak defect information from intense noise in composite materials and improve defect resolution. Sun et al. [9] investigate weld defect detection and classification based on machine vision. They categorize the weld defects and suggest a modified background subtraction method based on Gaussian mixture models to extract the feature areas of the weld defects, which are then employed to design classification algorithms. Zhang et al. [10] design an image acquisition system to simultaneously collect weld images and propose a new CNN classification model with 11 layers to identify weld penetration defects based on weld imagery. Bao et al. [11] propose a Triplet-Graph Reasoning Network (TGRNet), which combines surface defect triples (including a triple encoder and triple loss) to segment the background and defect areas, and separates them into metal and non-metal classes (leather and tile). For this method, the data is centralized to verify the network's effectiveness. Shervan et al. [12] focus on the surface defect detection problem, considering a new noise-resistant and multi-resolution version of LBP to extract surface features. Additionally, the authors propose a surface defect detection algorithm that is invariant to the texture descriptor. The effectiveness of this technique is verified on architectonic stone and fabric textile. Jong et al. [13] suggest a new convolutional variational autoencoder (CVAE) to generate sufficient defect data, together with a defect classification algorithm based on a deep CNN for metal surface defect detection. Ihor et al. [14] design an automated method for detecting and classifying three types of surface defects in rolled metal, and use Resnet50 for feature extraction and defect classification.
Guan et al. [15] utilize VGG19 to extract steel surface defects and suggest different feature layers originating from the defect weight model. Then the authors employ SSIM and a decision tree to evaluate the image quality and adjust the network's structure and classify steel surface defects.
The above methods use image processing and deep learning to extract the defect features of various objects effectively and, to a certain extent, provide insights for the method developed in this paper. Since this article mainly considers identifying and classifying defects on the surface of aluminum profiles, the following paragraphs review the related literature. Defect recognition exploiting conventional machine vision mainly includes image capturing, feature extraction and definition, image preprocessing, and defect recognition [16]. In this regard, the defect recognition accuracy is seriously affected by the accuracy of the feature extraction process and the method defining the features. Liu et al. [17] employ the gray-level co-occurrence matrix algorithm and the Gabor wavelet transform method to extract the surface texture features of aluminum profiles. They classify the features based on the radial basis function kernel SVM (Support Vector Machines) classification algorithm. Chondronasios et al. [18] propose a new technology based on the gradient-only co-occurrence matrix (GOCM) and the Sobel operator to extract and define the surface features of aluminum profiles. The authors use two-layer ANNs to classify the surface defects of aluminum profiles. Although traditional machine vision-based methods utilize image processing for surface defect feature extraction and defect classification, the feature extraction and defect definition require manual processing and empirical judgment by engineers [19], which lacks robustness and is not conducive to operation.
Recently, deep learning has been extensively used in various applications, including feature extraction and classification of aluminum profile surface defects, due to its ability to learn image features automatically. In the context of aluminum profile surface defects, Li et al. [20] rely on the adaptive threshold method to binarize the surface image of the aluminum plate, extract image features, and implement surface defect classification through a three-layer BP neural network. Wei et al. [21] utilize Resnet101 as the primary network and propose a multi-scale defect detection network based on deep learning to identify and classify surface defects of aluminum profiles. Neuhauser et al. [22] propose a VGG16-based architecture suitable for actual industrialization that exploits transfer learning and data augmentation to enlarge the data set and avoid model overfitting. Zhang et al. [23] design an attention mechanism to detect surface defects of aluminum profiles. This method initially exploits the category representation network to extract the common category feature map (CCM). Then, the attention module generates the proposed feature map (PM), and a rare category feature map (RCM) is formed through CCM and PM. After that, the score of the defect category is obtained through CCM and RCM spatial pooling for defect identification. Chen et al. [24] propose an aluminum profile surface defect detection method relying on a deep self-attention mechanism (DSAM) under hybrid noise conditions. This technique employs the residual learning strategy to obtain the defect feature map from the image, adds the corresponding weight matrix to the defect feature map to achieve fine feature extraction, and finally adds a softmax classification layer for defect recognition. Liu et al. [25] develop a semi-supervised anomaly detection method, entitled Dual Prototype Auto-Encoder (DPAE).
During the training phase, a dual prototype loss and reconstruction loss are introduced to encourage the latent vector generated by the encoder to be closer to its own prototype. Finally, the distance between the image's latent vectors is used to detect and identify the surface defects of the aluminum profile.
The above works exploit deep learning to identify and classify the surface defects of aluminum profiles and achieve good experimental results. Additionally, compared with traditional machine vision, deep learning-based feature extraction is more robust. However, there are still some issues that need to be resolved. For example, current deep learning methods utilize a single input source and a single neural network model, and the characteristic information extracted from a single information source through one neural network cannot fully reflect the characteristics of the object examined [26].
To solve these problems, the defect classification accuracy can be enhanced through a dual-convolutional neural network (CNN) model fusion framework that extracts the input source features separately and then fuses them. A dual-CNN model fusion framework may take two forms, employing either two different input sources or the same input source. In the former case, the same neural network model extracts features of the different input sources and then merges them for classification [26][27][28][29][30][31]. This case involves neural network models with specific structure differences, e.g., CNN convolution kernel size and number, and several operation differences in the model learning process. In the same input source case, the classification performance varies depending on the extracted features [32]. For this fusion scheme, two different convolutional neural networks separately extract features that are then merged, and the networks are designed so that the extracted features complement each other [32,33].
Duan et al. [26] propose a dual-CNN model fusion framework based on gradient images to identify and classify the surface defects of aluminum profiles. The original and gradient images are used as two different input sources, while both neural network models use Alexnet and realize feature fusion through wavelet transform fusion. Then the fused features are input into the SVM classifier block for defect classification. Akilan et al. [33] use the VGG16 and Alexnet networks to extract features from two identical input sources, and employ PCA (Principal Component Analysis) and energy normalization to form a feature space. This work also utilizes algorithm rules (Sum, Average, Max, Min) to fuse the features, with several rules being evaluated to select the optimal fusion strategy. The fused features are then input into an SVM classifier block for classification. Experimental results employing this method demonstrate that the Sum strategy is effective in most data sets. The first fusion framework mentioned above combines the output features of the first dense layer of the two Alexnet models, while the second fusion framework combines the output features of the first dense layer of VGG16 and the second dense layer of Alexnet. Both model fusion frameworks have a common attribute: the fused features are first input to the first dense layer of the classifier block, and then classification is achieved through multiple network layers (these fusion frameworks are introduced in the next part of this article). It should be noted that, given the lack of research on recognizing and classifying aluminum profile surface defects utilizing a dual-CNN fusion framework, this article mainly refers to methods applied in other fields, aiming to suggest the necessary improvements to facilitate a solution appropriate for aluminum profiles.
This article proposes an improved dual-CNN model fusion framework that uses the same input source and different convolutional neural networks (VGG16 and Alexnet). We add multiple network layers (including convolution, pooling, and activation) after the Alexnet network and before the feature fusion process. The RGB image feature map is extracted from the last maximum pooling layer in the pre-trained VGG16 and the last maximum pooling layer in the network layer added after the Alexnet network. Then, we use weighted bilinear interpolation to upsample the maximum pooling layer feature maps of the network layer added after Alexnet to ensure that the feature maps output by the two models have the same dimensions. Feature map fusion relies on the improved wavelet transform fusion method. Finally, our method develops a classifier block (see Section 2) utilizing a global average pooling layer instead of a dense layer.
Compared with traditional dual-CNN model fusion frameworks [26,33], we extract the feature maps of the largest pooling layer at the end of the proposed CNN model, fuse these feature maps, and use global average pooling for classification rather than dense layers. This strategy preserves more local feature information extracted from the image and reduces the model's dimensionality, making the network easier to train, avoiding too many weight parameters when the feature map enters the dense layer, which leads to overfitting during the model training process [34][35][36]. Regarding the feature fusion strategy, the improved dual-CNN model fusion framework uses an improved wavelet transform that combines the Canny operator and the area energy method (see Section 2). The remainder of this article is organized as follows. Section 2 introduces the dual-CNN model fusion framework, network layer function, up-sampling, feature fusion methods, and model training methods proposed in this paper. Section 3 describes the experimental setup and the evaluation metrics, while Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this work.
To improve readability, some of the abbreviations presented throughout the text are defined as follows. Support Vector Machines (SVM) is a class of generalized linear classifiers that classifies binary data in a supervised learning manner. Its decision boundary is the maximum hyperplane margin solved for the learning sample. Principal Component Analysis (PCA) is a standard data analysis method, often used for dimensionality reduction of high-dimensional data that can be utilized to extract data's main feature components. Local Response Normalization (LRN) is a local normalization method that primarily prevents the neural network model from overfitting during the training process.

Methods
This section mainly introduces the related methods utilized in the experiments of Section 3, including the dual-CNN model fusion framework, the definition of the relevant network layers, the feature map upsampling method, the feature fusion strategy, transfer learning, data augmentation, and model performance evaluation methods.

Dual-convolution neural network model fusion frameworks
A traditional convolutional neural network framework includes a neural network model and a single classifier to extract feature information from the input source. This framework is called a single convolutional neural network. In contrast, the multi-convolutional neural network model fusion framework involves multiple convolutional neural network models that extract several features from given training data and inputs the fused features into a single classifier for classification [28]. The dual-CNN model fusion framework includes two network models. The input source features are extracted from the two models, are fused [29,30,32], and are then input into a single classifier for classification. Figures 1 and 2 illustrate the two different dual-CNN model fusion strategies [26,33].
The aluminum profile images are the input source of both fusion frameworks, which are employed to analyze the changes of the corresponding feature maps. The input source of Figure 1 is the raw image (224×224×3), and the CNN network structure involves the pre-trained VGG16 and Alexnet models. The input source of Figure 2 comprises two different images, namely the original (224×224×3) and its variant after image processing (224×224×3), i.e., gradient processing to form a gradient image and enhance the image's edge information [26]. This CNN network structure uses two Alexnet models, each processing one input image independently. Figure 1 highlights that the first dual-convolutional network model fusion framework combines the output features of the first dense layer of VGG16 and the second dense layer of Alexnet. The architecture presented in Figure 2 combines the output features of the first dense layer of each Alexnet model, and the fused feature map is input to the classifier block through the dense layer for classification. The feature maps passed from the last maximum pooling layer to the first dense layer are 7×7×512 for VGG16 and 6×6×256 for Alexnet. The output dimension of this dense layer is 4096×1, and the number of weight parameters is 102760448 and 37748736, respectively. It should be noted that the excessive number of weight parameters during training increases the possibility of model overfitting [34].
The proposed dual-CNN model fusion framework is illustrated in Figure 3. The image input dimension is 224×224×3, and the feature map is extracted from the pre-trained VGG16 and the Alexnet models. The output feature map acquired from the V5 part of the VGG16 model is 512×7×7, while the network layer A4 added after the Alexnet model ensures 512 output feature map channels. Then we upsample the feature map generated by the maximum pooling layer in the A4 part from 3×3 to 7×7, and the Feature Fusion part performs feature fusion preserving the feature maps' channel number and size. The fused feature map is 512×7×7 and is directly input to the global average pooling layer. This layer only averages the feature map and has no weight parameters to train, which avoids overfitting due to an excessive number of parameters. The output size is 512×1×1, and finally, the classifier block classifies the output result. To the best of our knowledge, this dual-CNN model fusion framework employing global average pooling instead of dense layers to build classifier blocks has not yet been applied to classify and detect aluminum profile defects. The global average pooling scheme is introduced in detail in the section "A brief overview of convolutional neural networks", and the upsampling and feature fusion schemes in the section "Upsampling and Feature fusion".
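The parameter counts quoted above follow directly from the feature map shapes. The short sketch below reproduces the arithmetic; the helper name `dense_weights` is hypothetical and biases are ignored, as in the figures quoted in the text.

```python
def dense_weights(height, width, channels, units):
    """Weights (biases excluded) of a dense layer fed by a flattened feature map."""
    return height * width * channels * units


# First dense layer after VGG16's last pooling layer (7x7x512 -> 4096).
vgg16_dense = dense_weights(7, 7, 512, 4096)
# First dense layer after Alexnet's last pooling layer (6x6x256 -> 4096).
alexnet_dense = dense_weights(6, 6, 256, 4096)
# Global average pooling maps 512x7x7 to 512x1x1 and has no trainable weights.
gap_weights = 0

print(vgg16_dense, alexnet_dense, gap_weights)
```

Replacing the first dense layer with global average pooling therefore removes on the order of 10^8 trainable weights from the VGG16 branch alone.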
The above dual-CNN model fusion framework exploits VGG16 and Alexnet as the primary neural network models. Both have demonstrated outstanding results in image classification and target detection tasks and generalize well when transferred to other image data domains [19,37,38]. Furthermore, other existing CNN models may impose some training complexity due to complex connections and deeper structures, while VGG16 and Alexnet are "straight-type" structures. As the number of network layers increases, the extracted features represent finer details, better facilitating feature fusion. Additionally, Alexnet is the 2012 ImageNet competition champion, containing five convolutional layers (with activation and pooling layers) and three fully connected layers, and the classifier output category is 1000. The VGG16 model comes from the VGG family, which won the localization task and was runner-up in the classification task of the 2014 ILSVRC competition; it contains 13 convolutional layers (with activation and pooling layers) and three fully connected layers, i.e., 16 weight layers in total.

A brief overview of convolutional neural networks
(1) Convolution layer and pooling layer. The convolution layer extracts data features from the input image through convolution operations [39]. Convolution is a linear operation between the input image and the convolution kernel, in which a dot product is computed between the convolution kernel and the receptive field of the input image. The convolution kernel slides over the input with a specific step size (stride) to cover the successive receptive fields. The convolution function is:
$$y_j = w_j * x + b_j \tag{1}$$
where $*$ denotes the convolutional operation, $w_j$ and $b_j$ are the weight and bias vectors of the convolutional kernel $j$, respectively, $x$ denotes the input of the convolutional layer, and $y_j$ is the corresponding output. The pooling layer aims to downsample the feature map, compress the model features, and simplify the network complexity. The pooling process can be distinguished into Max pooling and Mean pooling. In the proposed CNN network structure, the model uses maximum pooling in the early stage to reduce redundant features and extract texture and other features, while in the later stages, average pooling retains the image background features [40].
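As an illustration of the two operations just described, the following minimal sketch slides a kernel over a small single-channel image and then applies max pooling. It is not the paper's implementation: the function names, the 5×5 image, and the 2×2 kernel are assumptions chosen for readability, and, as in most deep learning frameworks, the kernel is applied without flipping (cross-correlation).

```python
def conv2d(image, kernel, bias=0.0, stride=1):
    """Valid (no padding) sliding-window convolution of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            acc = bias
            # Dot product between the kernel and the current receptive field.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def max_pool(fmap, size=2, stride=2):
    """Max pooling: keep the largest value in each size x size window."""
    out = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


# Hypothetical 5x5 input and 2x2 diagonal-difference kernel.
image = [[1, 2, 3, 0, 1],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 3],
         [3, 2, 1, 0, 1]]
kernel = [[1, 0],
          [0, -1]]

features = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool(features)        # 2x2 map after 2x2 max pooling
print(features)
print(pooled)
```

With stride 1 and no padding, a 5×5 input and a 2×2 kernel give a 4×4 feature map, which the 2×2 pooling window with stride 2 reduces to 2×2.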
(2) Global average pooling layer. Global Average Pooling (GAP) is a method for spatial dimensionality reduction through pooling. Employing global average pooling rather than dense layers reduces the model parameters, avoids overfitting, and improves the entire network's generalization ability. Additionally, the spatial and semantic information extracted by each convolution and pooling layer is preserved [35,41,42]. In this paper, the feature map produced by our method after fusion is of size 512×7×7. After fusion, we reduce the model's parameters by utilizing global average pooling to calculate the average of all pixels within each channel. The final output is 512×1×1, with the corresponding schematic diagram of the global average pooling illustrated in Figure 4. GAP replaces the dense layer that generates many parameters after the feature fusion process. Since the global average pooling layer has no parameters, it prevents overfitting in this layer, integrates global spatial information, and is more robust to the spatial translation of the input image.
(3) Dense layer with dropout. The dense layer resizes the features extracted by the convolutional and pooling operations and guarantees that features can be mapped regardless of their sizes. When the training sample size is small and the model parameters are many, overfitting is prone to occur, and the model's generalization ability is weakened. Dropout reduces the possibility of overfitting, achieving a regularization effect [43], i.e., during the forward propagation process, a neuron's activation is set to zero with a probability given by the defined dropout rate. Dropout reduces the complex co-adaptation between neurons and avoids overfitting. Following [44], we set the dropout rate parameter to 0.5.
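The global average pooling step from item (2) can be sketched in a few lines. The example below is only illustrative: the function name and the synthetic 512×7×7 input (channel k filled with the constant k) are assumptions, not data from the experiments.

```python
def global_average_pooling(feature_maps):
    """Average every channel (a 2D list) down to a single value.

    feature_maps: list of C channels, each H x W. Returns a list of C means.
    """
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_maps]


# Synthetic fused feature map of size 512 x 7 x 7 (channel k is constant k).
fused = [[[float(k)] * 7 for _ in range(7)] for k in range(512)]

pooled = global_average_pooling(fused)  # 512 values, i.e. 512 x 1 x 1
print(len(pooled), pooled[:4])
```

The operation involves no trainable weights, which is the property the text relies on when replacing the dense layer.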
(4) Batch normalization layer. Adding batch normalization to the training process of the CNN model can achieve a stable activation value distribution, ensure the input data distribution per layer is relatively stable, and accelerate the model's learning process. Batch normalization reduces the model's sensitivity to the network's parameters, simplifies the tuning process, stabilizes the network learning, and has a certain regularization effect in the model training process [45]. Thus, we use batch normalization to keep all layer inputs in the same range in the classifier block.
(5) Activation layer and softmax. The CNN model implements linear operations through convolutional layers in the forward propagation process. Stacking only linear transformations in the network causes data expansion and leaves the model with insufficient classification capability. The activation layer completes the nonlinear data transformation, performs data normalization, prevents overflow caused by excessive data, and increases the network's capabilities. ReLU (Rectified Linear Unit) was introduced as a nonlinear activation function, increasing network nonlinearity, preventing gradients from vanishing, and reducing network training time [46].
The Softmax layer is placed at the end of the model, and its function is to map the generated sample label space to (0,1) as the result of the classification task. The Softmax function is given by:
$$P\left(y^{(i)} = k \mid x^{(i)}; \theta\right) = \frac{\exp\left(\theta_k^{T} x^{(i)}\right)}{\sum_{j=1}^{n} \exp\left(\theta_j^{T} x^{(i)}\right)} \tag{2}$$
where $x^{(i)}$ is the softmax layer input, $P(y^{(i)} = k \mid x^{(i)}; \theta)$ represents the probability that the i-th training example belongs to class $k$, $\theta$ denotes the layer parameters, and $n$ denotes the class cardinality of the model output, with the sum of the class probabilities being one. Finally, the proposed dual-CNN fusion framework outputs the probability values of the four categories through a softmax layer. Thus we set n = 4. We employ cross-entropy as the loss function with softmax to determine how close the actual output is to the expected output, as in multi-classification tasks, the experimental effect of cross-entropy is closer to the ideal value. The cross-entropy loss function is:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{n} y_{i,k} \log p_{i,k} \tag{3}$$
where $y_{i,k}$ is the true label value, $p_{i,k}$ represents the probability value corresponding to the k-th label under the i-th sample, and N is the total number of samples.
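To make the softmax and cross-entropy definitions concrete, the sketch below evaluates both for n = 4 classes. It is a minimal illustration under stated assumptions: the logits and one-hot labels are made up, the function names are hypothetical, and the maximum logit is subtracted before exponentiation purely for numerical stability (this does not change the result).

```python
import math


def softmax(logits):
    """Map a list of n scores to n probabilities that sum to one."""
    shift = max(logits)  # stability trick; cancels out in the ratio
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_labels, probabilities):
    """Mean cross-entropy over N samples with one-hot true labels."""
    loss = 0.0
    for y, p in zip(one_hot_labels, probabilities):
        loss -= sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p))
    return loss / len(one_hot_labels)


# Two hypothetical samples, four defect categories (n = 4).
logits = [[2.0, 1.0, 0.1, -1.0],
          [0.0, 0.0, 0.0, 0.0]]
labels = [[1, 0, 0, 0],
          [0, 0, 1, 0]]

probs = [softmax(z) for z in logits]
loss = cross_entropy(labels, probs)
print(probs, loss)
```

For the second sample all four scores are equal, so each class receives probability 0.25 and the sample contributes log 4 to the loss before averaging.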
(6) Classifier block. Figure 5 presents the classifier block, including a global average pooling, dense, batch normalization, ReLU, dropout, and softmax layer. Because a dense layer placed directly after feature fusion would introduce too many parameters, utilizing global average pooling instead of that dense layer reduces the model weight parameters and avoids overfitting. Accordingly, batch normalization keeps all layer inputs in the same range, dropout acts as a regularization technique that prevents overfitting, and Softmax defines the output category probability.

Upsampling and Feature fusion
(1) Upsampling methods. When the feature maps of the VGG16 and Alexnet models are to be fused, their sizes are inconsistent. Thus, we upsample the feature map of the Alexnet model to expand its size. Common upsampling methods include deconvolution, depooling, and interpolation [34]. The interpolation method is simple to operate and easy to implement, and thus in this work, we employ interpolation for upsampling. Standard interpolation methods mainly include Nearest Neighbor, Bilinear, and Bicubic Interpolation [47]. The Nearest Neighbor interpolation algorithm is computationally efficient but introduces noticeable distortion, mosaic, and aliasing [48]. Therefore, this article mainly compares the effects of Bilinear, Bicubic, and Weighted Bilinear interpolation.
Bilinear interpolation (BI): Figure 6 presents a schematic diagram of a bilinear interpolation process, with the pixel value at point $P$ being the one to determine. $Q_{12}$ and $Q_{22}$ are pixels with known pixel values in the same direction. The pixel value of $R_2$ can be obtained by linear interpolation between $Q_{12}$ and $Q_{22}$, and the pixel value of $R_1$ by linearly interpolating $Q_{11}$ and $Q_{21}$. Finally, the pixel value of point $P$ can be calculated by linearly interpolating $R_1$ and $R_2$. For this process, the involved formulas are Eqs 4-6, in which the output of function $f$ is the pixel value at the given point. Given the known values of $Q_{11}(x_1, y_1)$, $Q_{12}(x_1, y_2)$, $Q_{21}(x_2, y_1)$, $Q_{22}(x_2, y_2)$, where $x$ and $y$ are pixel coordinates, by using bilinear interpolation, we first interpolate $Q_{11}$ and $Q_{21}$ in the $x$ direction to get:
$$f(R_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{4}$$
According to $Q_{12}$ and $Q_{22}$:
$$f(R_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{5}$$
Then interpolate $R_1$ and $R_2$ in the $y$ direction to get:
$$f(P) = \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{6}$$
Weighted bilinear interpolation additionally applies weight values in the $x$ and $y$ directions, and Eq 8 is the calculation formula for the pixel value of point $P$ after the weights are added.
Bicubic interpolation (BCI): Bicubic interpolation differs from bilinear interpolation in that it fits more neighboring data. Assuming that the original image size is (m, m) and the interpolated target image size is (M, M), we first determine the image ratio relationship m/M = 1/K, so that the unknown point P(X, Y) in the target image corresponds to the coordinates p(X/K, Y/K) on the original image. Bicubic interpolation needs to find the nearest 16 pixels around point p. Then the bicubic function is constructed to calculate the weight of the 16 nearest pixels, and the pixel contribution value is obtained by the product of the weight and the pixel value [48].
$$w(x) = \begin{cases} (a + 2)|x|^{3} - (a + 3)|x|^{2} + 1 & \text{for } |x| \le 1 \\ a|x|^{3} - 5a|x|^{2} + 8a|x| - 4a & \text{for } 1 < |x| < 2 \\ 0 & \text{otherwise} \end{cases} \tag{9}$$
where w(x) is the bicubic function that obtains the coefficients corresponding to the 16 adjacent pixels to pixel p, and a = −0.5. The weight of the 16 pixels can be calculated from Eq 9, and the pixel value of P can be calculated from:
$$f(P) = \sum_{i=0}^{3} \sum_{j=0}^{3} f(p_{ij})\, w(x_i)\, w(y_j) \tag{10}$$
where $p_{ij}$ represents the pixel to be fitted, $w(x_i)$ the weight on the abscissa of $p_{ij}$, and $w(y_j)$ the weight on the ordinate of $p_{ij}$.
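The interpolation steps of Eqs 4-6 and the bicubic weight function of Eq 9 can be sketched as follows. This is a hedged illustration only: it implements plain (unweighted) bilinear interpolation with corner alignment, not the weighted variant adopted in this work, whose weights are not reproduced here; the function names and the 3×3 ramp input are assumptions.

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Plain bilinear upsampling of a 2D list (corners of input and output aligned)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1)      # source row coordinate
        y1 = min(int(y), in_h - 2)            # upper neighbouring row
        ty = y - y1
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1)  # source column coordinate
            x1 = min(int(x), in_w - 2)        # left neighbouring column
            tx = x - x1
            # Interpolate along x on the two neighbouring rows ...
            top = (1 - tx) * fmap[y1][x1] + tx * fmap[y1][x1 + 1]
            bottom = (1 - tx) * fmap[y1 + 1][x1] + tx * fmap[y1 + 1][x1 + 1]
            # ... then along y between the two intermediate values.
            row.append((1 - ty) * top + ty * bottom)
        out.append(row)
    return out


def bicubic_weight(x, a=-0.5):
    """Bicubic kernel w(x) with the parameter a = -0.5 stated in the text."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0


# Hypothetical 3x3 map (a linear ramp) upsampled to 7x7, as for the A4 output.
small = [[0.0, 1.0, 2.0],
         [3.0, 4.0, 5.0],
         [6.0, 7.0, 8.0]]
up = bilinear_upsample(small, 7, 7)
print(len(up), len(up[0]), up[3][3])
```

Because the input is a linear ramp, bilinear interpolation reproduces it exactly on the finer grid, which makes the result easy to check by hand.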
(2) Feature fusion methods. Currently, several feature fusion methods exist, the most common being sum, maximum, and wavelet transform fusion [26], and each feature fusion method has a particular impact on the experiment's accuracy. During the experiment, we compare the classification accuracy of various fusion methods, including sum, maximum, wavelet transform, and improved wavelet transform fusion. These methods are introduced next, and their effect on the classification accuracy is presented in Section 4.
Sum fusion (SF): This is a standard feature fusion method, which is often utilized in image pixel-level feature fusion and feature-level fusion schemes [26,33]. The summation fusion involves adding the corresponding pixels in the same dimension of the two feature maps. The summation fusion formula is:
$$F_k(i, j) = V_k(i, j) + A_k(i, j) \tag{11}$$
where $F$ represents the total fusion feature map of size 512×7×7, k = 1,2,3,…,512 represents the feature map channels, with a feature map size per channel of 7×7 (the dimension remains unchanged after the feature map fusion process completes). $V_k(i, j)$ represents the pixel $(i, j)$ value for the k-th channel in the VGG16 feature map, i, j = 1,2,3,…,7, and $A_k(i, j)$ denotes the pixel $(i, j)$ value of the upsampled Alexnet network feature map for the k-th channel.
Maximum fusion (MF): This method compares the corresponding pixels of the same dimension in two feature maps and selects the largest one as the fused pixel. The maximum fusion formula is:
$$F_k(i, j) = \max\left(V_k(i, j), A_k(i, j)\right) \tag{12}$$
Improved Wavelet transform fusion (IWTF): This scheme performs wavelet transformation on two original images, transforms them into high-frequency and low-frequency image signal components, then fuses these components of different feature domains to obtain a new wavelet tower. Finally, it obtains the fused result through an inverse wavelet transform.
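Before the wavelet variants are detailed, the two element-wise rules (SF and MF) can be illustrated with a short sketch. The helper `fuse` and the tiny two-channel 2×2 inputs are hypothetical stand-ins for the 512×7×7 VGG16 and upsampled Alexnet feature maps.

```python
def fuse(vgg_maps, alex_maps, rule):
    """Apply a pixel-wise fusion rule to two feature maps of identical shape.

    Both inputs are lists of channels, each channel a 2D list; the output keeps
    the same number of channels and the same spatial size.
    """
    return [[[rule(v, a) for v, a in zip(v_row, a_row)]
             for v_row, a_row in zip(v_channel, a_channel)]
            for v_channel, a_channel in zip(vgg_maps, alex_maps)]


# Hypothetical 2-channel, 2x2 feature maps.
vgg = [[[1.0, 2.0], [3.0, 4.0]],
       [[0.0, -1.0], [5.0, 2.0]]]
alex = [[[0.5, 3.0], [1.0, 4.0]],
        [[2.0, -3.0], [1.0, 2.5]]]

sum_fused = fuse(vgg, alex, lambda v, a: v + a)  # sum fusion (SF)
max_fused = fuse(vgg, alex, max)                 # maximum fusion (MF)
print(sum_fused)
print(max_fused)
```

Both rules preserve the channel count and spatial size, so the fused map can be passed to global average pooling unchanged.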
Wavelet transform fusion (WTF) offers a very appealing reconstruction ability, ensuring neither information loss nor redundant information in the signal's decomposition process [50]. Let the coefficients of images A and B be $C_A^{i}(x, y)$ and $C_B^{i}(x, y)$ after i-layer wavelet decomposition, and the coefficients corresponding to the image after fusion be $C_F^{i}(x, y)$, where $(x, y)$ indicates the coefficient location. Each set of coefficients comprises the low-frequency coefficient of the image in the i-th layer and the high-frequency coefficients of the image in each direction of the i-th layer. The low-frequency information includes the image's outline, and the high-frequency information includes the image's details. The traditional wavelet transform fusion method uses weighted average fusion at low frequencies (Eq 13) and selects the coefficient with the largest absolute value at high frequencies (Eq 14). In Eq 15, F represents the total fusion feature map of size 512×7×7, with k = 1,2,3,…,512 representing the feature map channels of size 7×7 per channel. After the feature map fusion process completes, its dimension is preserved. $V_k$ and $A_k$ denote the feature map under the k-th channel for VGG16 and Alexnet, respectively. The corresponding feature map channels generated by VGG16 and Alexnet undergo a wavelet transform fusion (Eq 15).
This paper employs the db4 wavelet to decompose the original image into three wavelet layers and obtain the image's high- and low-frequency coefficients. The low-frequency coefficients are fused through a weighted average scheme, while for the high-frequency coefficients, the Canny operator is applied to perform edge detection, extract the edge area information, and reduce the amount of data in the subsequent image fusion. In the edge area, the area energy selects the high-frequency coefficients [51][52][53], and finally, the fused wavelet coefficients are subjected to the inverse wavelet transform to realize fusion.
The edge region extracted by the Canny operator is divided into M × N regions, with M = N = 2 in this work. Then, we employ Eq 16 to find the average wavelet energy of each area block and, finally, Eq 17 to determine the high-frequency coefficient. The fusion flow chart utilizing the wavelet transform is illustrated in Figure 7.
In Eqs 16 and 17, $E_A$ and $E_B$ represent the average wavelet energy, and $H_A(m, n)$ and $H_B(m, n)$ the high-frequency coefficients, of images A and B in the current area block, respectively. Eq 17 expresses the addition of the high-frequency coefficients of images A and B weighted by their average energy, while $H_F(m, n)$ is the high-frequency coefficient after fusion.
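The block-wise, energy-weighted fusion of high-frequency coefficients described above can be sketched as follows. This is only a sketch under stated assumptions: the average wavelet energy of a block is taken to be the mean of its squared coefficients, the fused coefficient is taken to be the energy-normalized weighted sum of the two inputs, and the Canny edge-detection step is omitted so that the whole coefficient map is treated as the edge region. The exact forms of Eqs 16 and 17 may differ, and all function names are hypothetical.

```python
def average_energy(block):
    """Mean squared coefficient of a 2-D block (assumed form of the average wavelet energy)."""
    values = [c for row in block for c in row]
    return sum(c * c for c in values) / len(values)


def fuse_high_frequency(coef_a, coef_b, m=2, n=2):
    """Split two equally sized coefficient maps into an m x n grid and fuse block by block."""
    rows, cols = len(coef_a), len(coef_a[0])
    fused = [[0.0] * cols for _ in range(rows)]
    row_edges = [rows * i // m for i in range(m + 1)]
    col_edges = [cols * j // n for j in range(n + 1)]
    for i in range(m):
        for j in range(n):
            r0, r1 = row_edges[i], row_edges[i + 1]
            c0, c1 = col_edges[j], col_edges[j + 1]
            if r0 == r1 or c0 == c1:
                continue  # empty block when the map is smaller than the grid
            e_a = average_energy([row[c0:c1] for row in coef_a[r0:r1]])
            e_b = average_energy([row[c0:c1] for row in coef_b[r0:r1]])
            total = e_a + e_b
            # Assumed weighting: each input contributes in proportion to its block energy.
            w_a = e_a / total if total > 0 else 0.5
            for r in range(r0, r1):
                for c in range(c0, c1):
                    fused[r][c] = w_a * coef_a[r][c] + (1.0 - w_a) * coef_b[r][c]
    return fused


# Hypothetical 4 x 4 high-frequency coefficient maps.
coef_a = [[2.0, 2.0, 0.0, 0.0],
          [2.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]
coef_b = [[0.0, 0.0, 3.0, 3.0],
          [0.0, 0.0, 3.0, 3.0],
          [1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0, 1.0]]
fused = fuse_high_frequency(coef_a, coef_b)
```

In this example, each block of the fused map is dominated by whichever input carries more energy in that block, which is the intended effect of weighting by area energy.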

K-fold cross-validation
K-fold cross-validation is a way to build a model and verify its parameters when only a small data set is available for a deep learning scheme. The sample data are combined into different training and validation sets, where the training set trains the model and the validation set evaluates the model's accuracy, which helps to prevent the model from overfitting [54]. K-fold cross-validation divides the sample data into K random subsets, where K-1 subsets are employed as the training set and the remaining one is the validation set. Since each of the K subsets serves as the validation set once, there are K possible splits, and the training and validation errors are calculated for each of them. Finally, the K training and validation errors are averaged to obtain the cross-validation errors [55], which are ultimately used to evaluate the model's performance. Figure 8 presents the K-fold cross-validation scheme. In our research, we set K = 5.
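The splitting and averaging procedure described above can be illustrated with a minimal, standard-library sketch. The helper names `k_fold_indices` and `cross_validation_error`, the fixed random seed, and the per-fold error values are hypothetical and serve only as an illustration; they are not the values reported in this paper.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition of the samples
    folds = [indices[i::k] for i in range(k)]     # k nearly equal subsets
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validation_error(fold_errors):
    """Average the per-fold errors into a single cross-validation error."""
    return sum(fold_errors) / len(fold_errors)


splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print(len(train), "training samples,", len(validation), "validation samples")

# Hypothetical validation errors of the five folds.
mean_error = cross_validation_error([0.03, 0.04, 0.02, 0.03, 0.03])
```

Each sample appears in exactly one validation subset, so every sample is used for validation once and for training K-1 times.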

Transfer learning and data augmentation
Transfer learning is used in deep learning to mitigate the model overfitting and poor robustness caused by insufficient data [56]. It transfers the parameters learned while training a model on one data set to a model for another data set through parameter sharing. This work considers VGG16 and Alexnet, pre-trained on the ImageNet dataset, as the base models. During this pre-training period, millions of parameters are learned to obtain generic visual features, which are fed to our convolutional neural network. In the proposed dual-CNN model fusion framework, we freeze the feature layer parameters of the VGG16 and Alexnet models, extract the feature layer features, and train only the parameters of the classifier block.
During the CNN training process, exploiting data sets with only a few samples per class causes the model to overfit and reduces its test accuracy and generalization ability. In this work, the aluminum profile data set originates from a factory that uses a digital camera for image collection, and thus the number of samples is insufficient. Therefore, data augmentation is applied to the original data set to expand it and improve the model's robustness [57,58]. Data augmentation methods include dimming, horizontal rotation, vertical rotation, and noise addition (Gaussian noise and salt and pepper noise). Adding noise simulates the low image quality caused by external factors during the actual image acquisition, transmission, and storage. We mainly use horizontal and vertical rotation and salt and pepper noise. The horizontal and vertical rotations turn the image by 180 degrees from left to right and from bottom to top, respectively. For the noise, the signal-to-noise ratio is set to 0.95, 0.9, and 0.75. We create 3568 horizontally rotated images, 3573 vertically rotated images, and 3571 images with salt and pepper noise. Examples of the three data augmentation methods are illustrated in Figure 9.
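The three augmentation operations used here can be sketched on a grayscale image stored as a nested list. Two assumptions are made: the horizontal and vertical rotations are interpreted as left-right and top-bottom mirroring, and the signal-to-noise ratio is interpreted as the fraction of pixels left unchanged, with the remaining pixels set to 0 (pepper) or 255 (salt) with equal probability. The function names and the seed are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror the image from left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image from bottom to top."""
    return [row[:] for row in image[::-1]]


def salt_and_pepper(image, snr, seed=0):
    """Keep roughly a fraction `snr` of the pixels; replace the rest with 0 or 255."""
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < snr:
                new_row.append(pixel)                              # pixel left untouched
            else:
                new_row.append(255 if rng.random() < 0.5 else 0)   # salt or pepper
        noisy.append(new_row)
    return noisy


# Hypothetical 2 x 3 grayscale image.
image = [[10, 20, 30], [40, 50, 60]]
augmented = [horizontal_flip(image), vertical_flip(image)]
augmented += [salt_and_pepper(image, snr) for snr in (0.95, 0.9, 0.75)]
```

Lower signal-to-noise ratios corrupt more pixels, so the three settings produce progressively noisier copies of each image.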

Experimental setup
This section introduces the visual acquisition equipment, model training environment, experimental data, and the quantitative evaluation indicators involved in the experiments. Figure 10 illustrates the designed image capturing device, which includes an ABB120 robotic arm, light shield, LED strip light source, background board, camera, and conveyor belt. The hood (0.25m × 0.25m × 0.65m) is designed to create a suitable lighting environment and avoid substantial light interference during image capturing. The LED strip light source, with adjustable brightness, improves the quality of the captured surface images. Experimental tests showed that an orange background plate enhances the contrast between the workpiece and the background. The image acquisition equipment uses a Hikvision industrial camera (model: MV-CE060-10UM) with a resolution of 3072 × 2048, which is installed 60mm under the hood.
The image acquisition process is as follows. The robotic arm uses an end effector to grab the aluminum profile workpiece (at a known position) and places it horizontally under the camera. The trigger time is set to capture the first image, and then the robotic arm changes the posture and position of the aluminum profile to a known orientation to capture the second image. After that, the end joint rotates 180 degrees to capture the third image, so that three images are captured per aluminum profile workpiece. Finally, the aluminum profile workpieces are sorted, the robotic arm places them on the conveyor belt, and the PLC controls the conveyor belt to move them to their designated position.
Figure 11 presents the collected surface images of the aluminum profile workpiece. Side 1 is the first image, taken horizontally under the camera when the robot arm grabs the aluminum profile workpiece. Side 2 is the second image, taken after the robot arm changes its posture, while Side 3 is the third image, taken after the end joint rotates 180 degrees.

Experimental development environment and dataset
For the trials, we utilize the TensorFlow deep learning framework. The CNN models are trained on an NVIDIA GeForce GTX 1060 6GB GPU, and the software environment is Python 3.8.3. The CNN model employs the Adam optimizer with a learning rate of 0.001 and a learning rate decay of 1e-5, a batch size of 16, and 50 epochs for the network training. The remaining parameters are the TensorFlow defaults.
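To give a feel for what these hyperparameters imply, the sketch below computes the effective learning rate at the end of training. It assumes that the learning rate decay follows the inverse-time rule of the legacy Keras optimizers, lr_t = lr_0 / (1 + decay · t), where t counts parameter updates; the text does not state which decay rule was used, so this is an assumption. The sketch also uses the 10280 training samples per fold reported for the data set, and the function name is hypothetical.

```python
import math


def decayed_learning_rate(initial_lr, decay, iteration):
    """Inverse-time decay (assumed rule): lr_0 / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


train_samples = 10280   # training samples per fold
batch_size = 16
epochs = 50

steps_per_epoch = math.ceil(train_samples / batch_size)  # parameter updates per epoch
total_steps = steps_per_epoch * epochs                    # updates over the whole training

final_lr = decayed_learning_rate(0.001, 1e-5, total_steps)
print(f"{steps_per_epoch} steps per epoch, {total_steps} updates, final lr = {final_lr:.6f}")
```

Under this assumption, the learning rate falls by roughly a quarter over the 50 epochs, so the decay acts as a mild annealing schedule rather than an aggressive one.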
The experiment exploits the images collected from the aluminum profile workpiece utilizing the image acquisition device of Figure 10. The entire data set includes three single defect sample types and one non-defect sample type, which adopt the classification categories of. Specifically, in Figure 12(a), the surface is smooth and flat, i.e., the Intact class. In Figure 12(b), an external force has affected the surface and the damaged area is large, i.e., the Bruise class. The small dirty spots on the surface in Figure 12(c) belong to the Dirty spots (DS) class. In Figure 12(d), irregular, unevenly distributed scratches appear on the surface, which is classified as the Scratch class. Table 1 presents the number of samples in each category. The dataset contains 14282 surface images, including the original collected images and the images after data augmentation. The data set is divided into a test set (1430 images) and a training set (12852 images) following a 1:9 ratio. The training set undergoes a 5-fold cross-validation process during which the training samples are 10280 and the validation samples are 2572.

Quantitative evaluation metrics
In the training and verification phase, the training effect is monitored by displaying the classification accuracy (CA) and cross-entropy loss (CEL) changes in real-time [59,60]. In the testing phase, the robustness and generalization of the model are verified through indicators such as confusion matrix, ACC, PPV, TPR, and F-score. All indicators are described in detail below.
Classification Accuracy (CA): The ratio of the correctly predicted samples to the total number of actual samples. The subsequent trials consider Training accuracy, Validation accuracy, and Test accuracy.
Cross-Entropy Loss (CEL): Assesses the gap between the actual and prediction classes. The experiments involve Training loss and Validation loss.
Confusion matrix: The predicted results of all categories and the real results are placed in the same table based on their category. This table highlights the number of correct and incorrect identifications per category.
We use the positive predictive value (PPV) or precision, true positive rate (TPR) or recall, F-score, and accuracy (ACC) to evaluate the model's performance [59]. These metrics are calculated utilizing Eqs 18-21, respectively. For class x, $TP_x$ represents the number of correctly predicted x, $PPV_x$ is $TP_x$ divided by the total number of predictions assigned to x, and $TPR_x$ is $TP_x$ divided by the total number of actual x. The F-score combines the PPV and TPR metrics into one metric using the harmonic mean, as detailed in Eq 20. Finally, Eq 21 gives the ACC definition, where n represents the number of categories and $N_x$ the number of samples in each category.
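The metrics described above can be computed directly from a confusion matrix, as in the following sketch. It assumes the convention that rows are true categories and columns are predicted categories; the function names and the 3 × 3 example matrix are hypothetical and are not results from this paper.

```python
def per_class_metrics(confusion):
    """Return PPV, TPR, and F-score per class; confusion[i][j] counts true class i predicted as j."""
    n = len(confusion)
    metrics = []
    for x in range(n):
        tp = confusion[x][x]                                   # correctly predicted x
        predicted_x = sum(confusion[i][x] for i in range(n))   # column total
        actual_x = sum(confusion[x])                           # row total
        ppv = tp / predicted_x if predicted_x else 0.0
        tpr = tp / actual_x if actual_x else 0.0
        f_score = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0  # harmonic mean
        metrics.append({"PPV": ppv, "TPR": tpr, "F": f_score})
    return metrics


def accuracy(confusion):
    """Correct predictions (diagonal) divided by the total number of samples."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical confusion matrix for three classes.
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 1, 9]]
metrics = per_class_metrics(confusion)
acc = accuracy(confusion)
average_ppv = sum(m["PPV"] for m in metrics) / len(metrics)
```

Averaging the per-class values, as done for `average_ppv`, gives the kind of average PPV, TPR, and F-score summary that is reported alongside ACC.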

Experiments results and analysis
This section describes and analyzes the experimental results. During the training process, the K-fold cross-validation method determines the best model by evaluating all models' performance (Section 4.1). We combine various interpolation and feature fusion methods for model training and testing and then analyze them to determine the optimal combination (Section 4.2). We also compare the performance indicators of various fusion frameworks during the training and testing process (Section 4.3), and finally, we analyze the current research deficiencies and propose future research directions (Section 4.4).

K-fold cross-validation and performance evaluation
The maximum pooling layer of the V5 part of the VGG16 model and that of the A4 part of the Alexnet model were selected as the feature map fusion positions during the experiments. Applying maximum pooling after convolution reduces redundant features and extracts texture features [61]. To upsample the feature map of the maximum pooling layer of the A4 part, we use weighted bilinear interpolation, while the wavelet transform fusion method is employed for feature map fusion. Table 2 shows the classification accuracy (CA) and cross-entropy loss (CEL) values of each fold of our proposed fusion scheme. According to Table 2, the average CA and CEL values during training are 0.977 and 0.109, respectively, while the corresponding results on the validation data set are 0.970 and 0.124, respectively. The best performance is attained in the fourth fold of the 5-fold process. Figure 13 illustrates the CA and CEL curves during training in the fourth fold. Specifically, after epoch 20, the validation accuracy and loss curves tend to stabilize, and there is no significant change in the accuracy and loss values, with the loss decreasing slowly throughout the training process. In this fold, the model's CA on the training and validation data sets is 0.983 and 0.971, while the CEL is 0.097 and 0.116, respectively. Throughout the experiments, the same cross-validation method is used to validate the competitor models. Section 4.2 details the experimental results obtained by evaluating three interpolation methods and four feature fusion schemes.
Figure 13. The accuracy and loss curves of the WBI and IWTF combination during the training process.

Comparison of different interpolation methods under different feature fusion methods
This section evaluates three interpolation and four feature fusion methods, which are cross-combined, with each combination applied to the proposed dual-CNN model fusion framework. The fused feature map is input into the classifier block for classification. Under the same experimental conditions, the performance of the models for each combination is compared on the training and test sets.
We use CA and CEL during the training process to analyze the model's performance under different combinations, with the specific data shown in Table 3. This table shows the CA and CEL of the optimal model obtained after 5-fold cross-validation of each combination during the training process, and Figure 14 shows the corresponding CA and CEL curves. Table 3 and Figure 13 highlight that the combination of weighted bilinear interpolation and improved wavelet transform works best in the training set. The CA of the model on the training and validation data sets is 0.983 and 0.971, respectively, and the CEL is 0.097 and 0.116, respectively.
Comparing the test accuracy of the various combinations on the test set, Figure 15 highlights that the combination of weighted bilinear interpolation and improved wavelet transform also works best, achieving a test accuracy of 0.951. Analyzing the results of the various combinations indicates that the improved wavelet transform combined with any interpolation method affords better performance than any other feature fusion method. Moreover, in most cases, weighted bilinear interpolation achieves higher accuracy than the other interpolation methods.

Comparison of the proposed fusion frameworks with other fusion frameworks
This section compares the proposed dual-CNN model fusion framework against the other two fusion frameworks presented in Section 2. All feature fusion methods are applied under the same experimental conditions as in the previous trials. Figure 16 presents the test accuracy of the various fusion frameworks. The proposed dual-CNN model fusion framework and the other two frameworks presented in Section 2 all achieve their highest accuracy when combined with the improved wavelet transform fusion strategy, namely 0.951, 0.944, and 0.939, respectively. For the data set utilized in this paper, we also compare the test accuracy of the three fusion frameworks under the same fusion method. In most cases, the proposed dual-CNN model fusion framework achieves a higher accuracy than the other two fusion frameworks.
Each column of the confusion matrix represents a predicted category, and the column total is the number of samples predicted as that category. Each row represents the true category, and the row total is the number of samples of that category. The individual values in each column show how many samples of each true category were predicted as that category. We evaluate the three fusion frameworks employing the improved wavelet transform fusion strategy on the test set of 1430 images, which comprises 369 images of the Intact class, 233 of the Bruise class, 436 of the DS class, and 392 of the Scratch class. Figure 17 depicts the confusion matrices of the three fusion frameworks. From Figure 17, we find that some scratches are easily misclassified as Intact, possibly because the scratches on the surface of the aluminum profile are not noticeable. Comparing the three confusion matrices of Figure 17, it is evident that the test accuracy of our proposed dual-CNN model fusion framework is higher than that of the other two fusion frameworks.
Table 4 presents the accuracy (ACC), average PPV, average TPR, and average F-score of the three fusion frameworks combined with the improved wavelet transform fusion strategy. The average PPV, average TPR, and average F-score are the corresponding metrics averaged over all categories. According to Table 4, the accuracy of our proposed architecture is higher than that of the other two fusion frameworks: its accuracy is 0.951, the average PPV is 0.949, the average TPR is 0.950, and the average F-score is 0.949.
As a recap, Figures 1, 2, and 3 present the three competing fusion architectures. In the proposed scheme (illustrated in Figure 3), we select the feature maps of the maximum pooling layers of the Alexnet model's A4 part and the VGG16 model's V5 part for feature fusion. Then, we use global average pooling instead of the dense layer to build the classifier block, which requires fewer trainable parameters and reduces the model's space dimensionality to avoid overfitting and improve classification accuracy.
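The parameter saving obtained by replacing a dense layer with global average pooling can be illustrated with a short sketch. It assumes, purely for illustration, that the classifier ends in a single fully connected layer mapping to the four classes; the actual classifier block may contain additional layers, and the function names are hypothetical. The 512 × 7 × 7 fused feature map size is the one stated for the fusion output.

```python
def global_average_pooling(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector of channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense_parameters(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Tiny hypothetical 2-channel feature map to show the pooling itself.
pooled = global_average_pooling([[[1.0, 2.0], [3.0, 4.0]],
                                 [[0.0, 0.0], [0.0, 8.0]]])

channels, height, width, n_classes = 512, 7, 7, 4
flatten_params = dense_parameters(channels * height * width, n_classes)  # flatten, then dense
gap_params = dense_parameters(channels, n_classes)                       # GAP, then dense
print(pooled, flatten_params, gap_params)
```

Because global average pooling itself has no trainable parameters, the output layer only needs one weight per channel and class instead of one per spatial position, channel, and class.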
The performance difference between our method and the first fusion framework is mainly due to the different feature fusion positions and classifier blocks. In the second fusion framework, part of the fused features originates from a processed image, i.e., one subjected to gradient processing. Figure 18 compares the accuracy of the dual-CNN model fusion framework and the single convolutional neural network frameworks. The figure indicates that the accuracy of the dual neural network after feature fusion is higher than that of a single neural network. The test accuracy of Alexnet is 0.908 and that of VGG16 is 0.926. After feature fusion, the two convolutional neural networks together perform better, achieving a test accuracy of 0.951, which is 0.043 higher than using the Alexnet model alone and 0.025 higher than using the VGG16 model alone. Compared with the three single convolutional neural network frameworks and the other two traditional dual-CNN model fusion frameworks, our dual-CNN model fusion framework improves the experimental accuracy.
Figure 18. Comparison of test accuracy between the dual-convolutional neural network model fusion framework and the single convolutional neural network frameworks.

Limitations and future work
The suggested dual-CNN model fusion framework is effective for the surface defect recognition and classification of aluminum profile workpieces, and the achieved experimental accuracy meets the requirements. However, it is limited to the identification and classification of a single defect. When multiple defects of different sizes exist on the aluminum profile workpiece surface, the dual-CNN model fusion framework classifies the workpiece according to the learned defect feature ratio and takes the defect with the largest ratio as the classification output, which may lead to incorrect classification results. Figure 19(a) shows the classification result when multiple defects exist. In this example, the scratches on the surface of the workpiece are evenly distributed and larger than the dirty spots, so Scratch is output as the final classification.
Future work should also include other single convolutional neural network frameworks in the feature fusion to form a multi-CNN fusion framework. The aim of this strategy is to recognize multiple defects simultaneously and to mark defect positions in order to segment and highlight the defective parts. Figure 19(b) shows the locations of multiple defects on the surface of the workpiece, where the red squares represent dirty spots and the green ones represent scratches.

Conclusions
This work considers aluminum defect detection and classification. Specifically, we propose an improved dual-CNN model fusion framework that extracts different features from the same input source by exploiting the pre-trained VGG16 and Alexnet models. Weighted bilinear interpolation ensures that the feature maps generated by the last maximum pooling layers of the Alexnet and VGG16 models have the same dimensions. The improved wavelet feature fusion strategy fuses the feature maps effectively, while global average pooling replaces the dense layer to construct the classifier block, which classifies and recognizes the aluminum profile's surface defects.
Additionally, we analyze the structure of the conventional dual-CNN model fusion framework and the role of its hidden layers. We also evaluate several traditional upsampling methods combined with feature fusion strategies and select the optimal combination (weighted bilinear interpolation and improved wavelet transform fusion) as the configuration of the framework proposed in this paper. During the experiments, data augmentation and transfer learning are employed to prevent overfitting, and K-fold cross-validation is used to evaluate the model's performance during the training process. Finally, we compare the proposed framework against traditional dual-CNN model fusion frameworks and single convolutional neural networks under the same experimental conditions. The classification accuracy of our framework on the test set is 0.951, while the two conventional dual-CNN model fusion frameworks achieve 0.944 and 0.939, VGG16 0.926, Alexnet 0.908, Inception v3 0.922, VGG19 0.929, Resnet50 0.915, and Resnet101 0.921. The experimental results highlight the contribution of the improved wavelet fusion strategy applied after the maximum pooling layers of the two models. Additionally, the experimental results indicate the effectiveness of employing global average pooling instead of a dense layer to build the classifier block.