Multimodal Fusion Convolutional Neural Network with Cross-attention Mechanism for Internal Defect Detection of Magnetic Tile

Detecting internal defects in magnetic tiles before mounting is essential. Currently, the magnetic tile manufacturing industry performs this task entirely by hand, which is inefficient and costly. In this work, we develop an intelligent system based on acoustic sound for internal defect detection of magnetic tile to overcome these drawbacks. Because the acoustic sound is non-Gaussian and non-stationary, a single modality of the data cannot achieve good accuracy. Therefore, we design a multimodal fusion convolutional neural network (MMFCNN) for internal defect detection of magnetic tile and train it in an end-to-end way. The proposed MMFCNN consists of three blocks, i.e., a feature extraction block, a feature fusion block and an internal defect detection block, whose purposes are to extract features from the generated modal data, fuse the multimodal feature maps and analyze whether the magnetic tile has internal defects, respectively. Moreover, to realize information interaction and emphasize more representative information at the feature extraction stage, we propose a novel attention mechanism, i.e., the cross-attention mechanism. Extensive experimental results demonstrate that the proposed MMFCNN is effective for internal defect detection of magnetic tile. Our code is available at https://github.com/Clarkxielf/Multimodal-Fusion-Convolutional-Neural-Network-for-Internal-Defect-Detection-of-Magnetic-Tile.


I. INTRODUCTION
Magnetic tile is a kind of arc-shaped permanent magnet and is the core component of the permanent magnet motor [1]. With the rapid development of automation technology, permanent magnet motors are widely used in automation equipment and intelligent devices. Therefore, the quality of magnetic tile plays a decisive role in the performance and service life of electromechanical products. As a kind of clean and cheap energy, magnetic tile has not only a wide variety but also a growing global market demand, especially in the field of electric vehicles. The defects of magnetic tile are mainly divided into two categories: external defects and internal defects. So far, researchers have proposed many methods to detect external defects based on machine vision technologies [2], [3]. Internal defects, by contrast, are invisible, which brings new challenges to their detection. Many factors cause internal defects in magnetic tiles, such as uneven raw materials, thermal shock and rapid cooling in the production process. Currently, in the magnetic tile manufacturing industry, internal defects are identified by experienced workers who listen intently to the excited sound. Such a process is extremely risky. For magnetic tile manufacturers, once the user detects internal defects in sold magnetic tiles, the whole batch is scrapped and recycled, which causes serious economic losses. More seriously, if a magnetic tile with internal defects is put into use, it is likely to cause safety accidents and casualties. Therefore, developing an automated system to detect internal defects is becoming increasingly urgent in the magnetic tile industry.
Nowadays, with the development of science and technology, abundant non-destructive testing technologies have been successfully developed for internal defects by scientists around the world, such as ultrasound, infrared imaging, acoustic emission and X-ray diffraction tomography [4]. Although these technologies have attained great success and are widely applied in many non-destructive testing scenarios, they are too costly to operate in an automatic way to match the agile manufacturing process for different kinds of magnetic tiles. Inspired by the manual operation in the magnetic tile manufacturing industry, we utilize the acoustic sound for internal defect detection of magnetic tile. This is because the characteristics of the acoustic sound for an object are closely linked to its physical structure vibration.
However, the acquired acoustic sound is generally nonlinear, non-Gaussian and non-stationary, which can seriously hinder the extraction and identification of the signal features related to internal defects, and these meaningful features are usually too weak to be discovered [5]. Therefore, many algorithms have been proposed to process acquired acoustic sounds, such as the wavelet packet transform (WPT) [6], the hidden Markov model (HMM) [7], principal component analysis (PCA) [8] and variational mode decomposition (VMD) [9]. But those algorithms rely on hand-crafted features, which require complex mathematical operations, a certain understanding of the extracted signals and a wealth of signal processing knowledge. More importantly, such hand-crafted features generally work well only for specific signals and fault scenarios and are probably not applicable to diverse types of time series and different operating conditions. To address this issue, it is preferable to design an end-to-end algorithm that analyzes the acoustic sounds of objects without much expert knowledge. Therefore, deep learning (DL) [10]-[13] is a good choice for such a situation.
As a special machine learning model, deep learning techniques are structured by a stack of multiple layers of nonlinear processing units. It shows excellent performance in many fields, e.g., image classification [14], target recognition [15], semantic segmentation [16], natural language processing [17], machine translation [18], and so on. Compared with the traditional machine learning algorithms, DL techniques are capable of intelligently learning underlying features from large and diverse data, which escapes from the dilemma of hand-crafted feature design. Especially, convolutional neural networks (CNNs) are the most widely used to extract meaningful features. From AlexNet [19] to ResNet [20], the depth of CNNs becomes deeper and deeper, and the number of parameters becomes larger and larger. AlexNet uses Rectified Linear Unit (ReLU) [21] to replace the traditional activation function to solve the gradient dispersion problem, and adopts Dropout to prevent overfitting of the model. VGG [22] stacks multiple small convolution kernels to replace a large convolution kernel, which can significantly improve the learning ability of the network. This is because the nonlinear ability of multiple small convolutional kernels is stronger than that of a larger convolutional kernel. GoogleNet [23] performs multiple convolutional operations with different kernel sizes on features in parallel to learn multi-scale representation information. ResNet introduces the residual shortcut to solve the gradient disappearance problem of deep network, which strengthens the information interaction between adjacent residual blocks. Later, more and more lightweight CNNs [24], [25] are proposed to reduce the inference time of the model without compromising the performance. 
Although the aforementioned CNNs show good performance in classification tasks, they are not directly applicable to the internal defect detection of magnetic tile based on acoustic data because of the non-Gaussian and non-stationary characteristics of the acoustic sound. Moreover, the unknown size, shape and location of internal defects also increase the difficulty of extracting effective features embedded in the acoustic sound. Therefore, extracting features only from the time-domain acoustic sound cannot completely characterize the internal defects of the magnetic tile, since the acoustic sound in the time domain only reflects the fluctuation of sound energy over a period of time.
Currently, researchers have carried out many studies on deep-learning-based multimodal fusion for the classification of time-series signals [26]-[32]. According to the inputs, multimodal fusion is divided into two categories. In the first, the inputs include various signals, i.e., voice, text, image or data from different sensors. For example, Wang et al. [32] proposed a new deep learning-based prognostics framework for predicting the remaining useful life of machinery, which utilizes monitoring data from different sensors as the inputs of the prognostics network so as to integrate the complete degradation information. In the second, the inputs are multiple transformed versions of one signal. Ahmad et al. [29] proposed two efficient multimodal fusion networks for electrocardiogram (ECG) heart beat classification, whose inputs are images of the Gramian Angular Field, Recurrence Plot and Markov Transition Field. Liang et al. [31] proposed a new methodology of parallel convolutional neural network (P-CNN) for bearing fault identification, which is capable of extracting features from the time domain and time-frequency domain of the raw vibration signal. However, most previous works simply stack the extracted features of each modality for fusion, without considering the relative importance of the features. Although a few researchers assign weights to the features of each modality, they ignore the differences among cross-modal features.
Therefore, based on the latter fusion method, a novel CNN framework termed MMFCNN is proposed for internal defect detection of magnetic tile in this article. Its inputs are the raw time-domain signal, the frequency-domain signal obtained by the fast Fourier transform (FFT) and the time-frequency-domain signal yielded by the spectrogram transform. In the proposed MMFCNN, a multi-branch feature extraction strategy is developed to learn high-dimensional representations from the time domain, frequency domain and time-frequency domain of the acoustic data. Then, the cross-attention mechanism is introduced into MMFCNN to realize the information interaction among the branches and emphasize more representative features at the feature extraction stage. Next, the feature fusion block integrates the high-level information from the feature extraction block. Finally, the integrated representations are fed into an internal defect detection block for internal defect detection of magnetic tile. The main contributions of this article can be summarized as follows.
1) The MMFCNN architecture is built from three parallel CNN branches, which can extract efficient representations from three different domains of the raw acoustic data.
2) The cross-attention mechanism is introduced into the MMFCNN to focus on more important features and exchange information among branches.
3) An intelligent system for internal defect detection of magnetic tile is developed. This system can automatically and efficiently classify magnetic tiles. The detection speed of the system shall reach at least 40 magnetic tiles per minute. Therefore, it has great practical value for the magnetic tile industry.
The rest of the article is organized as follows. In Section Ⅱ, details of the proposed MMFCNN are elaborated. In Section Ⅲ, the method of internal defect detection of magnetic tile is given. Section Ⅳ presents the experimental results, and Section Ⅴ concludes this article.
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3180725

II. PROPOSED MMFCNN
In this work, our goal is to build an intelligent network that is capable of disclosing the mapping relationship between acoustic data and defect labels (whether there are internal defects in magnetic tiles). However, because the acoustic sound is non-Gaussian and non-stationary, adopting a single modality of the acoustic sound for internal defect detection of magnetic tile cannot achieve good accuracy. To overcome this problem, we design a multi-branch neural network to extract features from the time domain, frequency domain and time-frequency domain.
The architecture of the proposed MMFCNN is illustrated in Fig. 1. It consists of a feature extraction block, a feature fusion block and an internal defect detection block. The raw acoustic data, collected by a sound acquisition sensor, are transformed by the Fourier transform and the spectrogram. The raw acoustic data together with the two kinds of transformed data are first input into the feature extraction block to learn multidimensional representations. Then, the high-dimensional representations are fed into the feature fusion block to fuse the useful information that each modality contributes. Finally, we input the fused features into the internal defect detection block to analyze whether the magnetic tile has internal defects. The details of MMFCNN are described as follows.

A. FEATURE EXTRACTION BLOCK
The feature extraction block is structured as three streams, each of which consists of several CNN layers. The architecture of each stream in MMFCNN is shown in Table Ⅰ. In particular, to focus on the important information and effectively fuse the complementary features, the cross-attention mechanism is placed after the convolutional module.
1) Convolutional Module: In the constructed MMFCNN, the convolutional module is built from a series of CNN layers. A convolutional layer first convolves the input data with several learnable convolutional kernels and then applies an elementwise nonlinear activation function to the outputs of the convolution operations. To avoid overfitting, batch normalization (BN) [33] is embedded in the convolutional layer. Through these three operations, different feature maps are obtained in a convolutional layer. Because the signals in the time domain and frequency domain are one-dimensional (1-D) sequences, 1-D convolution is used to extract their features. After the signal is transformed by the spectrogram, on the other hand, the output is an RGB image containing time domain and frequency domain information, so 2-D convolution is utilized to learn the representation of the spectrogram. The principle of 1-D convolution and 2-D convolution is the same. For convenience, we denote the time domain, the frequency domain and the time-frequency domain as $D_T$, $D_F$ and $D_{TF}$, respectively. Mathematically, the $n$th output feature map of the $l$th convolutional layer can be expressed as
$$x_n^l = \varphi\Big(\mathrm{BN}\Big(\sum_{c=1}^{C} w_{n,c}^{l} * x_c^{l-1} + b_n^{l}\Big)\Big)$$
where $w_{n,c}^{l}$ denotes the convolutional kernel, $b_n^{l}$ denotes the bias, $*$ denotes the convolutional operation, $\varphi(\cdot)$ is the activation function, and $C$ represents the number of input channels.
To extract the main features of the convolutional operation and increase the receptive field, a pooling layer is optionally used after the convolutional layer. As an independent neural layer, the pooling operation has no parameters and is used to filter out unnecessary characteristics and preserve vital representations. As a result, the obtained feature maps cover the significant information of the raw data. Mathematically, the $n$th feature map of the $l$th pooling layer, $y_n^l$, can be expressed by
$$y_n^l = \mathrm{pool}(x_n^l, k, s)$$
where $x_n^l$ is the $n$th output feature map of the $l$th convolutional layer, $\mathrm{pool}(\cdot)$ is the max pooling operation, $k$ is the pooling kernel size, and $s$ represents the stride of the pooling kernel.
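As a minimal sketch of the pooling operation described above, the following pure-Python function applies max pooling with kernel size `k` and stride `s` to a 1-D sequence. The function name, the absence of padding and the sample values are assumptions made for illustration; they are not taken from the authors' code.

```python
def max_pool_1d(x, k, s):
    """Max pooling over a 1-D sequence with kernel size k and stride s (no padding)."""
    if k <= 0 or s <= 0:
        raise ValueError("k and s must be positive")
    # Slide a window of length k in steps of s and keep the largest value per window.
    return [max(x[i:i + k]) for i in range(0, len(x) - k + 1, s)]


feature_map = [0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.5]  # illustrative values only
pooled = max_pool_1d(feature_map, k=3, s=2)
print(pooled)  # [0.9, 0.8, 0.8]
```

With no padding, an input of length N yields floor((N - k) / s) + 1 outputs, which is 3 for the seven-sample example.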
It is worth noting that the dimensions of the feature maps output by the convolutional modules of the three branches are not the same, which complicates the subsequent information interaction and feature fusion. Therefore, we flatten the feature maps generated by the 2-D convolution module along the channel and then apply a 1-D convolution operation. The output feature map $F_z$ is thus the 1-D convolution of the flattened 2-D feature maps, where $f(\cdot)$ denotes the flattening operation.
2) Cross-attention Mechanism: In essence, the attention mechanism in CNN is similar to the human selective visual attention mechanism, and its core goal is to select, from a large amount of information, the information that is most critical to the current task. Specifically, introducing an attention mechanism in CNN emphasizes more representative features that are relevant to the internal structure of magnetic tile while restraining inessential information. In addition, to realize the information interaction among feature extraction branches, a novel module named the cross-attention mechanism is introduced into our designed MMFCNN. Moreover, given the differences among cross-modal features, the cross-attention mechanism can align incompatible features in the fused feature space. As shown in Fig. 2, it consists of two blocks: channel-wise attention and a feature interaction mechanism [34].
a) Channel-wise attention: In CNN, each channel of the feature maps is the activation response corresponding to a convolution kernel, and introducing a channel-wise attention mechanism into CNN can be regarded as a process of selecting semantics [35], which learns the weight of each channel and improves the representation performance of convolution features by suppressing irrelevant features. In channel-wise attention, an adaptive average pooling is first carried out after the convolutional module. Then, the pooled result is forwarded into a multilayer perceptron (MLP) with two layers, which yields a feature vector. Last, the output feature vector is fed into the sigmoid activation function to obtain the channel-wise attention vector. It can be calculated by
$$A = \sigma(\mathrm{MLP}(\mathrm{pool}(F)))$$
where $A$ is the channel-wise attention vector, $\sigma(\cdot)$ represents the sigmoid activation function, $\mathrm{pool}(\cdot)$ denotes the global average-pooling (GAP), and $F$ is the output feature map of the convolutional module.
b) Feature interaction mechanism: The global feature of a single modality is crucial for classification. The feature interaction mechanism is introduced to take advantage of this global information, so that each branch contains important information from the other branches. Given feature maps ,,
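The channel-wise attention step (global average pooling, a two-layer MLP, then a sigmoid) can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the ReLU between the two MLP layers, the final channel re-weighting, the layer sizes and the random weights are all hypothetical choices.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(vec, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def channel_attention(feature_map, w1, b1, w2, b2):
    """feature_map: list of C channels, each a list of L values.
    Returns (attention vector with entries in (0, 1), re-weighted feature map)."""
    # Global average pooling: one scalar per channel.
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # Two-layer MLP; the ReLU between the layers is an assumption of this sketch.
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]
    logits = dense(hidden, w2, b2)
    attention = [sigmoid(z) for z in logits]
    # Scaling each channel by its weight is the usual way such a vector is applied (assumed here).
    scaled = [[a * v for v in ch] for a, ch in zip(attention, feature_map)]
    return attention, scaled


random.seed(0)
C, L, H = 4, 6, 2  # channels, length, hidden units (hypothetical sizes)
fmap = [[random.random() for _ in range(L)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
w2 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(C)]
att, out = channel_attention(fmap, w1, [0.0] * H, w2, [0.0] * C)
print([round(a, 3) for a in att])
```

Each entry of `att` lies strictly between 0 and 1, so a channel is attenuated rather than removed.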

B. FEATURE FUSION BLOCK
As mentioned above, a single modality of the acoustic sound cannot realize the task of internal defect detection of magnetic tile well. Therefore, a feature fusion strategy that makes use of the different useful information in the generated modal data is embedded into our proposed MMFCNN. For multi-dimension feature maps, there are many ways to perform multimodal feature fusion, including the max, mean, sum and concatenation operations. The concatenation fusion operator can be written as
$$F^{cat} = \mathrm{cat}(F_x, F_y, F_z)$$
where $F^{cat}$ is the output of the concatenation fusion operator applied to the feature maps of the three branches, $\mathrm{cat}(\cdot)$ is the concatenate operation, $F_x$, $F_y$ and $F_z$ represent the output feature maps of the cross-attention mechanism, and $C$ is the number of channels of each branch's feature map.
The sum fusion operator calculates the sum of the feature maps at the same spatial locations. Since the importance of each modal feature map is unclear, we assign learnable parameters $\alpha$, $\beta$ and $\gamma$ to the feature maps of the three generated modal data, which represent the weight of each feature. Mathematically, it can be expressed as
$$F^{sum}(a, c) = \alpha F_x(a, c) + \beta F_y(a, c) + \gamma F_z(a, c)$$
where $a$ represents the spatial location and $c$ is the $c$th channel of the feature maps. The details of the learnable parameters $\alpha$, $\beta$ and $\gamma$ are described in Section Ⅳ (D).
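The four fusion operators named in this subsection can be compared on a toy example. The sketch below assumes the three branch outputs already share the same shape (channels by positions); the function name `fuse`, the sample values and the three weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def fuse(fx, fy, fz, mode, weights=(1.0, 1.0, 1.0)):
    """Fuse three feature maps of identical shape (C channels x L positions)."""
    if mode == "cat":
        # Concatenate along the channel axis: the result has 3 * C channels.
        return fx + fy + fz
    wa, wb, wc = weights
    fused = []
    for cx, cy, cz in zip(fx, fy, fz):
        row = []
        for a, b, c in zip(cx, cy, cz):
            if mode == "max":
                row.append(max(a, b, c))
            elif mode == "mean":
                row.append((a + b + c) / 3.0)
            elif mode == "sum":
                # Weighted sum; the weights play the role of the learnable parameters.
                row.append(wa * a + wb * b + wc * c)
            else:
                raise ValueError(f"unknown mode: {mode}")
        fused.append(row)
    return fused


fx = [[1.0, 2.0], [3.0, 4.0]]  # illustrative values only
fy = [[2.0, 0.0], [1.0, 5.0]]
fz = [[0.0, 3.0], [4.0, 4.0]]
print(fuse(fx, fy, fz, "max"))                      # [[2.0, 3.0], [4.0, 5.0]]
print(fuse(fx, fy, fz, "sum", (0.5, 0.25, 0.25)))   # [[1.0, 1.75], [2.75, 4.25]]
print(len(fuse(fx, fy, fz, "cat")))                 # 6
```

Only concatenation keeps the three modalities in separate channels; the other three operators merge them element by element and keep the channel count unchanged.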

C. INTERNAL DEFECT DETECTION BLOCK
The specifically designed internal defect detection block consists of four fully connected layers (FCLs), which contain 2048, 512, 128 and 2 neurons, respectively. The first three fully connected layers are followed by Dropout and ReLU. The feature map of the feature fusion block is flattened before being fed to the four FCLs.

D. LOSS FUNCTION
Essentially, the internal defect detection of magnetic tile is a binary classification problem. Therefore, the binary cross-entropy loss is chosen as the loss function. It is defined as
$$L = -\big(\hat{p}\log p + (1 - \hat{p})\log(1 - p)\big) \tag{11}$$
where $p$ denotes the probability that the predicted result is a positive example (without internal defects), and $\hat{p}$ represents the label of the sample. If the sample is a positive example, $\hat{p}$ is 1; otherwise, it is 0.
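A small numeric check of the binary cross-entropy loss for a single sample is given below. The clamping constant `eps` and the function name are assumptions added for numerical safety in this sketch; they are not part of the loss definition.

```python
import math


def binary_cross_entropy(p, label, eps=1e-12):
    """Binary cross-entropy for one sample; p is the predicted probability of the positive class."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption of this sketch
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


print(round(binary_cross_entropy(0.9, 1), 4))  # 0.1054
print(round(binary_cross_entropy(0.9, 0), 4))  # 2.3026
```

A confident correct prediction (0.9 for a positive sample) costs little, while the same confidence on a negative sample is penalized about twenty times more.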

A. SYSTEM SETUP
As shown in Fig. 3, we designed an intelligent detection system for internal defects of magnetic tiles, which automatically collects sound and sends it to the computer for prediction, and then feeds the prediction results back to the classification system to separate magnetic tiles with and without internal defects. This system consists of five parts, namely, the transportation system, excitation system, sound acquisition system, internal defect detection system and sorting device. The composition of each part is as follows. The transportation system consists of three parallel conveyor belts. The first two conveyor belts carry the magnetic tiles in an upright and a transverse posture, respectively, and the third conveyor belt transports the sampled magnetic tile to the designated position for sorting. The excitation system is essentially a mechanical arm, which grasps the magnetic tile, lifts it to a height of about two centimeters and then drops it onto an iron block to generate sound. The sound acquisition system is a data acquisition card with a microphone. The internal defect detection system is application software whose detection process calls the prediction program based on MMFCNN. Last, the sorting device is composed of two cylinders, which remove broken magnetic tiles (a magnetic tile with an obvious internal crack breaks easily after the collision) and magnetic tiles with internal defects, respectively.

FIGURE 3. Scheme of internal defect detection system.
The working principle of this system is summarized as follows. At first, the mechanical arm goes down to grab the transverse magnetic tiles to a fixed height. Then, sound acquisition system collects the sound generated by the magnetic tile colliding with the iron block after falling. Finally, the collected sound is input to the designed model for prediction, and the prediction results are fed back to the sorting device for classification.

B. DATASET CONSTRUCTION
Before training a deep neural network, it is essential to obtain data and label the corresponding labels. However, the internal defects of magnetic tile are not as obvious as the surface defects. In industry, the internal defect detection process in magnetic tile mainly depends on the hearing of experienced workers. They distinguish magnetic tiles with internal defects from the sounds generated by the magnetic tile colliding with the iron block.
To ensure that the trained model can be applied well to the designed equipment, we use the designed device to sample the acoustic data of magnetic tiles that were labelled by experienced workers in advance. As for the sampling parameters, the sampling frequency is set to 40 kHz and 7000 data points are recorded for each sound. In the end, we obtained 1241 magnetic tile samples, including 730 samples with internal defects and 511 normal samples. The split of the dataset is shown in Table Ⅱ. Samples with internal defects were labelled as "Defective"; all other samples were labelled as "Normal".

C. DATA PROCESSING
Data processing is critical to the model training. In this work, there are three main data processing methods, i.e., data normalization, FFT and spectrogram transform. Data normalization is helpful to adjust the learning rate and accelerate the convergence speed. And the data transformed by FFT and spectrogram will be used as the input of the proposed MMFCNN together with the raw acoustic data. The details of these data processing methods are as follows.
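In this work, we adopt the min-max normalization method, which in its standard form is
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the signal $x$.
(15) where $M$ is the length of the frame. To show the importance of each frame of data, we calculate the energy of each frame with the following formula.
$$E_i = \sum_{n=0}^{M-1} x_i[n]^2$$
where $x_i[n]$ represents the $i$th frame of the acoustic sound.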
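Two of the processing steps mentioned here, min-max normalization and per-frame energy, can be sketched in a few lines. The sketch assumes the standard [0, 1] form of min-max scaling and non-overlapping frames; the function names, the frame length, the hop and the sample values are hypothetical and are not taken from the authors' settings.

```python
def min_max_normalize(x):
    """Scale a sequence linearly to the range [0, 1]."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0 for _ in x]  # constant signal: map to zeros (a convention chosen here)
    return [(v - lo) / (hi - lo) for v in x]


def frame_energies(x, frame_len, hop):
    """Split x into frames of length frame_len (step hop) and return each frame's energy."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    # Energy of a frame = sum of squared samples.
    return [sum(v * v for v in frame) for frame in frames]


signal = [2.0, 4.0, 6.0, 8.0, 10.0, 4.0]  # illustrative values only
norm = min_max_normalize(signal)
print(norm)                                      # [0.0, 0.25, 0.5, 0.75, 1.0, 0.25]
print(frame_energies(norm, frame_len=2, hop=2))  # [0.0625, 0.8125, 1.0625]
```

Setting `hop` smaller than `frame_len` gives overlapping frames, which is the more common choice when building a spectrogram.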

IV. EXPERIMENTAL RESULTS
The proposed MMFCNN is trained on 4 NVIDIA GeForce RTX 2080 Ti GPUs using the PyTorch deep learning framework. The initial learning rate is 0.01 and decays by a factor of 10 every 20 epochs during the last 40 epochs. The synchronous SGD optimizer is adopted with weight decay 1e-5 and momentum 0.9. The total number of epochs is 200 and the mini-batch size is 32. As a binary classification problem, the mapping relationship between acoustic data and internal defects of magnetic tile is relatively simple. The number of collected samples is sufficient to achieve good generalization performance, so no data augmentation technique is used. To make the experimental results more persuasive, each network is run five times, and the final results are reported as the mean and standard deviation. In the comparative experiments (B, C, D), we do not add any attention mechanism.
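One reading of the learning-rate schedule above is a constant rate of 0.01 for the first 160 epochs, followed by a division by 10 at epoch 160 and again at epoch 180. The sketch below encodes that interpretation; the exact epochs at which the decay is applied are an assumption, since the text does not state them, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, total_epochs=200, decay_span=40, step=20, factor=10.0):
    """Step schedule: constant until the last decay_span epochs, then divide by factor every step epochs."""
    start = total_epochs - decay_span        # first epoch of the decay phase (160 here)
    if epoch < start:
        return base_lr
    n_decays = (epoch - start) // step + 1   # 1 for epochs 160-179, 2 for epochs 180-199
    return base_lr / factor ** n_decays


for e in (0, 159, 160, 179, 180, 199):
    print(e, learning_rate(e))
```

Under this reading the rate takes the values 0.01, 0.001 and 0.0001 over the course of training.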

A. DATA DIFFERENCE
Fig. 4 visualizes the acoustic data of normal and defective magnetic tiles in the three domains. In the time domain, the signal represents the fluctuation of sound energy over a period of time. As can be seen from the first column in Fig. 4, it is quite difficult to distinguish the acoustic data of normal magnetic tiles from those of defective ones, which is not conducive to accurate classification. We therefore convert the time domain signal to the frequency domain and the time-frequency domain. Because the signal after FFT is symmetrical, we use only half of the data to avoid information redundancy. In the frequency domain, the signal shows the frequency distribution of each component wave. For both defective and normal magnetic tiles, the dominant frequencies of the sound signals are mainly distributed around 7500 Hz, 12000 Hz and 16500 Hz. However, for defective tiles, the curve of the frequency domain signal contains more small peaks than that of normal tiles. These small peaks are caused by the internal defects of the magnetic tile. The spectrogram shows the relationship between the energy and the frequency of the acoustic sound. As can be seen from the third column in Fig. 4, there are several bright lines in the spectrogram, which represent the multiple dominant frequencies of the acoustic sound and its high-energy areas. By comparison, the area near the bright lines in the spectrogram of a defective magnetic tile is brighter than that of a normal magnetic tile, which corresponds to the distribution of the sound signal in the frequency domain.
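The remark that only half of the FFT output is used follows from the symmetry of the spectrum of a real-valued signal. The sketch below illustrates this with a direct discrete Fourier transform on a short synthetic tone (not real tile data) and derives the frequency resolution implied by the stated sampling parameters. The function name is hypothetical, whether the authors keep N/2 or N/2 + 1 bins is not stated, and a real pipeline would use an FFT routine rather than this O(N^2) loop.

```python
import cmath
import math


def half_spectrum(x):
    """Magnitudes of DFT bins 0..N//2 of a real sequence (the remaining bins mirror these)."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
        mags.append(abs(s))
    return mags


fs = 40_000       # sampling frequency in Hz, from the dataset description
n_points = 7_000  # samples per recording, from the dataset description
print(fs / n_points)  # frequency resolution in Hz per bin (about 5.71)
print(n_points / fs)  # recording duration in seconds (0.175)

# Toy check on a synthetic tone: 8 full cycles in 64 samples should peak at bin 8.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
mags = half_spectrum(tone)
print(mags.index(max(mags)))  # 8
```

With 7000 samples at 40 kHz, the half spectrum spans 0 to 20 kHz, which covers the dominant frequencies reported around 7500 Hz, 12000 Hz and 16500 Hz.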

B. COMPARISON WITH CLASSICAL CNNS
In this article, to demonstrate the superiority of our proposed MMFCNN, we compare our model with three well-known networks, i.e., AlexNet, VGG-16 and ResNet-18. These models show state-of-the-art performance in the field of image classification. In addition, three MMFCNN variants (MMFCNN-A, MMFCNN-V and MMFCNN-R), whose backbones are the aforementioned three networks, are compared. Table Ⅲ summarizes the performance comparison of the proposed MMFCNN and the aforementioned CNNs in the internal defect detection of magnetic tiles. As shown in Table Ⅲ, the proposed MMFCNN reaches an accuracy rate of 98.16%, which is higher than that of the aforementioned CNNs, whose maximum accuracy rate is 97.68%. This demonstrates that extracting only the time-domain characteristics of the sound signal cannot achieve good results in predicting the internal defects of magnetic tile and that our proposed MMFCNN is relatively superior. Besides, making the CNNs deeper does not significantly improve the accuracy rate.

C. EFFECTIVENESS OF FEATURE FUSION
In this article, to verify that each modality contributes to the internal defect detection of magnetic tile and that feature fusion is effective, seven architectures are compared. For simplicity, the time domain, frequency domain and time-frequency domain are referred to as T, F and T-F, respectively, and all combinations of them are also considered, i.e., T+F, T+T-F, F+T-F and T+F+T-F. The data in T, F and T-F are each input to a single CNN for training. The data in T+F, T+T-F and F+T-F are fed into MMFCNN with two branches, and the data in T+F+T-F are input to MMFCNN with three branches. In the last four cases, the networks all use the concatenation operation for feature fusion. Table Ⅳ summarizes the performance comparison results. As shown in Table Ⅳ, the highest accuracy rate is 98.16%, achieved by the architecture that uses all three modalities as input. The accuracy rate of feature fusion is much higher than that of the single networks, which illustrates that feature fusion is effective. However, fusing the time domain and frequency domain performs worse than the single network in the frequency domain. This is because the features in the time domain and frequency domain are inconsistent, so simple feature stacking disorders the defect information. Another obvious result is that the more modalities are fused, the higher the prediction accuracy is, which shows that each modality contributes to the internal defect detection. This is because each modality describes internal defects from a different angle. Moreover, it demonstrates that features extracted from different modalities can supply extra information for internal defect detection of magnetic tiles.

D. FEATURE FUSION METHODS COMPARISON
In this experiment, feature fusion methods are explored. Among the four fusion methods, the max operation extracts the most salient feature, the mean operation balances the three types of features, the sum operation makes a suitable combination of features, and the concatenation operation integrates all defect features, which remain independent of each other in the fused features. For the sum fusion operation, we designed a subnetwork to regress three trainable weight parameters, i.e., $\alpha$, $\beta$ and $\gamma$, which are assigned to the feature maps of the corresponding modes. This subnetwork consists of four layers of processing units, i.e., a GMP and three convolutional layers. The GMP downsamples the feature map to size 1. The subsequent three convolution layers, with 1×1 kernel size, contain 256, 64 and 3 channels, respectively. Finally, three weight values are obtained through the softmax activation function. The experiment results are summarized in Table Ⅴ. As can be seen from Table Ⅴ, using the max or concatenation operation for feature fusion achieves the best result, with an accuracy rate of 98.16%. The mean operation is a little better than the sum operation, with accuracy rates of 98.08% and 97.60%, respectively. The accuracy rates of these fusion methods are very close, so accuracy alone cannot show which fusion method is better. To illustrate the impact and generalization performance of these four fusion methods on MMFCNN, the training loss and the validation accuracy rates based on the aforementioned fusion methods are shown in Fig. 5(a) and Fig. 5(b). As shown in Fig. 5, our proposed algorithm converges within 200 epochs with each of the four fusion methods. In particular, the concatenation operation converges faster than the others and obtains higher accuracy on the validation set after convergence.
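The weight sub-network described above (pooling to size 1, three 1×1 convolutions with 256, 64 and 3 channels, then softmax) can be sketched without a deep learning framework, because a 1×1 convolution applied to a length-1 feature map is the same computation as a fully connected layer. In the sketch below, GMP is read as global max pooling, and the input channel count, the ReLU between layers and the random weights are assumptions; none of these details are given in the text.

```python
import math
import random


def global_max_pool(fmap):
    """Reduce each channel (a list of values) to its maximum, giving a size-1 map per channel."""
    return [max(ch) for ch in fmap]


def pointwise_conv(vec, weights, bias):
    """A 1x1 convolution on a length-1 map is a matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_weights(fmap, layers):
    h = global_max_pool(fmap)
    for i, (w, b) in enumerate(layers):
        h = pointwise_conv(h, w, b)
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]  # ReLU between layers: an assumption of this sketch
    return softmax(h)


random.seed(0)
sizes = [12, 256, 64, 3]  # 12 input channels is hypothetical; 256, 64, 3 follow the text
layers = [([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
fmap = [[random.random() for _ in range(20)] for _ in range(sizes[0])]
weights = fusion_weights(fmap, layers)
print([round(w, 3) for w in weights], round(sum(weights), 6))
```

Because the last step is a softmax, the three weights are non-negative and sum to 1, so the weighted sum stays on the same scale as the individual feature maps.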

E. NECESSITIES OF CROSS-ATTENTION MECHANISM
The previous experiments show that using three branches with the concatenation fusion method achieves the best results. On this basis, the cross-attention mechanism is introduced into MMFCNN to demonstrate its effectiveness. In addition, the order of the components may have a great impact on the performance of MMFCNN. Therefore, experiments on exchanging the order of the components were carried out. Experimental results are shown in Table Ⅵ. It can be observed that using channel-wise attention followed by a feature interaction mechanism (cross-attention) improves the performance of MMFCNN most effectively, with an accuracy rate of 98.64%. This is because channel-wise attention enables the network to focus on the information of internal defects, and the feature interaction mechanism enables the network to associate information between two branches during feature extraction. By contrast, using a feature interaction mechanism followed by channel-wise attention gives a slightly worse result. This is because the locations of the defect information differ across modalities, and once the features of one mode integrate the global features of another mode, the channel-wise attention mechanism is disturbed. Using channel-wise attention or the feature interaction mechanism alone cannot improve the accuracy of our proposed algorithm. Judging from the training process, however, each of them accelerates the convergence of the network and makes the network more stable. Moreover, the confusion matrix of the validation set is given in Fig. 6. There are two reasons for the failure cases of prediction. On the one hand, long working hours lead workers to mislabel samples. On the other hand, these failure cases may contain tiny internal defects that make the tiles acoustically close to normal magnetic tiles.

F. FIELD VERIFICATION
Before the deployment of the model, we analyze the time complexity and inference time of MMFCNN. For time [...] Table Ⅶ. To verify the adaptability of our proposed model to the whole detection system, we simulate the detection process of internal defects of magnetic tile and build a similar, simplified system. An NI-9250 sound acquisition card equipped with a sound sensor and sound acquisition software written in LabVIEW are used to sample, in real time, the sound excited by the magnetic tile and an iron block. To avoid the influence of subjective factors, we test newly produced magnetic tiles with our system, and the same tiles are then inspected again by an experienced worker. Finally, the results of the two tests are compared. In total, 100 newly produced magnetic tiles are tested through our established system. The test results show that 93 pieces are normal and 7 pieces have internal defects. The results of manual detection are consistent with ours. This shows that our model has strong applicability.

Fig. 6. Confusion matrix of MMFCNN on validation set.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

G. INFLUENCE OF CONVOLUTIONAL PARAMETERS
Our backbone is based on AlexNet, and the parameters in Table I are similar to those of AlexNet. Because the time domain and frequency domain data are relatively sparse, small convolution kernels are used to extract neighborhood information from them. For the time-frequency domain spectrogram, a large convolution kernel is first used to extract a wide range of neighborhood information, and small convolution kernels are then used to extract high-dimensional features. To demonstrate the influence of the different parameters, we mainly discuss the number of filters. The numbers of filters in the five layers of our network are 1, 3, 6, 4 and 4 times that of the first layer. Keeping these multipliers constant, we set the number of filters in the first layer to 32, 64, 96 and 128 to illustrate the influence of the number of filters on the network performance. The corresponding configurations are denoted Conv1_X(32), Conv1_X(64), Conv1_X(96) and Conv1_X(128), respectively. The comparison results are shown in Table Ⅷ. As can be seen from Table Ⅷ, the number of filters in Table I gives the network the highest accuracy.
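The four configurations follow directly from the fixed multipliers. As a short worked example, the sketch below expands each first-layer setting into the filter counts of the five layers. The names `MULTIPLIERS`, `filters_per_layer` and `configs` are introduced here for illustration and do not come from the authors' code.

```python
# Per-layer multipliers relative to the first layer, as stated in the text.
MULTIPLIERS = (1, 3, 6, 4, 4)

def filters_per_layer(first_layer_filters):
    """Return the filter count of each of the five convolutional layers."""
    return [first_layer_filters * m for m in MULTIPLIERS]

# The four first-layer settings compared in the experiment.
configs = {f"Conv1_X({n})": filters_per_layer(n) for n in (32, 64, 96, 128)}

for name, filters in configs.items():
    print(name, filters)
```

For a first layer of 64 filters this gives 64, 192, 384, 256 and 256 filters, and the other three settings scale every layer by the same factor.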

V. CONCLUSION
In this work, a novel deep learning-based CNN named MMFCNN was proposed for the internal defect detection of magnetic tile. Based on this algorithm, a new intelligent system was developed, which can automatically acquire the sound and identify the internal defects of the magnetic tile. To take advantage of multimodal information, we used multiple branches to extract features separately and then carried out the feature fusion operation. Multiple feature fusion methods were also compared. Moreover, the cross-attention mechanism was constructed to realize information interaction among branches and to emphasize more representative features, which improves the performance of our model. To examine whether each module in the cross-attention mechanism is necessary, several ablation experiments were carried out. Extensive experimental results show that our model is effective for internal defect detection of magnetic tile.
In this article, we assume that the training set and the test set follow the same distribution and that the number of defective samples is comparable to that of normal samples. In the production process, however, normal magnetic tiles far outnumber defective ones. In addition, our model covers only one kind of magnetic tile, which is not enough for the magnetic tile industry. Moreover, the proposed network is relatively large, which limits the detection speed. Therefore, related transfer learning, few-shot learning and knowledge distillation models need to be studied in our future work.