1 Introduction

In Industry 4.0 environments, the supervision of industrial equipment must be automated to maintain the desired level of performance. Through fault diagnosis, the current status of the equipment can be determined and abnormal conditions can be identified. Slow, inaccurate, or ineffective fault diagnosis may degrade the performance of the entire system and cause unexpected losses. Although real-time diagnosis models allow abnormal conditions (faults) in equipment to be identified at an early stage, automatic fault diagnosis remains a challenging task because of the variety of equipment used in industrial environments [1].

In general, various Internet of Things sensors, such as acoustic emission, vibration, current, voltage, thermal, and pressure sensors, can be used to monitor equipment. All the sensor data are one-dimensional time-series signals in which information is represented through the signal amplitude, phase, and frequency (as depicted in Fig. 2 in Sect. 2.1). These data are used to identify the type, size, and location of the fault. There are three basic steps in fault diagnosis models based on sensor data. Step 1 comprises data collection from the equipment, step 2 involves the application of signal processing techniques to extract features from the sensor data, and step 3 is the use of classifiers in the final stage [2].

Various researchers have studied transformed-signal processing methods for extracting uniform texture information from fault signals under various conditions. In [3], the discrete orthonormal Stockwell transform was applied to extract identical 2D patterns for each fault signal. However, its high computational cost makes it infeasible for most applications, and its time–frequency space partitioning leads to symmetry loss. In [4], bi-spectrum-based higher-order analysis was utilized to extract distinct signal patterns under inconsistent working conditions. Although the traditional bi-spectrum representation retains the phase information and eliminates Gaussian noise, the results are unstable because of the randomly changing phase components of the signals. 2D acoustic spectral imaging [5] and local binary pattern (LBP) [6] techniques were applied to extract uniform patterns from signals. A limitation of LBP is that identical LBP codes can be generated for different structural patterns. The short-time Fourier transform (STFT) [7] and a Gabor filter with singular value decomposition (SVD) [8] were applied to acoustic emission signals to extract uniform texture patterns. However, both the STFT and the Gabor filter required long computation times.

These state-of-the-art machine learning methods have certain drawbacks. For example, the development of a feature extractor requires domain expertise and skill in signal processing techniques, and the extraction function does not follow the same protocol in every application.

Deep learning (DL) models are used in fault diagnosis, where complex deep features are extracted from the raw data using several hidden layers without human intervention. Xia et al. [9] proposed a convolutional neural network (CNN) architecture for fault detection and evaluated its performance on the Case Western Reserve University (CWRU) bearing dataset and a gearbox dataset. In [10], a deep neural network, a deep belief network, and a CNN were utilized for fault diagnosis and prognosis. The CNN did not encode the position and orientation of the objects and was invariant with respect to the input data. In [11], a novel fault diagnosis framework based on an end-to-end long short-term memory (LSTM) model was proposed to learn the features directly from multivariate time-series data and capture the long-term dependencies through the recurrent behavior and gating mechanisms of the LSTM. However, recurrent neural networks (RNNs) and LSTMs were difficult to train because their computation is memory-bandwidth-bound, with the LSTM requiring four linear MLP layers per cell. Peng et al. [12] proposed a model using a 1D CNN and 1D residual blocks. The model was experimentally shown to work well despite the presence of strong noise and variable loads.

The preceding studies demonstrate that DL models can perform automated deep feature extraction from the raw fault data. However, the models have some limitations: the number of parameters grows with the number of layers, and training large networks with many parameters from scratch is computationally expensive and requires a massive amount of labeled data. The performance of these DL models is also affected by parameter optimization and hyperparameter tuning.

To overcome the limitations of DL models, transfer learning (TL) has been incorporated into DL models so that knowledge gained from one problem can be used to solve a different but related problem [4, 13–20]. For example, pre-trained ImageNet weights have been used instead of random initialization to initialize the parameters of deep networks for classifying industrial faults. Although the deep transfer learning (DTL) model parameters are initialized using pre-trained weights, a large target training dataset is still required for parameter optimization in complex architectures to improve their accuracy. Optimizing the parameter initialization is a vital consideration in ensuring the accuracy of DL-based fault diagnosis.

1.1 Literature review

TL models have recently been applied in machine fault diagnosis. In [13], a new temporal CNN with a depth of 51 convolutional layers was applied, with ResNet-50 trained on ImageNet as a feature extractor. Wen et al. used the VGG-19 architecture in [14] and Inception V3 and TrAdaBoost as feature extractors in [15]. Grover et al. [4] utilized bi-spectrum contour maps of the vibration signals in four pre-trained networks comprising AlexNet, VGG-19, GoogLeNet, and ResNet-50. These architectures are limited by their large number of layers and high computational complexity. In addition, since TrAdaBoost depends on only a single source, its learning performance degrades when the source and target domains are weakly correlated.

Transfer component analysis (TCA) [16] and weighted transfer component analysis (WTCA) [17] have been used for fault diagnosis in rolling bearings under variable operating conditions. The large number of hyperparameters in TCA/WTCA led to difficulties during model training.

In [18], a sparse auto-encoder with three layers was used to scrutinize the raw data and extract features. The maximum mean discrepancy was applied as the discrepancy penalty to be minimized between the source and target data. In [19], a VGG16 pre-trained network was used to extract the lower-level features and label wavelet transform images. Fan et al. [20] implemented TL in a CNN by generating texture images using empirical mode decomposition with the pseudo-Wigner–Ville distribution. These state-of-the-art DTL models used pre-trained ImageNet weights to implement TL. Although there are 1000 object classes in the ImageNet dataset, the classes are only indirectly related to the target fault domain. In addition, the pre-trained DTL architectures consist of significantly more layers and trainable parameters than conventional DL architectures.

1.2 Contributions

Motivated by the earlier work of Peng et al. [12], this study investigates a texture-based 2D hybrid deep-CNN–LSTM architecture, because deep LSTM layers combined with a deep CNN architecture can learn the features of fault signals adaptively. Similar to the approach of Hasan et al. [21], TL was implemented by splitting the datasets into two subsets comprising the source and target task datasets for various environments. The source task dataset is used for training and validating the model and for saving the weights of the deep architecture; the target task dataset is then used for testing the classification accuracy using the weights obtained from the source task dataset. Unlike state-of-the-art TL models, ImageNet pre-trained weights were not used to initialize the architectures in this study.

The contributions of this study are summarized as follows:

  • First, we explain how Hilbert transform images based on analytical signals demonstrate invariant image patterns for fault signals using the benchmark fault datasets. In addition, Hilbert transform analytical imaging is compared with state-of-the-art methods comprising discrete wavelet transform (DWT), fast Fourier transform (FFT), and gammatone spectrogram-based texture approaches.

  • Second, a 2D DTL-based CNN–LSTM hybrid architecture for fault classification is presented. The proposed architecture is compared with conventional deep architectures such as the DCNN and deep LSTM. In our hybrid architecture, the combination of LSTM blocks and a DCNN can adaptively learn fault features more accurately, and the complexity of the hybrid architecture is significantly lower than that of the state-of-the-art architectures used in TL models. For example, the ratio of trainable parameters in VGG16 to that in the proposed architecture is 12.65:1.

  • Finally, the effectiveness of the proposed hybrid TL model is evaluated in various environments with different noise levels, loads, and machines using three public benchmark datasets of different sizes: the Machinery Failure Prevention Technology (MFPT) dataset of bearing vibration signals, and the malfunctioning industrial machine investigation and inspection (MIMII) and toy anomaly detection in machine operating sounds (ToyADAMOS) datasets of machine audio recordings. The proposed hybrid architecture with TL demonstrated better performance on the vibration and audio fault datasets, achieving higher F1 scores with approximately five times fewer epochs.

1.3 Outline

The rest of the paper is organized as follows. In Sect. 2, the proposed hybrid DTL architecture is presented in detail. The experimental results are discussed in Sect. 3. The paper is concluded in Sect. 4.

2 Proposed fault diagnosis methodology

The detailed steps in the proposed model are shown in Fig. 1. The model consists of three major blocks comprising dataset preparation with various loads, signal-to-noise ratios (SNRs), and revolutions per minute (RPMs); Hilbert transform 2D image generation; and a deep CNN–LSTM hybrid architecture as a classifier. In addition, to reduce the computational complexity, TL was implemented in the proposed model by dividing the datasets into the source and target task datasets. A brief description of each model block is presented in the following subsections.

Fig. 1 Step-by-step representation of the proposed model

2.1 Test rig and data descriptions

Three standard public fault datasets comprising the MFPT vibration fault dataset and the MIMII and ToyADAMOS machine audio fault datasets were used to evaluate the proposed model. The MFPT dataset was collected from a NICE bearing with 0.235 roller diameter, 1.245 pitch diameter, eight elements, and 0° contact angle [22]. There are two environments in the MFPT dataset corresponding to loads of 50–150 lbs and 200–300 lbs and two classes of faults comprising inner and outer race signals, as summarized in Table 1. There are 429 samples with a sample size of 1024 for each type of signal in each environment.

Table 1 Details of various load environments in MFPT dataset

MIMII is an industrial sound dataset in which sounds corresponding to different anomalies comprising contamination, leakage, unbalanced rotation, and rail damage were collected with background noise from four machines comprising a fan, pump, valve, and slide rail in an actual factory [23]. An eight-channel microphone array with a sampling rate of 16 kHz, with the microphones placed 45° apart from one another, was used in the collection rig. Among the eight microphones, the sound from the microphone nearest to each machine was used for that machine in the dataset. For example, the sound from the microphone at 180° was used for the fan, that from the microphone at 270° for the slide rail, that from the microphone at 0° for the valve, and that from the microphone at 90° for the pump. To construct the dataset for validating the model, the .wav audio files were converted into .mat files, and each .mat file was resized to a length of 1024. Three environments with SNRs of −6 dB, 0 dB, and +6 dB are included in the MIMII dataset, as summarized in Table 2. There are eight classes in the MIMII dataset comprising the normal and abnormal conditions for the fan, pump, slider, and valve. The dataset contains 400 signals each for the normal fan, pump, slider, and valve and for the abnormal fan, and 143, 356, and 119 signals for the abnormal pump, slider, and valve, respectively.

Table 2 Details of environments with various SNRs in MIMII dataset

ToyADAMOS is a machine operating sound dataset that was collected from four microphones at a sampling rate of 48,000 Hz [24]. There are two types of sounds comprising normal and anomalous sounds for three different toy machines comprising a toy car (machine-condition inspection), toy conveyor (fixed), and toy train (moving). The dataset contains sounds for three different environments denoted as case1, case2, and case3. Each case contains a total of 72,000 individual samples, which includes normal and anomalous sounds from the toy car, toy conveyor, and toy train, as summarized in Table 3. The anomalous sounds were collected by damaging the machine components or adding additional objects. The three cases in the toy car data were generated by changing the motor and bearing, those in the toy conveyor cases were generated with three different sizes of machines, and those in the toy train cases were generated with different types and scales of toy trains. Some sample signals from the MFPT, MIMII, and ToyADAMOS datasets are shown in Fig. 2.

Table 3 Details of various environments in ToyADAMOS dataset

Fig. 2 Sample signals from the MFPT, MIMII, and ToyADAMOS datasets

2.2 Hilbert transform 2D grayscale image generation

In this study, the Hilbert transform is used to generate 2D images from the original bearing fault signals. It is an effective method for performing spectrum analysis on time-domain signals. Unlike the Fourier and wavelet transforms, it operates directly on time-domain signals in real time, without the need to transform them into the space or frequency domain. Applying the Hilbert transform to the time-domain signal \(y\left(t\right)\) yields a complex-valued analytical signal \(\overline{y}(t)\), which has real and imaginary parts.

The Hilbert transform of a signal y(t) can be written as

$$y\left(t\right)\stackrel{\mathrm{HT}}{\to }\overline{y}(t)={\overline{y}}_{\mathrm{Re}}\left(t\right)+j\,{\overline{y}}_{\mathrm{Im}}\left(t\right)$$
(1)

where \(\overline{y}(t)\) is an analytic signal and \({\overline{y}}_{\mathrm{Im}}\left(t\right)\) represents the Hilbert transform of the signal \({\overline{y}}_{\mathrm{Re}}\left(t\right)\). The amplitude A(t) of the time signal y(t) is given by

$$A\left(t\right)=\sqrt{{{\overline{y}}_{\mathrm{Re}}}^{2}\left(t\right)+{{\overline{y}}_{\mathrm{Im}}}^{2}(t)} .$$
(2)

This technique was used to extract the detailed phase shift between the real and imaginary components in an earlier study [25]. The spectrum amplitude was used to detect the existence of faults in electric machines. Since the spectrum amplitude of the analytic signal is obtained from its real and imaginary components, the amplitude of the analytical signal is used to extract the uniform texture pattern of the fault signals in this study.
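
For reference, the standard definition underlying Eqs. (1) and (2), which the text above uses without stating, can be written out together with a simple worked case. The symbols \(a\) and \(\omega\) are introduced here only for illustration and do not appear in the original.

$$\overline{y}_{\mathrm{Im}}\left(t\right)=\frac{1}{\pi }\,\mathrm{p.v.}\int_{-\infty }^{\infty }\frac{y\left(\tau \right)}{t-\tau }\,d\tau ,\qquad {\overline{y}}_{\mathrm{Re}}\left(t\right)=y\left(t\right)$$

For a pure tone \(y(t)=a\cos(\omega t)\) with \(a>0\) and \(\omega>0\), this gives

$$\overline{y}_{\mathrm{Im}}\left(t\right)=a\sin\left(\omega t\right),\qquad \overline{y}\left(t\right)=a\,e^{j\omega t},\qquad A\left(t\right)=\sqrt{a^{2}\cos^{2}\left(\omega t\right)+a^{2}\sin^{2}\left(\omega t\right)}=a .$$

The envelope of an unmodulated tone is therefore constant. For a signal whose amplitude varies slowly relative to its oscillation, \(A(t)\) approximately follows that variation, which is presumably why the amplitude is a convenient quantity to render as a texture image.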

The generation of the 2D grayscale images from the original fault signals by applying the Hilbert transform is briefly described here. Since the analytical signal is complex, three different types of signals comprising the real and imaginary parts and the absolute values may be extracted from a time-series signal, as shown in Fig. 3a.

Fig. 3 Texture image generation from sample vibration fault signals. a Real and imaginary parts and absolute values of Hilbert transform signals for three sample fault signals, b amplitude of signal components in (a), and c texture patterns generated from signals in (b)

Among the three parts of the analytical signal, we consider only the conversion of its amplitude (i.e., absolute value, as depicted in Fig. 3b) to a 2D grayscale image following the approach in [8]. The 1D signal with a length of 1024 was subdivided into blocks with a length of 32 to generate a 32 × 32 2D grayscale image. Sample texture images generated from the analytical inner and outer fault signals are presented in Fig. 4. The images demonstrate a uniform texture pattern for each type of signal. A 2D deep CNN–LSTM model was utilized to classify the faults represented as uniform 2D grayscale images, as discussed in the following subsection.

Fig. 4 Sample reconstructed 32 × 32 texture images: a inner and b outer race signals
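
The image generation step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `analytic_signal` and `envelope_image` are hypothetical, the analytic signal is built with the usual FFT construction (the same idea used by `scipy.signal.hilbert`), and the min-max scaling to 8-bit gray levels is an assumption, since the text does not state how amplitudes are mapped to pixel intensities.

```python
import numpy as np


def analytic_signal(y):
    """Return the complex analytic signal of a real 1D signal (FFT method)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    spectrum = np.fft.fft(y)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)


def envelope_image(y, side=32):
    """Map the amplitude A(t) of Eq. (2) to a side x side grayscale image."""
    amplitude = np.abs(analytic_signal(y))
    if amplitude.size != side * side:
        raise ValueError("signal length must equal side * side")
    lo, hi = amplitude.min(), amplitude.max()
    # Assumed scaling: min-max normalisation to the 0..255 range.
    if hi - lo > 1e-12:
        scaled = (amplitude - lo) / (hi - lo)
    else:
        scaled = np.zeros_like(amplitude)
    # Consecutive blocks of `side` samples become the image rows.
    return np.round(255 * scaled).astype(np.uint8).reshape(side, side)


# Synthetic amplitude-modulated test signal of length 1024 (illustrative only).
t = np.arange(1024)
signal = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t / 1024)) * np.cos(2 * np.pi * 128 * t / 1024)
image = envelope_image(signal)
print(image.shape, image.dtype)
```

Row-wise reshaping places each consecutive block of 32 samples on one image row, which matches the block subdivision described in the text; any other pixel ordering would be a different design choice.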

2.3 2D Deep CNN–LSTM hybrid architecture for fault recognition

The architecture was constructed using the Keras Sequential API. The signals were reshaped for image generation using the NumPy library. These images represent signals belonging to the various classes, and the image size defines the input of the Keras model. All the pixels of each image were forwarded to the first layer of the network. The input shape was declared explicitly at the outset so that separate images were not merged. To reduce the computation time, the image size was set to 32 × 32. Processing each image with a separate convolution block leads to several problems, such as a long training time, the extraction of different features from different images, and frame-to-frame changes in the characteristics of the time-series data; therefore, a series of time-distributed layers was utilized so that the same transformations could be performed across the list of input images. Although a convolutional layer usually requires only the height, width, and number of channels as its input shape, the time-distributed Conv2D layer additionally requires the number of images that are passed in at each step. Therefore, the input shape was set to (1, 32, 32, 1), where the first 1 indicates that one image goes through the subsequent layer at each step.

The number of channels was set to one because the input images were grayscale. The input shape did not need to be declared for the subsequent layers because Keras infers the shapes needed to connect the layers automatically. The kernel size was set to 5 × 5 in the first layer and to 3 × 3 subsequently. The number of filters in the first layer was set to 32 because the lower layers extract features from small parts of the images, and it was increased to 64 in the subsequent layers to detect high-level features. The ReLU activation function was added to every time-distributed Conv2D layer to introduce nonlinearity into the system. A 2 × 2 max-pooling layer (MaxPool2D) was added after each Conv2D layer to down-sample the feature maps by selecting the largest value in each 2 × 2 neighborhood. A dropout layer with a dropout rate of 25% was added after each max-pooling layer to drop a quarter of the neurons randomly and thereby avoid overfitting. The DCNN contains three more Conv2D layers with 64 filters each. The output was then flattened into a 1D vector for input to the LSTM. After the time-distributed layers, the images were processed frame by frame in chronological order by two LSTM layers with 64 units (LSTM cells) each. The sigmoid activation function was used in the LSTM to form a smooth curve varying from 0 to 1. The return sequences option (return_sequences in Keras) was set to true so that an output is returned at every node rather than only at the final node. Two dense layers with 64 neurons each and the ReLU activation function were added, followed by a 30% dropout layer. The size of the output dense layer varied depending on the dataset. For example, since the MIMII dataset has eight classes, the sigmoid activation function was used to derive the probabilities of the eight output neurons corresponding to these eight classes. A similar approach was applied for the MFPT and ToyADAMOS datasets, for which the output size was set to 2 and 6, respectively.

The sigmoid activation function was used in the output dense layer, and RMSprop was used as the optimizer with a learning rate of 0.0001. Categorical cross-entropy was used as the loss function for classification. In the experimental evaluation, all the simulations for the three datasets were run for 100 epochs. The model architecture and the detailed layer information are presented in Fig. 5.

Fig. 5 Architecture of proposed hybrid model for detecting fault signals starting with image resizing followed by four time-distributed conv2D layers and LSTM layers with 64 units
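
To make the layer stack described in this subsection easier to follow, the sketch below traces the output shape and trainable-parameter count of each layer in plain Python. It is an illustration under stated assumptions, not the authors' code: it assumes 'same' padding with stride 1 in every convolution and a 2 × 2 pooling step after each of the four convolutional layers, neither of which is specified in the text, and the helper names are hypothetical. The leading time dimension of 1 is omitted from the shapes, and dropout layers are left out because they have no parameters. Because of these assumptions, the total it prints is not expected to reproduce the parameter count reported in Sect. 3.2.

```python
def conv2d_params(kernel, c_in, c_out):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return (kernel * kernel * c_in + 1) * c_out


def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Fully connected layer: one weight per input plus a bias, per unit."""
    return (input_dim + 1) * units


def trace(num_classes, side=32):
    """Return (layer name, output shape, trainable parameters) rows."""
    rows = []
    channels = 1  # grayscale input
    # (kernel size, filters): 5x5 with 32 filters first, then 3x3 with 64 filters.
    for i, (kernel, filters) in enumerate([(5, 32), (3, 64), (3, 64), (3, 64)], 1):
        # Assumption: 'same' padding keeps the spatial size unchanged.
        rows.append((f"conv{i}", (side, side, filters),
                     conv2d_params(kernel, channels, filters)))
        side //= 2  # 2x2 max pooling halves each spatial dimension
        rows.append((f"pool{i}", (side, side, filters), 0))
        channels = filters
    features = side * side * channels
    rows.append(("flatten", (features,), 0))
    rows.append(("lstm1", (64,), lstm_params(features, 64)))
    rows.append(("lstm2", (64,), lstm_params(64, 64)))
    rows.append(("dense1", (64,), dense_params(64, 64)))
    rows.append(("dense2", (64,), dense_params(64, 64)))
    rows.append(("output", (num_classes,), dense_params(64, num_classes)))
    return rows


rows = trace(num_classes=8)  # eight classes, as for the MIMII dataset
for name, shape, params in rows:
    print(f"{name:8s} {str(shape):14s} {params:>7d}")
print("total", sum(p for _, _, p in rows))
```

The factor of four in `lstm_params` reflects the four gate transformations inside each LSTM cell, which is also why stand-alone recurrent layers become expensive when their input vectors are large.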

To implement TL in the proposed model, the weights learned on the source (training) dataset were saved and later reused for the target fault dataset. It is expected that these domain-specific pre-trained weights can have a significant positive impact on the accuracy achieved on the test dataset.

3 Experimental results analysis

For the experimental evaluation, we used the F1 score to numerically analyze the performance of the proposed fault diagnosis model. The F1 score is derived from the precision (the fraction of predicted positives that are actually positive) and the recall (the fraction of actual positives that are correctly detected), as shown in Eqs. 3–5:

$${\text{Precision}} = \frac{{{\text{True}}\,{\text{Positive}}}}{{{\text{True}}\,{\text{Positive}} + {\text{False}}\,{\text{Positive}}}}$$
(3)
$${\text{Recall}} = \frac{{{\text{True}}\,{\text{Positive}}}}{{{\text{True}}\,{\text{Positive}} + {\text{False}}\,{\text{Negative}}}}$$
(4)
$$F1{\text{ Score}} = 2 \times { }\frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(5)
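
Eqs. 3–5 can be evaluated directly from a confusion matrix, as in the short sketch below. The function names and the example matrix are hypothetical and are not taken from the experiments; the macro average (an unweighted mean over classes) is an assumption, since the averaging scheme is not stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. 3-5 for one class, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_class_f1(confusion):
    """F1 per class; rows are true labels, columns are predicted labels."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, true other
        scores.append(precision_recall_f1(tp, fp, fn)[2])
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Hypothetical two-class confusion matrix, for illustration only.
example = [[50, 0],
           [2, 48]]
print(per_class_f1(example), macro_f1(example))
```

For a multi-class problem such as the eight MIMII classes, each class is treated in turn as the positive class, with all other classes pooled as negatives.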

For consistency, the same hyperparameters such as the learning rate, batch size, and epochs were used in all the experiments to evaluate the performance of each model in a similar environment.

3.1 Performance evaluation of proposed model

In the CNN–LSTM architecture, the front-end CNN layers and the LSTM layers (detailed in Sect. 2.3) function as feature extractors. In a previous study [11], the LSTM was used as a stand-alone model that took the raw data signals as input, so the model was trained with only sequential information. In the proposed architecture, the features of the Hilbert transform 2D images are extracted by the deep CNN layers and fed to the LSTM layers, which capture the temporal information. The results for Experiments 1, 2, and 3 performed using the MFPT, MIMII, and ToyADAMOS datasets, respectively, are discussed below.

3.1.1 Experiment 1: MFPT dataset

The MFPT dataset is smaller than the sub-datasets generated from the MIMII and ToyADAMOS datasets. However, this dataset has more variance because the signal loads varied from 50 to 300 lbs. Nonetheless, the hybrid architecture generalized better than conventional architectures and performed impressively when combined with the Hilbert transform 2D images. By utilizing the added information, it achieved an F1 score of 1 after only 20 epochs in both environments (Environment 1 for 50–150 lbs and Environment 2 for 200–300 lbs) and maintained that score throughout the remainder of the training. This indicates that the model was able to learn very quickly without overfitting, which would have produced spikes in the training and validation curves. Figure 6 shows that the training and validation loss curves converged smoothly in fewer than 20 epochs and maintained a near-zero loss. All the batches achieved very good results and maintained near-zero validation loss in both environments.

Fig. 6 Training and validation loss of the proposed model for Environment 1 (50–150 lbs) and Environment 2 (200–300 lbs) of MFPT dataset

There are only two classes comprising inner and outer race faults in the two environments. 20% of the data was used for testing in the experimental evaluation. For both environments, the proposed model successfully classified both classes accurately (as shown in Fig. 7) based on the amplitudes of the analytical signal images. The good performance of the proposed model across a varied range of loads reflects its broad applicability.

Fig. 7 Confusion matrices of proposed model for different environments in the MFPT dataset

3.1.2 Experiment 2: MIMII dataset

Compared to the vibration dataset used in Experiment 1, the MIMII dataset is more complex. Various actual noisy industrial environments were captured in the collected audio signals. The MIMII dataset contains a total of 7854 sample signals across three environments with SNRs of −6 dB, 0 dB, and +6 dB. There are eight classes of data, which are denoted as 0 for fan (normal), 1 for fan (abnormal), 2 for pump (normal), 3 for pump (abnormal), 4 for slider (normal), 5 for slider (abnormal), 6 for valve (normal), and 7 for valve (abnormal) in the confusion matrix. The dataset was split in an 80:20 ratio for training and validation, and the model was run for 100 epochs in all cases; the losses are shown in Fig. 8. Figure 8 shows that for the −6 dB case, the hybrid model took longer to converge than for the 0 dB and +6 dB cases. However, after the sixtieth epoch, it stabilized and reached its optimal state with a maximum F1 score of 99.6% on the test dataset. The signals with an SNR of +6 dB converged faster with less performance variance.

Fig. 8 Training and validation loss curves of proposed model for different noisy environments in MIMII dataset

Similar to Experiment 1, the proposed model correctly detected the eight classes in the three environments, with the exception of a very few signals. The model accurately classified the normal and abnormal fan, pump, and slider audio signals in most cases, with the exception of only one signal in the 0 dB and +6 dB cases. However, four normal valve signals in the −6 dB case were not detected successfully. The detailed confusion matrices of the proposed model for the MIMII audio machine dataset are shown in Fig. 9.

Fig. 9 Confusion matrices of proposed model for MIMII dataset for SNRs of a −6 dB, b 0 dB, and c +6 dB

3.1.3 Experiment 3: ToyADAMOS dataset

In this experiment, a larger and more complex dataset, i.e., ToyADAMOS, was used to evaluate the performance of the proposed model. There are six classes of data comprising the ToyCar_normal, ToyCar_anomalous, ToyConveyer_normal, ToyConveyer_anomalous, ToyTrain_normal, and ToyTrain_anomalous classes and a total of 72,000 samples in the three cases in the dataset. These classes are, respectively, denoted as 1–6 in the confusion matrices in Fig. 10. Similar to Experiments 1 and 2, the dataset was split in an 80:20 ratio for training and testing. The simulations were run for 100 epochs for all the cases. The model accurately detected all kinds of fault signals in all three cases, as shown in Fig. 10. In all the cases, the model took only a few epochs to converge, as it did for the datasets in Experiments 1 and 2. As shown in Fig. 11, the model stabilized after the twentieth epoch and reached its optimal state on the test dataset.

Fig. 10 Confusion matrices of proposed model for different environments in ToyADAMOS dataset

Fig. 11 Training and validation loss curves of proposed model for different environment cases in ToyADAMOS dataset

3.2 Comparison with other state-of-the-art models

A detailed comparison of the proposed hybrid architecture with three state-of-art transformed-signal techniques based on conventional CNN and LSTM DL models using the MFPT, MIMII, and ToyADAMOS datasets is presented in this section.

Table 4 shows a performance comparison of the Hilbert transform-based texture extraction method with methods based on DWT, FFT, and the gammatone spectrogram on the MIMII dataset. The FFT-based method exhibited the worst performance, while the Hilbert transform-based method outperformed the other methods. The overall performance of DWT depends on the selected kernel function; in addition, DWT has poor directionality, is not shift-invariant, and does not contain phase information. We utilized the db4 kernel function for the evaluation. In contrast, similar to the Hilbert transform, the transformed signal after FFT contains both magnitude and phase information. However, the domain conversion results in high latency because the data are not processed in the same order as the input data. The application of several filter banks in the gammatone spectrogram leads to higher computational complexity than the other transform techniques. For all three environments in the MIMII dataset, the Hilbert transform-based texture feature extraction achieved better F1 scores, with an average of 0.99 compared with averages of 0.94, 0.88, and 0.90 for the DWT, FFT, and gammatone spectrogram-based transform methods, respectively.

Table 4 F1 scores of DWT, FFT, and Hilbert transform analytical signal-based texture extraction with deep CNN–LSTM model for MIMII dataset

The hybrid model has more trainable parameters than the stand-alone DCNN and LSTM architectures but far fewer than state-of-the-art TL architectures. More specifically, the DCNN, LSTM, and DCNN–LSTM architectures have 328,102, 62,406, and 1,168,294 trainable parameters, respectively, while the VGG16 architecture has 14,779,974 parameters, which is more than 10 times that of the hybrid architecture.
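
As a quick arithmetic check of the parameter counts quoted above (a restatement of the reported figures, not a new measurement), the corresponding ratios are

$$\frac{14{,}779{,}974}{1{,}168{,}294}\approx 12.65,\qquad \frac{1{,}168{,}294}{328{,}102}\approx 3.56,\qquad \frac{1{,}168{,}294}{62{,}406}\approx 18.72 .$$

That is, VGG16 is roughly 12.65 times larger than the hybrid architecture, while the hybrid architecture itself is roughly 3.6 times larger than the stand-alone DCNN and roughly 18.7 times larger than the stand-alone LSTM.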

The experimental results in Table 5 show that all the architectures successfully detected the fault signals in the MFPT dataset under various load conditions. Although all the models detected the fault signals accurately, the training and validation loss curves of the proposed architecture were smoother and converged earlier than those of the other state-of-the-art models. In contrast, for the MIMII and ToyADAMOS datasets, the deep LSTM model exhibited lower training accuracy than its validation and testing accuracies because it could not extract deeper features as a stand-alone model. The LSTM model failed to learn and predict several cases that the DCNN and the proposed architecture handled successfully. The proposed model outperformed the other models under the different environments of the three datasets by achieving higher accuracy and smoother training and validation curves than those of the state-of-the-art models.

Table 5 Accuracies of deep CNN, deep LSTM, and proposed model for MFPT, MIMII, and ToyADAMOS datasets

The results demonstrate that the proposed model is efficient not only for a particular dataset but also for the different environments in the three datasets—the MFPT dataset covers loads with large variances, the MIMII dataset contains both negative-scale and positive-scale noisy cases, and ToyADAMOS contains data from different machines with various specifications. This demonstrates an important aspect of the model performance under different conditions that may be present in real-life situations. It can thus be concluded that the proposed model classifies faults efficiently and makes accurate predictions in the given environments with varying data complexities.

3.3 Implementation of transfer learning in proposed model

The conditions for rotary machine fault detection can vary because of environmental variations and the physical characteristics of the machines. The experiments presented so far are therefore limited to environment-specific conditions. To reduce the gap between the different environments, TL was implemented by interconnecting the various environments.

To reduce the training time, several researchers have recently used ImageNet pre-trained weights to test for fault signals, as discussed in Sect. 1.1. In this study, the trained weights obtained using a source fault dataset were saved and used for training/testing the target datasets with a completely different set of conditions. For example, we trained the model using case 1 of the ToyADAMOS dataset. The model took more than 50 epochs to converge. The weights were then saved, the model was retrained using case 2, and the previous weights were updated according to the new environment samples. The model took only approximately 10 epochs to converge this time because it has already learned features from the previous training. The model required even fewer epochs to converge in case 3, as shown in Fig. 12. Similarly, for MFPT, we trained the model using the data for 50–150 lbs loads. Although we ran the model for 100 epochs, the best model was obtained after the twentieth epoch. The best weights were then utilized for training using the data for 200–300 lbs loads. This time, the model took only seven epochs to converge with the trained weights. Since the MFPT dataset has less variance and more differentiable features between its two classes, it converged more quickly compared to the other two datasets.

Fig. 12 Training and validation loss curves for proposed model under different environments in the ToyADAMOS dataset: a case 1 as source task and case 2 as target task; b case 1 as source task and case 3 as target task
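
The weight-reuse idea described in this subsection can be illustrated with a deliberately small toy model. The sketch below is not the CNN–LSTM of Sect. 2.3: it swaps in a logistic regression trained by full-batch gradient descent on synthetic two-class data, and every name, dataset, and threshold in it is hypothetical. It only shows the mechanism: weights saved after training on a source environment are used to initialize training on a related target environment, and the number of epochs needed to reach a loss threshold is compared with a cold start from zeros.

```python
import numpy as np


def make_environment(rng, center, n_per_class=200):
    """Two Gaussian blobs at +center and -center, labelled 1 and 0."""
    center = np.asarray(center, dtype=float)
    pos = rng.normal(size=(n_per_class, 2)) + center
    neg = rng.normal(size=(n_per_class, 2)) - center
    x = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return x, y


def loss_and_grad(w, x, y):
    """Binary cross-entropy of a logistic model (2 weights + bias) and its gradient."""
    z = x @ w[:2] + w[2]
    p = 1.0 / (1.0 + np.exp(-z))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    err = p - y
    grad = np.append(x.T @ err, err.sum()) / len(y)
    return loss, grad


def train(w, x, y, epochs, lr=0.1, tol=None):
    """Full-batch gradient descent; stop early once the loss drops below tol.

    Returns the final weights and the number of update epochs actually run.
    """
    w = w.copy()
    for epoch in range(epochs):
        loss, grad = loss_and_grad(w, x, y)
        if tol is not None and loss < tol:
            return w, epoch
        w -= lr * grad
    return w, epochs


rng = np.random.default_rng(0)
source = make_environment(rng, center=(2.0, 2.0))   # stand-in for the source task
target = make_environment(rng, center=(2.5, 1.5))   # related but shifted target task

# Train on the source with a fixed epoch budget and keep ("save") the weights.
w_source, _ = train(np.zeros(3), *source, epochs=500)

# Epochs needed on the target to reach the same loss threshold.
_, cold = train(np.zeros(3), *target, epochs=5000, tol=0.1)
_, warm = train(w_source, *target, epochs=5000, tol=0.1)
print(f"epochs to reach loss < 0.1 on target: cold start={cold}, warm start={warm}")
```

In this toy setting the warm start may need very few or even no additional epochs because the two environments are closely related; how much is saved in practice depends on how similar the source and target conditions are.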

The same approach was applied to the MIMII dataset, as presented in Table 6. The model was pre-trained with the −6 dB data and subsequently trained with the two remaining sets of environmental data (0 dB and +6 dB). The experimental results show that as the SNR increased (i.e., as the noise decreased), the model took less time to converge and better models were achieved. After implementing TL in the proposed model, fewer than 20 epochs were needed to classify the fault signals in all datasets accurately, instead of the 100 epochs used without TL. Since TL-based hybrid DL significantly reduces the training time, it is highly suitable for real-time industrial fault diagnosis in various environments.

Table 6 Computational cost, in number of epochs, of the proposed model with and without TL

4 Conclusion

This paper presented an industrial fault diagnosis model based on the Hilbert transform and a 2D deep CNN–LSTM architecture. The model was used to classify faults in different environments with loads ranging from 50 to 300 lbs and SNRs from −6 dB to +6 dB. Two environments were included in the MFPT dataset and three environments each in the MIMII and ToyADAMOS datasets. The Hilbert transform analytical signal-based texture extraction method was compared with the state-of-the-art DWT, FFT, and gammatone spectrogram-based methods. The Hilbert transform-based 2D image generation outperformed the state-of-the-art transform methods because it extracted more effective features. The F1 score was used as the metric for evaluating the performance of the proposed and state-of-the-art models. The state-of-the-art models did not perform consistently well as the motor load and noise increased. In contrast, the proposed model exhibited consistent performance in all environments with varying loads, RPMs, and noise levels. The proposed model also had approximately 12 times fewer trainable parameters than state-of-the-art TL models. Implementing TL with domain-specific fault datasets reduced the average training time by a factor of more than five compared with training from scratch for every machine-specific environment. It is therefore expected that the proposed model can play a significant role in real-time industrial fault diagnosis in environments with various loads, RPMs, and noise levels.

The performance of the proposed hybrid DTL model was evaluated under various environments in a single machine. In future work, incremental learning techniques can be evaluated with more complex fault datasets from different machines.