1 Introduction

Due to the complexity of tasks in different research areas [7, 15], even when deep learning techniques are used, exploiting different features through fusion approaches can improve the final results [14]. It is well established that the fusion process can improve the classification rate when multiple sources are considered [20]. The process aims to combine local information from different views [44]. For classification problems, fusion can occur at the feature level [37] or at the classifier level [38]. Fusion at the feature level aggregates features extracted from multiple sources into a common space, which can be represented as a single space or as separate spaces. Fusion at the classifier level obtains a decision from a combination of individual classifiers, each trained on a separate view. Feature fusion has evolved from simple feature concatenation to complex fusion methods. Although newer fusion methods can outperform the traditional method of concatenating features, noise among the features can degrade their accuracy. Different approaches have been used to address this problem; one of them concentrates on separating the views into common aspects [21].

Fig. 1 Overview of the proposed system

To address these problems, this paper presents a novel fusion approach in which the views (feature extraction methods) are separated into high and low frequencies using the wavelet transform. The wavelet transform separates the components of the features into different frequency bands, allowing a sparser representation of the features. At the same time, the decomposition reduces the impact of noise on the fusion method. The wavelet coefficient values in the low- and high-frequency subbands disclose important information about the signal (feature) structure: high coefficient values typically correspond to complex spikes, whereas low values correspond to smooth regions. The benefits of separating information with the wavelet transform were demonstrated in [4, 5, 6, 40]; such separated information, which may be any transformation of the original feature (e.g., applying filters to the signal), is useful for improving the accuracy of the classification step. The purpose of using the low- and high-frequency wavelet subbands is to filter out noise while preserving the feature map structures: noise in the features affects what the classifier learns, and the preserved structures are very useful for feature-level fusion. We use joint sparse representation, one of the most popular fusion methods, to fuse the low and high frequencies. Figure 1 illustrates the overview of our proposed approach. As mentioned above, fusion methods, especially sparse representation methods, suffer from noise in the features, which can affect their results [16]; wavelets are therefore a suitable choice for avoiding this problem. Additionally, we explore the impact of the decomposition levels and of two states that can occur in the fusion step: in the first state, shown in Fig. 1, the low and high frequencies are fed into the fusion method separately; in the other state, all of the frequencies are fused simultaneously. The main contributions of this work are as follows:

  • Multifeature fusion approach: A novel fusion approach combining wavelets and joint sparse representation is presented.

  • Exploring view separation in a fusion approach: To the best of the authors’ knowledge, separating features in a sparse frequency space for classification problems is investigated for the first time.

  • Improved accuracy for multiview classification: We show that the proposed method achieves superior performance compared with state-of-the-art fusion approaches.

The remainder of this paper is organized as follows. Section 2 provides an overview of the related works, and the proposed approach is presented in Sect. 3. Experimental results are reported in Sect. 4, while Sect. 5 concludes this paper.

2 Related works

The aim of multifeature learning is to reveal and relate the correlations of features across different views. Approaches that address this aim (similarity across features) can be categorized into three groups: multikernel learning [29, 39], subspace learning [22, 46], and sparse representation [1, 2, 8]. Since we focus on the sparse representation approach, we review the state of the art in this category. Sparse representation, which approximates data using a few dictionary atoms, has attracted many researchers [1, 8, 19, 25, 26, 27, 28, 30, 31, 49, 50, 52]. A relaxed collaborative representation (RCR) approach was proposed in [49]; it assumes that the representation coefficients of the different features share a common component, and it minimizes the sparse codes by penalizing the sum of the distances of the coefficients from their average. Reference [50] used the \(\ell _{1,2}\) norm to obtain a joint sparse representation for multiple features (MTJSRC) and tested the method on high-dimensional data. Li et al. [27] proposed a joint discriminative collaborative representation (JDCR) approach that fuses multiple features with the aim of obtaining representation coefficients that are both similar across views and discriminative. Reference [19] presented a joint feature extraction method to align multifeature groups and introduced a feature selection method for dimensionality reduction. Partial multiview clustering (PVC) was presented in [30], in which the data were considered incomplete; nonnegative matrix factorization (NMF) [25] was used to train a latent subspace. In [8] and [26], a sparse representation model based on dictionary learning was introduced that obtained promising results when multimodal features were considered. To handle missing data in the multifeature extraction step, Zhao et al. [52] presented a partial multifeature unsupervised framework that preserves the similarity structure across different features. Nonparametric sparsity-based learning that reduces the dimensionality of multiple features using matrix decomposition was presented in [31]. In [28], both specific and similar components were used to learn multiple features extracted for the problem of diabetes mellitus and impaired glucose regulation, and effective results were reported.

Although the aforementioned multifeature fusion methods have achieved promising results in different classification and clustering applications, there is still room for improvement. We therefore present a novel multiview learning approach. In general, most methods use all features simultaneously and follow one of the two common structures shown in Fig. 2. In the first structure (Fig. 2a), the fusion method is applied to all views, and the result is a set of fused views corresponding to each view, i.e., the number of views after fusion equals the number of original views. A classifier is then used for each view, and finally the classifier outputs are fused. In the second structure (Fig. 2b), the output of the fusion method is a single feature space that can be fed into a classifier. The method proposed in this study uses the first structure.

Fig. 2 Different structures for multifeature fusion

3 Proposed method

Background information about the building blocks of our approach, namely the wavelet transform and joint sparse representation, together with our implementation, is presented in the following subsections.

3.1 Wavelet transform

Wavelets are a tool for analysing signals with discontinuities and sharp spikes, implemented with a pair of high-pass and low-pass filters. Compared with other transforms, structures such as sharp and fast fluctuations are better preserved, which also makes wavelets well suited for noise removal. Additionally, they play an important role in feature extraction. To implement a 1D wavelet transform, we use a high-pass and a low-pass filter [34]. Let \(S=\left\{ s_{1}, s_{2},\ldots , s_{N} \right\}\) be the dataset captured in N different views (feature extraction methods). For each view, we have a feature vector whose length depends on the feature extraction method. For the decomposition step, we compute the approximation (cA) and detail (cD) subbands by convolving each vector with a high-pass filter (Hi_F) to obtain the detail coefficients and with a low-pass filter (Lo_F) to obtain the approximation coefficients:

$$\begin{aligned}&cD=s_{n} *Hi\_F \quad (H\text { wavelet}), \end{aligned}$$
(1)
$$\begin{aligned}&cA=s_{n} *Lo\_F \quad (L\text { wavelet}), \end{aligned}$$
(2)

where the H and L wavelets correspond to the high-frequency and low-frequency components, respectively, i.e., the outputs of the high-pass and low-pass filters.
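
To make the decomposition step concrete, the following minimal sketch shows Eqs. (1) and (2) for a single view, assuming the PyWavelets (pywt) package; `feature_vec` is a hypothetical feature vector standing in for \(s_{n}\) and is not taken from the paper's code.

```python
import numpy as np
import pywt

# Hypothetical feature vector for one view (s_n); the length is illustrative.
rng = np.random.default_rng(0)
feature_vec = rng.standard_normal(256)

# Single-level DWT: filter with the low-pass (Lo_F) and high-pass (Hi_F)
# filters of the chosen wavelet, yielding approximation (cA) and detail (cD).
cA, cD = pywt.dwt(feature_vec, 'db2')

# The transform is invertible, so the original feature is recoverable.
reconstructed = pywt.idwt(cA, cD, 'db2')
assert np.allclose(reconstructed[:len(feature_vec)], feature_vec)
```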

3.2 Multiview joint sparse representation

Since wavelets produce frequencies in a sparse space, we select a fusion method based on sparse representation. An efficient tool for fusing multiple features is joint sparse representation [12, 51]. Let \(FE=\{1,\ldots ,FE\}\) be a finite set of available feature extraction methods and \(X^{fe}=[x_{1}^{fe},x_{2}^{fe},\ldots ,x_{N}^{fe}]\in {\mathbb {R}}^{n^{fe}\times N}, fe \in FE\), be the collection of N (normalized) training samples of method fe, where the data are assumed to be statistically independent and \(x^{fe}\) is the feature vector for the \(fe^{\mathrm{th}}\) method (view). To address the fusion step, the method formulates it with a dictionary \(D^{fe}\in {\mathbb {R}}^{n^{fe} \times d}\) corresponding to the \(fe^{\mathrm{th}}\) method. Therefore, we have multifeature dictionaries constructed from the data extracted by the different methods; that is, the \(j^{\mathrm{th}}\) atom of dictionary \(D^{fe}\) is the \(j^{\mathrm{th}}\) sample produced by the \(fe^{\mathrm{th}}\) method. If \(\left\{ x^{fe}\mid fe \in FE \right\}\) is a multifeature sample, then we can solve the \(\ell _{1,2}\)-regularized reconstruction problem to obtain the optimal sparse code matrix \(A^{*} \in {\mathbb {R}}^{d\times FE}\):

$$\begin{aligned} l(x,D)&\doteq \min _{A=\left[ \alpha ^{1}\ldots \alpha ^{FE} \right] } \frac{1}{2}\sum _{fe=1}^{FE}\left\| x^{fe} -D^{fe}\alpha ^{fe} \right\| ^{2}_{\ell _{2}} \nonumber \\&\quad + \lambda _{1}\left\| A \right\| _{\ell _{1,2}} +\frac{\lambda _{2}}{2}\left\| A \right\| _{F}^{2}, \end{aligned}$$
(3)

where \(\lambda _{1}\) and \(\lambda _{2}\) are regularization parameters. To guarantee a unique solution, the Frobenius norm term \(\left\| \cdot \right\| _{\mathrm{F}}\) is added to the joint sparse optimization problem [8]. Here, \(\alpha ^{fe}\) is the \(fe^{\mathrm{th}}\) column of A, which is the sparse representation for the \(fe^{\mathrm{th}}\) method. The \(\ell _{2}\) norm of a vector \(x\in {\mathbb {R}}^{m}\) and the \(\ell _{1,2}\) norm of a matrix \(X \in {\mathbb {R}}^{m\times n}\) are defined as \(\left\| x \right\| _{\ell _{2}} =(\sum _{j=1}^{m}\left| x_{j} \right| ^{2})^{1/2}\) and \(\left\| X \right\| _{\ell _{1,2}}=\sum _{i=1}^{m}\left\| x_{i\rightarrow } \right\| _{\ell _{2}}\), where \(x_{i\rightarrow }\) is the \(i^{\mathrm{th}}\) row of the matrix. Several algorithms have been proposed to solve this optimization problem [36]; to find \(A^{*}\), we applied the alternating direction method of multipliers (ADMM) [35]. The multimodal dictionaries are obtained from the optimization problem:

$$\begin{aligned} D^{fe*}=\underset{D^{fe}\in {\mathbb {D}}}{\arg \min } \ E_{x^{fe}}[l(x^{fe},D^{fe})], \quad \forall fe\in FE \end{aligned}$$
(4)

where the convex set \({\mathbb {D}}\) is defined as:

$$\begin{aligned} {\mathbb {D}}^{fe}\doteq ~\left\{ D\in {\mathbb {R}}^{n^{fe}\times d} \mid \left\| d_{k} \right\| _{\ell _{2}}~\le 1,~ \forall k=1,\ldots ,d\right\} . \end{aligned}$$
(5)
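
As a small illustration of Eq. (5), projecting a dictionary onto \({\mathbb {D}}^{fe}\) amounts to rescaling any atom whose \(\ell _{2}\) norm exceeds one. The sketch below assumes NumPy; the function name is illustrative, not from the authors' code.

```python
import numpy as np

def project_dictionary(D):
    """Project each atom (column) d_k of D onto the unit l2 ball, per Eq. (5)."""
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)  # atoms already inside are untouched
    return D / norms
```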

Data \(x^{fe}\) are assumed to come from a finite (unknown) probability distribution \(p(x^{fe})\). A classical projected stochastic gradient algorithm [3] can be used to solve the optimization problem above and gives a sequence of updates for each iteration:

$$\begin{aligned} D^{fe} \leftarrow \Pi _{{\mathbb {D}}^{fe}}[D^{fe}-\rho _{t} \triangledown _{D^{fe}}l(x^{fe}_{t},D^{fe})], \end{aligned}$$
(6)

where \(\rho _{t}\) is the gradient step at time t, and \(\Pi _{{\mathbb {D}}}\) is the orthogonal projector onto the set \({\mathbb {D}}\). The algorithm converges to a stationary point for a decreasing sequence of \(\rho _{t}\) [3, 9]. Note that stochastic gradient descent converges but is not guaranteed to reach a global minimum because the optimization problem is nonconvex [10, 33]. However, experience shows that such stationary points are good enough for practical applications [13, 32].
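
The sketch below illustrates, under simplifying assumptions, how the two optimization steps could be realized: the joint sparse coding of Eq. (3) via ADMM, whose \(\ell _{1,2}\) term yields a row-wise soft-thresholding proximal step, and one projected stochastic gradient update of Eq. (6). This is a didactic sketch, not the authors' implementation; all names, step sizes, and iteration counts are illustrative.

```python
import numpy as np

def prox_l12(A, tau):
    """Proximal operator of tau * ||A||_{1,2}: soft-threshold each row's l2 norm."""
    row_norms = np.linalg.norm(A, axis=1, keepdims=True)
    shrink = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return A * shrink

def joint_sparse_code(x, D, lam1=0.01, lam2=0.0, rho=1.0, iters=100):
    """ADMM sketch for Eq. (3): x is a list of view vectors x^fe,
    D a list of dictionaries D^fe (n^fe x d). Returns A* (d x FE)."""
    FE, d = len(x), D[0].shape[1]
    A = np.zeros((d, FE))                      # reconstruction variable
    Z = np.zeros_like(A)                       # split variable carrying ||.||_{1,2}
    U = np.zeros_like(A)                       # scaled dual variable
    solves = [np.linalg.inv(D[fe].T @ D[fe] + (lam2 + rho) * np.eye(d))
              for fe in range(FE)]             # per-view ridge systems
    for _ in range(iters):
        for fe in range(FE):                   # quadratic step, one column per view
            rhs = D[fe].T @ x[fe] + rho * (Z[:, fe] - U[:, fe])
            A[:, fe] = solves[fe] @ rhs
        Z = prox_l12(A + U, lam1 / rho)        # joint-sparsity step
        U += A - Z                             # dual update
    return Z

def dictionary_step(D, x, A, step=0.01):
    """One projected gradient update of Eq. (6) for every view's dictionary."""
    for fe in range(len(D)):
        residual = x[fe] - D[fe] @ A[:, fe]
        grad = -np.outer(residual, A[:, fe])   # gradient of the data-fit term
        D_new = D[fe] - step * grad
        # Projection onto the set of Eq. (5): rescale atoms with l2 norm above one.
        D[fe] = D_new / np.maximum(np.linalg.norm(D_new, axis=0), 1.0)
    return D
```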

To implement our approach, the discrete wavelet transform (DWT) produces the set \(W_{\mathrm{coe}}=\left\{ lf_{i}^{fe},hf_{i}^{fe} \right\}\), i.e., one low-frequency band lf and one high-frequency band hf for each view, where i denotes the level of the applied wavelet. A sample of the decomposition step is shown in Fig. 3.

Fig. 3 Sample of the decomposition step for two views of the IXMAS dataset

The joint sparse coding is computed using Eq. (3) for \(W_{\mathrm{coe}}\) as follows:

$$\begin{aligned}&\min _{A_{l}=\left[ \alpha ^{lf_{i}^{1}}, \ldots , \alpha ^{lf_{i}^{FE}}\right] } \frac{1}{2}\sum _{z_{f}=lf_{i}^{1}}^{lf_{i}^{FE}}\left\| z_{f} -D^{z_{f}}\alpha ^{z_{f}} \right\| ^{2}_{\ell _{2}} \nonumber \\&\qquad + \lambda _{1}\left\| A_{l} \right\| _{\ell _{1,2}} +\frac{\lambda _{2}}{2}\left\| A_{l} \right\| _{F}^{2}, \end{aligned}$$
(7)
$$\begin{aligned}&\min _{A_{h}=\left[ \alpha ^{hf_{i}^{1}}, \ldots , \alpha ^{hf_{i}^{FE}}\right] } \frac{1}{2}\sum _{z_{f}=hf_{i}^{1}}^{hf_{i}^{FE}}\left\| z_{f} -D^{z_{f}}\alpha ^{z_{f}} \right\| ^{2}_{\ell _{2}} \nonumber \\&\qquad + \lambda _{1}\left\| A_{h} \right\| _{\ell _{1,2}} +\frac{\lambda _{2}}{2}\left\| A_{h} \right\| _{F}^{2}. \end{aligned}$$
(8)

Finally, the corresponding inverse \(W_{\mathrm{coe}}\) transform over \(lf_{i}^{fe}\) and \(hf_{i}^{fe}\), based on \(A_{l}\) and \(A_{h}\), is applied to reconstruct the classifier inputs. The key steps of the feature extraction and fusion stages based on the wavelet transform are listed in Table 1.

Table 1 Key steps of the feature extraction and fusion stages in our approach

Note that instead of Steps 2 and 3, all of the frequencies can be fed into the joint sparse representation simultaneously; the impact of this separation is explored in the experimental results section. A sketch of the overall pipeline is given below.
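
Putting the pieces together, the following sketch mirrors the key steps of Table 1 under the same assumptions as the sketches above (PyWavelets, and the `joint_sparse_code` function from Sect. 3.2): each view is decomposed with a multilevel DWT, the low- and high-frequency sets are fused separately via Eqs. (7) and (8), and the inverse transform of the reconstructions \(D^{z_{f}}\alpha ^{z_{f}}\) produces the classifier inputs. Dictionary dimensions are assumed to match the subband lengths.

```python
import numpy as np
import pywt

def fuse_all_views(views, dicts_low, dicts_high, wavelet='db2', level=3):
    """views: one 1D feature vector per extraction method."""
    lows, highs, detail_lens = [], [], []
    for v in views:                                      # decomposition step
        coeffs = pywt.wavedec(v, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
        lows.append(coeffs[0])                           # lf_i^fe
        highs.append(np.concatenate(coeffs[1:]))         # hf_i^fe (stacked details)
        detail_lens.append([len(c) for c in coeffs[1:]])
    A_low = joint_sparse_code(lows, dicts_low)           # Step 2: Eq. (7)
    A_high = joint_sparse_code(highs, dicts_high)        # Step 3: Eq. (8)
    fused = []
    for fe in range(len(views)):                         # inverse W_coe transform
        lf = dicts_low[fe] @ A_low[:, fe]                # reconstructed low band
        hf = dicts_high[fe] @ A_high[:, fe]              # reconstructed high bands
        details = np.split(hf, np.cumsum(detail_lens[fe])[:-1])
        fused.append(pywt.waverec([lf] + details, wavelet))
    return fused    # one fused feature vector per view, one classifier each
```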

3.3 Classification

For the decision step, the scores of the modal-based classifiers can be combined. The formulation used here simultaneously trains the multimodal dictionaries and the classifiers under the joint sparsity prior. To classify multiview problems and obtain a fair comparison, we use the classifiers proposed in [8]. The classifier builds on the joint sparsity prior to enforce collaboration among the multiple features and obtains the latent sparse codes as optimized features for multiclass classification. The performance of these classifiers is studied in the next section. The final decision can be made in several ways, such as adding the corresponding scores or majority voting. In this study, the sum of the scores of each feature group is used.
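
As a minimal illustration of the decision rule used here (summing the per-view scores), with an illustrative function name:

```python
import numpy as np

def fuse_decisions(scores_per_view):
    """scores_per_view: list of (num_classes,) score vectors, one per classifier.
    Returns the index of the class with the highest summed score."""
    total = np.sum(scores_per_view, axis=0)   # add the corresponding class scores
    return int(np.argmax(total))
```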

4 Experiments

To evaluate the effectiveness of the proposed system, experiments were conducted on three datasets: IXMAS [47], Animal [24], and NUS-Object [11]. The described method was compared with state-of-the-art methods, including JSRC [41], Wang et al. [45], PLRC [17], AFCDL [18], MDL [8], GradKCCA [42], and MvNNcor [48]. The experiments are described in detail in the next subsections.

4.1 Datasets

The performance of the proposed method is explored on the IXMAS [47], Animal [24], and NUS-Object [11] datasets.

IXMAS contains images from five different views and can therefore be viewed as a multiview dataset. Each view has 11 classes, such as cross arms, scratch head, and check watch. The extracted features and the distribution of the training, validation, and test samples are set as in [8].

The Animal dataset contains 30,475 images of 50 animal classes with six feature extraction methods: color histogram (CH), local self-similarity (LSS), pyramid HOG (PHOG), SIFT, color SIFT (RGSIFT), and SURF features. It can thus be considered a multiview dataset. The distribution of the training, validation, and test samples is set as in [48].

NUS-Object has 30,000 images in 31 classes. CH, color correlogram (CORR), edge direction histogram (EDH), wavelet texture (WT), and block-wise color moment (CM) methods are applied to extract features. The distribution of the training, validation, and test samples is set as in [48].

4.2 Experimental setting

The proposed approach was implemented in MATLAB R2019a. All experiments were run on a 64-bit operating system with an E5-2690 CPU and 64.0 GB of RAM. To obtain a fair comparison, we kept all parameters of the fusion method introduced in [8] fixed and tested on the databases; in [8], all parameters were carefully analysed. In the joint sparse representation, the regularization parameter \(\lambda _{1}\) was selected by cross-validation from the set \(\left\{ 0.01 + 0.005t\mid t \in \left\{ -3,\ldots ,3 \right\} \right\}\). The parameter \(\lambda _{2}\) was set to zero in most of the experiments, as proposed in [8]. Due to its performance in anomaly detection, the Daubechies-2 (db2) wavelet was used to decompose the datasets into a series of subbands [23]. Three levels of wavelet decomposition were performed, as using more levels does not affect the detection rates; the effect of the number of levels is analysed in Sect. 4.4.
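
For reproducibility, the decomposition configuration described above can be written out as follows; this is a sketch assuming PyWavelets rather than the MATLAB code actually used, with a hypothetical feature matrix.

```python
import numpy as np
import pywt

# Hypothetical view: 500 samples with 2048-dimensional features.
X_view = np.random.default_rng(1).standard_normal((500, 2048))

# Three-level db2 decomposition along the feature axis, as in the experiments.
coeffs = pywt.wavedec(X_view, 'db2', level=3, axis=1)
# coeffs[0] is the low-frequency subband; coeffs[1:] are the high-frequency
# subbands for levels 3, 2, and 1.
```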

4.3 Results

The proposed method is compared with the other fusion approaches that have been applied to the three datasets. The performance evaluation results (average accuracies) on IXMAS and on Animal and NUS-Object are summarized in Tables 2 and 3, respectively. For the IXMAS dataset, we compare our approach with the joint sparse representation classifier (JSRC) [41], Wang et al. [45], multimodal task-driven dictionary learning (MDL) [8], pairwise linear regression classification (PLRC) [17], and adaptive fusion and category-level dictionary learning (AFCDL) [18]. As shown in Table 2, our approach ranks second in terms of accuracy. Note that our setup for selecting the features and the training and test samples follows [8]; hence, a fully fair comparison with the other approaches is not possible.

The multiview approaches MDL [8], GradKCCA [42], and MvNNcor [48] are compared on the Animal and NUS-Object datasets, which are two challenging datasets.

Table 2 Comparison of average accuracies (%) between different fusion methods and our fusion approach on the IXMAS dataset (best value highlighted in bold)
Table 3 Comparison of average accuracies (%) between the different fusion methods and our fusion approach on the Animal and NUS-Object datasets (best value highlighted in bold)

As shown in Table 3, our approach achieves the best accuracy.

In addition, to evaluate our fusion approach under the joint sparsity method, we consider one of the best state-of-the-art feature-level fusion algorithms, namely multimodal dictionary learning [8]. As shown in both tables, we improve the results of that method significantly. Additionally, to analyse the learned feature space, we use the t-SNE algorithm [43] on the IXMAS dataset to project the samples to two dimensions; as shown in Fig. 4, our approach separates the classes better than [8].

Finally, the typical computation time of our approach, including wavelet subband extraction and solving Eq. (3) for a given multimodal test sample, is compared with that of [8] in Fig. 5 for different dictionary sizes. As the dictionary size increases, the computation time increases linearly for both approaches. The computational cost of our wavelet-based approach is very close to that of [8]; only the wavelet decomposition time is added.

Fig. 4 Visualizations of (a) the original data, (b) MDL [8], and (c) our approach using t-SNE on the IXMAS dataset

Fig. 5 Computational time for wavelet decomposition and solving the optimization problem of Eq. (3) for a given test sample, compared with [8]

4.4 Impact of decomposition levels

To investigate the effect of the wavelet transform on the classification rates, we perform a series of experiments on the datasets by varying the decomposition level from 1 to 4. The resulting classification rates are reported in Table 4.

Table 4 Comparison of impact of wavelet transform levels (best values highlighted in bold)

The results show no improvement beyond the third level; the best results are obtained when the features are decomposed into three levels.

4.5 Impact of Steps 2 and 3

The final experiments are dedicated to exploring our fusion step. To study its impact, we consider classifier inputs in four states that can occur during fusion. In the first state, only the low frequencies were fed into the classifiers after their fusion; this process was repeated for the second state with the high frequencies. In the third state, both the low- and high-frequency sets were fused simultaneously without an inverse step. Finally, these states were compared with our approach, as shown in Table 5. The results show that our approach performs best. Additionally, the high frequencies alone do not obtain good results when fed into the classifiers; however, they improve the results significantly when used in parallel with the low frequencies.

Table 5 Comparison of impact of our fusion approach (best values highlighted in bold)

5 Conclusion

This paper proposed a novel fusion approach based on joint sparse representation using the wavelet transform, in which the views are separated into low- and high-frequency sets. We considered three datasets that include data with multiple views. In the first step, the wavelet transform was applied to the different views to obtain approximation and detail coefficients. Then, a fusion step at the feature level was performed using joint sparse representation, with the low and high frequencies fed into the fusion method separately. Using the inverse discrete wavelet transform, we reconstructed a new space from both the low and high frequencies after the fusion step. To make a decision, the output of this step was fed into classifiers. The presented approach was tested on the three datasets, and its results were generally better than the state-of-the-art results on these datasets.

Based on our experiments, we can claim that separating features improves the results of fusion methods. As future work, we therefore aim to adapt our approach to other fusion methods. Additionally, we aim to extend the model to a deep approach when the size of the feature vectors is sufficient for this purpose. Other filter banks can also be applied and compared with the wavelet transform.