Mining Sufficient Knowledge via Progressive Feature Fusion for Efficient Material Recognition

Material images are susceptible to changes, depending on the light intensity, visual angle, shooting distance, and other conditions. Feature learning has shown great potential for addressing this issue. However, the knowledge achieved using a simple feature fusion method is insufficient to fully represent the material images. In this study, we aimed to exploit the diverse knowledge learned by a novel progressive feature fusion method to improve the recognition performance. To obtain implicit cross-modal knowledge, we perform early feature fusion and capture the cluster canonical correlations among the state-of-the-art (SOTA) heterogeneous squeeze-and-excitation network (SENet) features. A set of more discriminative deep-level visual semantics (DVSs) is obtained. We then perform gene selection-based middle feature fusion to thoroughly exploit the feature-shared knowledge among the generated DVSs. Finally, any type of general classifier can use the feature-shared knowledge to perform the final material recognition. Experimental results on two public datasets (Fabric and MattrSet) showed that our method outperformed other SOTA baseline methods in terms of accuracy and real-time efficiency. Even most traditional classifiers were able to obtain a satisfactory performance using our method, thus demonstrating its high practicality.


Introduction
Material recognition [1][2][3][4][5] has become a popular topic in the field of computer vision (CV). It has significant value in many practical scenarios, such as scene recognition [6], industrial inspection [7], medical image recognition [8], and robot vision [9]. For example, self-driving cars or mobile robots need to determine whether an incoming terrain is asphalt, gravel, or grass. Cleaning robots must distinguish between wood, tiles, and carpets. Clearly, material is an important visual cue that is widely present on the surface of natural objects and has abundant visual diversity. However, it is difficult for humans to distinguish these subtle differences. A computer-aided material recognition model can address this issue well and applies to the aforementioned practical fields; however, it is still a great challenge, as material recognition is normally affected by several external factors, including light intensity, visual angle, shooting distance, and other conditions. Recently, feature learning has shown great potential for alleviating this problem.
Tan et al. [10] used a self-designed scatterometer to measure the bidirectional reflectance distribution function (BRDF) of four differently textured fabrics and analyzed the effects of surface texture and illumination wavelength. On this basis, to accomplish effective material recognition, they proposed a fast but simple BRDF model. Han et al. [11] transformed the original stimulus into a differential excitation domain according to Weber's law, and then explored a local patch, called a microtexton, in the transformed domain as the Weber local descriptor (WLD). Furthermore, the WLD space was characterized adaptively using a generative probability model, which can produce a more discriminative feature representation for material recognition. Tanaka et al. [12] used the depth distortion of time-of-flight measurements as features and achieved material classification of a scene; their method can classify visually similar objects. Erickson et al. [13] proposed a multimodal sensing technique to capture spectral and close-range high-resolution texture information. They combined this high-resolution texture information with the output from a head-mounted camera on a robot to achieve accurate material classification. These methods [10][11][12][13] mainly focus on capturing the visual variations in material images. However, obtaining such spectral reflectance is time-consuming and error-prone.
Recently, deep learning features have played an important role in material recognition. Shahriari [14] used deep neural networks with variable depths to learn the scales, orientations, and resolutions of texture filter banks for effective material recognition. The corresponding computational cost was greatly reduced, and their method could extract very deep features through distributed computing architectures. Jiang et al. [1] also proposed a deep learning-based method that combines both traditional and dilated convolutional features to achieve pixel-level material recognition; in their method, heterogeneous convolutional features are used to remove artifacts. Schwartz and Nishino [2] probed the human visual perception of materials. They compared pairs of image patches by asking annotators questions, obtaining the weak supervision required to build a set of attribute classifiers. Furthermore, they integrated their method into a convolutional neural network (CNN) that can simultaneously recognize materials and visual attributes. These deep learning methods [1,2,14] usually require massive amounts of data to train effective recognition models. Owing to the high cost of annotation, high-quality training samples are very scarce in the field of material recognition.
Moreover, some state-of-the-art (SOTA) feature fusion methods, such as multiset discriminant correlation analysis (MDCA) [15], gene selection eXtreme gradient boosting (GS-XGBoost) [16], gene selection adaptive boosting (GS-AdaBoost) [16], and hierarchical multifeature fusion (HMF²) [17], have been proposed to make full use of the limited training data. The MDCA model uses a single-layer early feature fusion algorithm for material recognition; however, the discriminative information is not balanced between the input and output interfaces of the model. To address this problem, the HMF² model implements multilayer early feature fusion using residual connections. It modifies the traditional MDCA model [15] and fully mines cross-modal knowledge among the heterogeneous features. More importantly, inspired by the well-known residual structure [18], a hierarchical feature fusion strategy was proposed in HMF² to gather effective knowledge for material recognition. However, the HMF² model requires numerous features and multiple fusion operations, which may reduce its real-time efficiency. Unlike these early feature fusion methods [15,17], GS-AdaBoost and GS-XGBoost perform middle feature fusion. They both use traditional features, including the scale-invariant feature transform (SIFT) [19], Gist [20], and the local binary pattern (LBP) [21], together with deep learning-based VGG16 features [22], to complete the material recognition. However, these two GS models ignore the beneficial early-stage cross-modal knowledge among the heterogeneous features. The abovementioned feature learning approaches [1,2,[10][11][12][13][14][15][16][17] have shown remarkable progress in material recognition. However, the knowledge achieved by a simple feature fusion method is insufficient to completely represent material images, and few studies have exploited the implicit but valuable knowledge learned through progressive feature fusion.
To address these issues, we propose a novel but effective progressive feature fusion method called SECGS that can seamlessly combine early and middle feature fusions together for material recognition. In our model, "SE" represents the well-known squeeze-and-excitation network (SENet) [23].
This means that we adopt only the SOTA SENet features for progressive feature fusion. "C" indicates that the SECGS model first mines the implicit cross-modal knowledge through cluster canonical correlation analysis [24]. This is a crucial early feature fusion operation, as it not only creates a group of deep-level visual semantics (DVSs) but also serves as the basis for the subsequent middle feature fusion. "GS" indicates that we further capture the feature-shared knowledge within the implicit cross-modal knowledge using gene selection. Finally, deep-level but sufficient knowledge is mined for training the material recognition model. The main contributions of this study are summarized as follows: (1) We proposed a novel progressive feature fusion approach that can mine sufficient and effective knowledge among the heterogeneous SENet features for material recognition.
(2) We conducted extensive experiments on two benchmark datasets. The results showed the superior performance of our model over other SOTA methods. With our approach, even traditional classifiers were able to achieve satisfactory performance. The code for our method is available at https://github.com/Danicaghost/SECGS.
(3) We created multiple variants of the SECGS model. They are all effective for material recognition, which further validates the robustness of the proposed progressive feature fusion method. (4) We developed a real-time material recognition system based on our model and deployed it on a laptop. All we need is a mobile phone with a camera to complete the material recognition. This helps narrow the gap between theoretical research and practical applications. A demo video has also been uploaded to the above GitHub link.

Material Recognition.
Material recognition is an important research direction in the field of CV. Some researchers have used the spectral reflectance of material surfaces to characterize materials [10,[25][26][27][28][29]. However, the acquisition of spectral reflectance is time-consuming and error-prone, which may reduce the practicality of the material recognition model. Therefore, some researchers have focused on constructing local feature descriptors (such as the WLD) [30], on using reflection-based edge features [31], or on adopting CNN features. Feature learning helps improve the accuracy of material recognition. For example, Cimpoi et al. [32] calculated CNN-based Fisher vector (FV) [33] features and achieved the best performance on the well-known Flickr material database [34]. Bell et al. [35] combined a set of pretrained CNNs using a fully connected conditional random field model to predict the material label of each pixel. Schwartz and Nishino [2] introduced a material attribute-category CNN that considers implicit attribute information for material recognition.
Bruna and Mallat [36] proposed scattering convolution networks (ScatNet) that take Gabor or Haar wavelets as filters for material recognition. Sifre and Mallat [37] proposed a rotation- and scale-invariant feature extraction method for robust material classification. Inspired by ScatNet, Chan et al. [38] utilized local image patches as principal component analysis filters to complete the material recognition. The aforementioned studies [2,[32][33][34][35][36][37][38] mainly adopted a certain layer of a CNN model, without considering the implicit complementary information among the heterogeneous layers of a homologous neural network. This motivated us to mine the complementary knowledge among the heterogeneous SENet features, which helps improve the final recognition performance. Table 1 lists the recent major material recognition methods, which we roughly divided into four categories: "traditional feature": traditional methods using mathematical calculations; "CNN-based feature": methods using deep learning features; "ETE": end-to-end methods; and "feature fusion": feature fusion methods (see the next subsection to learn more about "feature fusion").

Feature Fusion.
As shown in Table 1, feature fusion has shown great potential for material recognition. Current feature fusion methods can be divided into early, middle, and late fusion. Early feature fusion is performed at the raw feature level: a group of heterogeneous features is first mapped into a semantic space, and fusion is then performed by feature summation or feature concatenation. Representative early feature fusion models include canonical correlation analysis [39], deep canonical correlation analysis [40], multiple canonical correlation analysis [41], cross-modal correlation learning [42], kernel discriminant correlation analysis [43], and consistent discriminant correlation analysis [44]. Early feature fusion is intended to mine low-level correlations among different features.

Unlike early feature fusion, middle feature fusion builds a more powerful classification model using a group of linearly weighted features. Representative middle feature fusion models include multiple kernel learning [45], efficient range gene selection [16], and multiple kernel boosting [46]. Middle feature fusion attempts to mine the implicit feature-shared knowledge among the heterogeneous features. Finally, late feature fusion integrates the multiple predictions of a set of weak classifiers to build a more powerful classification model. Representative late feature fusion methods include AdaBoost [47], XGBoost [48], categorical boosting (CatBoost) [49], and the light gradient boosting machine (LightGBM) [50]. Unlike early and middle feature fusion, late feature fusion takes full advantage of the complementarity among different classifiers.
Recently, an increasing number of feature fusion methods have played an important role in material recognition (Table 1). The MDCA model uses a single-layer early feature fusion method for material recognition [15]. However, the effective discriminative information is not balanced between the input and output interfaces of the MDCA model. To resolve this problem, the HMF² model implements multilayer early feature fusion using residual connections [17]. It can fully mine the cross-modal knowledge among a group of heterogeneous image features. However, the HMF² model requires numerous features and multiple fusion operations, which may reduce its real-time efficiency. Moreover, it ignores the feature-shared knowledge and may thus miss some important material knowledge for effective material recognition. Unlike the early feature fusion approaches, the GS-AdaBoost and GS-XGBoost models perform middle feature fusion using SIFT, Gist, LBP, and VGG16 features to complete the material recognition [16]. They can fully capture feature-shared knowledge to improve the performance of the final recognition. However, these two GS models ignore the early-stage correlations among the heterogeneous features. In other words, the aforementioned feature fusion methods usually employ a single, independent feature fusion strategy, and only a few studies have explored a progressive relationship between different feature fusion approaches for more effective material recognition.
In summary, a novel progressive feature fusion framework that can make full use of different fusion methods is required to resolve the abovementioned issues. Such a progressive feature fusion method can mine sufficient and effective knowledge among the heterogeneous SENet features for high-quality material recognition.

Our Approach
The overall scheme of the proposed progressive feature fusion approach is illustrated in Figure 1. In this section, we first present the heterogeneous SENet features; five types of heterogeneous SENet features are used in our approach. Based on the SENet features, we then propose a progressive feature fusion approach. First, we mine the cross-modal knowledge among the heterogeneous SENet features using early feature fusion and generate a group of DVSs with greater discriminative ability. Through permutation and combination, a total of 10 different DVSs are generated. These 10 DVSs extract the most effective discriminative information from the heterogeneous SENet features and are more powerful than the original SENet features. Then, we progressively mine the implicit feature-shared knowledge among these DVSs using middle feature fusion. The feature-shared knowledge is used to train our material recognition model. "Feature-shared" means that the discriminative information is complementary between different features; hence, in this stage, feature-shared knowledge among different DVSs is shared across all features to complete the middle feature fusion and obtain better recognition performance. Finally, the training procedure is described below. In addition, in this study, we used two benchmark datasets, namely, Fabric and MattrSet, to complete all experiments. The Fabric dataset [51] is a fine-grained dataset and includes images of clothes made from any of these 9 materials: cotton, wool, terrycloth, fleece, nylon, silk, denim, viscose, and polyester. It contains 5064 images. The MattrSet dataset [16] is a coarse-grained dataset that includes images of bags (polyurethane (pu), canvas, nylon, and polyester) and shoes (pu and canvas). It contains 11,021 images (see Tables 2 and 3 for more details about these two datasets).
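For concreteness, the 10 DVSs correspond to the unordered pairs of the five SENet feature types described below; a minimal sketch of this enumeration (feature names as in the text):

```python
from itertools import combinations

# The five heterogeneous SENet feature types used in the paper
senet_features = ["SR50", "SR101", "SR152", "SRXt50", "SRXt101"]

# Each unordered pair of heterogeneous features yields one DVS
# via early feature fusion, giving C(5, 2) = 10 DVSs in total.
dvs_pairs = list(combinations(senet_features, 2))
print(len(dvs_pairs))  # 10
```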
(Fragment of Table 1: [32] — Describable Textures dataset, 74.70; FV-CNN [32] — KTH-TIPS2b dataset, 81.80; IFV + VGG [14] — KTH-TIPS2b dataset, 81.50; IFV + DFB [14] — KTH-TIPS2a dataset, 88.60; IFV + DFB [14] — Flickr material dataset, 82.70; IFV + DFB [14] — Describable …)

Heterogeneous SENet Features.
As discussed above, deep learning features [1,2,14] can better characterize material images. Current mainstream CNN models usually improve performance by exploiting the spatial dimension. Unlike the models in [52,53], the SENet model [23] can adaptively recalibrate channel-wise feature responses by explicitly modeling the interdependencies between different channels; moreover, it pays more attention to the features that contain the most effective information. Hence, SENet features outperform ResNet features in most CV tasks, and we chose SENet rather than ResNet as the basis for the subsequent progressive feature fusion (extensive experiments also validated this selection; see Table 4). In this study, we used five types of heterogeneous SENet layers, namely, SE-ResNet 50 (SR50), SE-ResNet 101 (SR101), SE-ResNet 152 (SR152), SE-ResNeXt 50 (SRXt50), and SE-ResNeXt 101 (SRXt101). They contain abundant but complementary semantic information; that is, these heterogeneous features point to the same or similar material knowledge, so better recognition performance can be achieved. Considering these factors, we attempt to make full use of the complementary knowledge among the SENet features through the progressive feature fusion approach.

Progressive Feature Fusion.
Our progressive feature fusion approach is illustrated in Figure 2. The heterogeneous SENet features point to the same or similar material knowledge. Inspired by MDCA [15] and HMF² [17], we first obtain the implicit cross-modal knowledge from these SENet features using early feature fusion. We extract the average-pooling layer (2048 dimensions) of the five heterogeneous SE modules introduced above. Then, we analyze the cluster canonical correlations among all the SENet features and generate 10 DVSs. These DVSs fully mine the effective discriminative information among the heterogeneous SENet features: the semantic distance between samples of the same class shrinks, whereas the semantic distance between samples of different classes gradually expands. Hence, the classifier can better distinguish different material categories and obtain better recognition performance (see Section 5.4). The material dataset is defined as $D = \{(x_n, y_n)\}$, $x_n \in \mathbb{R}^k$, $y_n \in C$, $n = 1, 2, \ldots, N$. Here, $N$ is the number of samples, and $y_n$ is the material label of sample $x_n$. $C = \{c_1, c_2, \ldots, c_l\}$ is the set of all material labels, $c_r \in C$, and $l$ is the number of material categories. $X$ and $Y$ are two heterogeneous SENet features, which we divide into $X_1^*, Y_1^*$ (for training) and $X_2^*, Y_2^*$ (for testing). $X_{c_r}$ and $Y_{c_r}$ denote the data of $X_1^*$ and $Y_1^*$ in class $c_r$, respectively, and $\omega$ and $\nu$ are the projection vectors of $X_1^*$ and $Y_1^*$. The correlation coefficient $\rho$ between $X$ and $Y$ is computed as
$$\rho = \frac{\omega^{\top} C_{XY}\, \nu}{\sqrt{\omega^{\top} C_{XX}\, \omega}\, \sqrt{\nu^{\top} C_{YY}\, \nu}}, \qquad (1)$$
where $C_{XY}$, $C_{XX}$, and $C_{YY}$ are the covariance matrices computed over the class-paired training data, and $S = \sum_{r=1}^{l} |X_{c_r}||Y_{c_r}|$ is the total number of paired relationships between $X_1^*$ and $Y_1^*$. According to $\omega$ and $\nu$, $X_1^*$, $X_2^*$, $Y_1^*$, and $Y_2^*$ are transformed into a set of mapped feature matrices.
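As a rough illustration (not the authors' implementation, and using plain CCA rather than the cluster variant), projection vectors $\omega$ and $\nu$ that maximize $\rho$ can be obtained by whitening the per-view covariances and taking the singular vectors of the whitened cross-covariance:

```python
import numpy as np

def cca_projections(X, Y, reg=1e-6):
    """Plain CCA via whitening + SVD; a stand-in sketch for the
    cluster canonical correlation analysis used in the paper."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each view, then SVD of the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, rho, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    omega = Wx.T @ U       # projection directions for X (columns)
    nu = Wy.T @ Vt.T       # projection directions for Y
    return omega, nu, rho  # rho holds the canonical correlations

# Two synthetic "views" sharing a latent signal: the leading
# canonical correlation recovered should be close to 1.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
X = Z @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
Y = Z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(500, 6))
omega, nu, rho = cca_projections(X, Y)
print(rho[0])  # leading canonical correlation, close to 1
```

By construction, $\omega^{\top} C_{XX}\, \omega = I$ and $\omega^{\top} C_{XY}\, \nu$ is diagonal, matching the correlation objective in equation (1).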
Then, $U$ and $V$ are constructed from the mapped feature matrices. Finally, we fuse $U$ and $V$ to generate the 10 DVSs for performing middle feature fusion progressively. Unlike HMF², we perform early feature fusion only once and adopt only the SENet features rather than numerous different features. As shown in Figure 2, to obtain the feature-shared knowledge from the DVSs, we perform middle feature fusion through gene selection. First, the DVSs are combined to form a group of DVS combinations. Second, we employ a general classifier to predict the estimated probability of each DVS in any DVS combination. Then, the corresponding gene selection weight of each DVS is calculated, and a sum function is used to fuse the weighted estimated probabilities. Finally, we use a max decision function to generate the final recognition result. In equation (7), $r + dr$ and $r - dr$ represent the upper and lower boundaries, respectively, of the effective range of $DVS_i$ for the samples of class $c_r$; $u_{dr}$ and $\sigma_{dr}$ represent the mean value and standard deviation of $DVS_i$, respectively, for the samples of $c_r$; $p_r$ $(0 < p_r < 1)$ is the estimated probability of the samples of $c_r$; and $(1 - p_r)$ reduces the impact of the standard deviation $\sigma_{dr}$ on the upper and lower boundaries of the effective range. The constant $c = 1.732$ is derived from Chebyshev's inequality.
We then compute the overlap area $OL_d$ and the corresponding coefficient $OC_d$ of $DVS_i$; the gene selection weight $w_d$ of $DVS_i$ is calculated according to $OC_d$, and the probability $p_{dk}$ is weighted by this weight. Lastly, we use a max decision function to generate the final recognition result: the material label is determined according to $\max(p_d) = \max\{\mathrm{sum}(p_{dk} \cdot w_d)\}$. In summary, the implicit complementary knowledge among the heterogeneous SENet features is fully mined by the proposed progressive feature fusion method. The complete SECGS procedure is shown in Algorithm 1.
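The gene selection step can be illustrated roughly as follows. This is a hedged sketch: the exact formulas for $OL_d$, $OC_d$, and $w_d$ are not reproduced in this excerpt, so the overlap coefficient and the weight $w_d = 1 - OC_d$ below are simplified assumptions; only the effective-range bounds follow the description in the text.

```python
import numpy as np

C = 1.732  # constant derived from Chebyshev's inequality (paper)

def effective_range(scores, p):
    """Effective range of one DVS for one class:
    [u - (1 - p) * C * sigma, u + (1 - p) * C * sigma]."""
    mu, sigma = scores.mean(), scores.std()
    dr = (1.0 - p) * C * sigma
    return mu - dr, mu + dr

def overlap_coefficient(r1, r2):
    """Simplified overlap coefficient between two class ranges:
    shared length divided by union length (an assumption)."""
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    shared = max(0.0, hi - lo)
    union = max(r1[1], r2[1]) - min(r1[0], r2[0])
    return shared / union if union > 0 else 0.0

# Toy example: a DVS whose class score distributions barely overlap
# is discriminative, so it receives a large weight w_d = 1 - OC_d.
rng = np.random.default_rng(1)
class_a = rng.normal(0.0, 0.5, 200)
class_b = rng.normal(4.0, 0.5, 200)
ra = effective_range(class_a, p=0.9)
rb = effective_range(class_b, p=0.9)
w = 1.0 - overlap_coefficient(ra, rb)
print(round(w, 3))  # close to 1: strongly discriminative DVS
```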

Training Procedure.
Owing to its generality, the proposed SECGS model can use any type of general classifier to complete the final prediction. e proposed model was trained using four-fold cross-validation. In addition, the trained SECGS model was deployed on a laptop with the following configuration: Intel Core i7-7700 CPU (2.8 GHz), 16 GB of RAM, and GTX 1050 GPU. Hence, online real-time material recognition can be implemented easily on a laptop with a similar or higher configuration.
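The four-fold cross-validation used for training can be sketched with a plain index split (a generic sketch, not the authors' code):

```python
import numpy as np

def four_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 4-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, 4)
    for i in range(4):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# Each sample appears in exactly one test fold across the four runs.
splits = list(four_fold_splits(5064))  # Fabric dataset size (paper)
print(len(splits))  # 4
```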

Experiments and Results
In this section, we evaluate the effectiveness, efficiency, and robustness of our approach on two benchmark datasets: (1) the Fabric and (2) the MattrSet datasets.

Datasets.
The Fabric dataset is a fine-grained dataset constructed by Kampouris et al. [51], who used photometric stereo sensors to collect clothing samples from physical stores and determined the final material according to the clothing label. It contains images of clothes made from any of these 9 materials: cotton, wool, terrycloth, fleece, nylon, silk, denim, viscose, and polyester. The Fabric dataset contains 1266 samples collected under different light intensities, visual angles, and other conditions. Each sample includes four images; thus, the dataset has a total of 5064 images (Kampouris et al. cropped all the images to 400 × 400 pixels to avoid blurring the image edges). Table 2 provides more detailed information about the Fabric dataset. The MattrSet dataset [16] is derived from real commodities on the web and was constructed under the guidance of an experienced material expert. It is a coarse-grained dataset that includes images of bags (polyurethane (pu), canvas, nylon, and polyester) and shoes (pu and canvas). It contains 11,021 images, with resolutions ranging from 123 × 123 to 4300 × 2867, which makes material recognition more challenging. Table 3 provides more detailed information about the MattrSet dataset.

Baselines.
We used five general classifiers, namely, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), AdaBoost (Ada), and XGBoost (XGB), to train the proposed SECGS model (e.g., "SECGS 5-LR" means that the LR classifier uses five DVSs to implement middle feature fusion). We compared our model with the baseline models listed below. The goal of this study was to obtain better recognition performance, via progressive feature fusion, on both fine-grained and coarse-grained material images; hence, our work can be regarded as a type of fine-grained image classification task, and we therefore also compared the proposed model with some recently proposed fine-grained image classification models: (1) well-known late feature fusion models: AdaBoost [47], XGBoost [48], CatBoost [49], and LightGBM [50] (for a fairer comparison, we also used the DVSs generated by our method to train these models); (2) recently proposed material recognition models: GS-XGB [16], GS-Ada [16], SVM-S [17], MDCA-XGB [17], HMF² [17], and Kampouris's Albe + Norm [53]; (3) recently proposed deep learning models: InceptionResNetV2 [54], DenseNet169 [55], SENet [23], and ResNeXt WSL [56]; (4) well-known fine-grained classification models: DCL [57], NTS-Net [58], PMG [59], WILDCAT [60], WSCNet [61], and WS-DAN [62]. All these models were trained using four-fold cross-validation with an initial learning rate of 1e−3 and a batch size of 16, for 40 epochs. In addition, the SEC(h) and SEC(s) variants of our model also use the progressive feature fusion method; we included them to validate the robustness of this method and to demonstrate the effectiveness of the "GS" component.

Evaluation Metrics.
We employed accuracy (Acc.) (equation (12)) and precision (equation (13)) to evaluate our model, where TP denotes true positives; FP, false positives; TN, true negatives; and FN, false negatives. Moreover, we designed a new metric, the average accuracy for each feature (AA_F) (equation (14)), to comprehensively evaluate each feature. Similarly, we propose a new metric, the average accuracy for each classifier (AA_C) (equation (15)), to comprehensively evaluate each classification model. Finally, we used the mean average accuracy (MAA) metric (equation (16)) to evaluate the overall performance of the proposed model:
$$\mathrm{precision} = \frac{TP_{all}}{TP_{all} + FP_{all}}. \qquad (13)$$
Moreover, as shown in equation (17), ΔAccuracy₁ calculates the absolute performance improvement of the AA_F metric of the SENet features relative to that of the ResNet features.
This metric can evaluate the effectiveness of the SENet features. As shown in equation (18), ΔAccuracy₂ calculates the absolute performance improvement of the AA_F metric of the DVSs relative to that of the SENet features; this metric evaluates the effectiveness of the generated DVSs.

Algorithm 1 (excerpt): (ii) Divide the two features into the training and testing data. (iii) Use the training data to perform a cluster canonical correlation analysis and obtain the projection vectors ω and ν. (iv) Use ω and ν to weight the training and testing data, respectively, so that U and V are constructed using equation (5). (v) Use equation (6) to obtain the corresponding DVSs according to the two features. Repeat until every pair of features in S₁ has been traversed.

Evaluation of the SEC Model.
Table 5 shows the accuracy of the SEC model that uses different DVSs. These modules can be applied to different classification models. Similar results were also obtained on the MattrSet dataset, which confirms the robustness of the SENet features. Certainly, a larger performance improvement was obtained on the Fabric dataset owing to the fine-grained visual information contained in its samples. According to ΔAccuracy₂, the maximum improvement of the DVSs relative to the SENet features on the Fabric dataset is 3.35%, and the corresponding average improvement is 2.48%. Similar performance improvements were also obtained on the MattrSet dataset. These results indicate that the proposed DVSs are effective for material recognition on either dataset and that the proposed early feature fusion method is effective and robust. Moreover, the DVS (SR50-SRXt101) achieved the best accuracy on each dataset.
This indicates that sufficient and effective cross-modal knowledge can be mined among the heterogeneous SR50 and SRXt101 features. Owing to this characteristic, even traditional classifiers, including LR and SVM, were able to obtain satisfactory recognition performance. It is worth noting that the corresponding performance on the Fabric dataset was better than that on the MattrSet dataset because the former contains much more valuable visual semantics, whereas the latter contains a certain amount of noise (Tables 2 and 3).
According to AA_F and MAA in Table 6, we can assess the overall recognition performance of the different features from a statistical perspective, from which several meaningful conclusions can be drawn. According to AA_F, the proposed DVSs easily outperformed the SENet and ResNet features on each dataset, with a larger performance improvement on the Fabric dataset. Hence, the proposed DVSs are robust and effective for material recognition. The MAA metric further validates this conclusion; for example, on the Fabric dataset, a performance margin of 41.33% was obtained between the DVSs and the ResNet features. These results clearly show that there is indeed a large amount of complementary information among the heterogeneous SENet features and that the proposed early feature fusion method can capture this information to improve the recognition performance. In addition, even traditional models, such as LR, KNN, and SVM, achieved better performance after early feature fusion owing to the powerful discriminative ability of the proposed DVSs. In summary, the DVSs generated by the proposed early feature fusion method are effective and robust for material recognition, providing a solid basis for the subsequent middle feature fusion. Sufficient cross-modal knowledge among the heterogeneous SENet features was mined to improve the recognition performance. Moreover, all the SENet features achieved satisfactory performance, which firmly supports our original selection.
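As an interpretation sketch of the aggregate metrics used in these comparisons (assuming, from their names, that AA_F averages accuracies over classifiers for a fixed feature, AA_C averages over features for a fixed classifier, and MAA averages the AA values; equations (14)-(16) are not reproduced in this excerpt):

```python
import numpy as np

# Hypothetical accuracy table: rows = features, columns = classifiers
acc = np.array([
    [0.80, 0.82, 0.78],   # feature 1
    [0.85, 0.88, 0.84],   # feature 2
])

aa_f = acc.mean(axis=1)   # one AA_F value per feature
aa_c = acc.mean(axis=0)   # one AA_C value per classifier
maa = aa_f.mean()         # overall mean average accuracy
print(aa_f, aa_c, round(maa, 4))
```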

Evaluation of the SEC(s) and SEC(h) Models.
In this section, we evaluate two variants of the SECGS model, namely, SEC(h) and SEC(s). We want to determine whether the voting strategy contributes to improving recognition performance. More importantly, we intend to validate the robustness of the proposed progressive feature fusion method. In addition, we want to demonstrate the effectiveness of the "GS" component from another perspective. Figure 3 shows the AA C values of the top three DVSs combined with the corresponding voting strategy.
As shown in Figure 3(a), the SEC(h) model that uses SR101-SRXt101 with the top three classifiers achieved the best accuracy (91.51%) on the Fabric dataset. It outperformed the original DVS without voting by a large margin.
Thus, the number of classifiers for the final voting should be tuned carefully. Similarly, as shown in Figure 3(b), the SEC(s) model that uses SRXt101-SR152 with the top two classifiers obtained the best accuracy (93.19%) on the Fabric dataset. It also outperformed the original DVS without voting and the best SEC(h) model. Therefore, the soft voting strategy is better suited to the Fabric dataset.
As shown in Figure 3(c), the SEC(h) model that uses SR50-SRXt101 with all the classifiers obtained the best accuracy on the MattrSet dataset, easily outperforming the original DVS without voting. Similarly, as shown in Figure 3(d), the SEC(s) model that uses SR50-SRXt101 with all the classifiers achieved the best accuracy. Unlike on the Fabric dataset, more classifiers are required to complete the voting-based late feature fusion on the MattrSet dataset owing to the much larger amount of noise in its samples; more heterogeneous classifiers can make a more definite prediction. In addition, a larger performance improvement can be obtained on the MattrSet dataset with a soft voting strategy. This further indicates that the recognition of coarse-grained material images remains a significant challenge. In general, the soft voting strategy is also better suited to the MattrSet dataset.
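The hard and soft voting used by SEC(h) and SEC(s) can be sketched generically (toy probabilities; not the authors' implementation):

```python
import numpy as np

def hard_vote(labels):
    """SEC(h)-style majority vote over per-classifier label predictions."""
    vals, counts = np.unique(labels, return_counts=True)
    return vals[np.argmax(counts)]

def soft_vote(probas):
    """SEC(s)-style vote: average class probabilities, then argmax."""
    return int(np.mean(probas, axis=0).argmax())

# Three classifiers, three classes (toy numbers)
labels = np.array([0, 2, 2])
probas = np.array([
    [0.50, 0.10, 0.40],
    [0.20, 0.30, 0.50],
    [0.25, 0.15, 0.60],
])
print(hard_vote(labels))  # 2
print(soft_vote(probas))  # 2 (mean probability of class 2 is highest)
```

Soft voting keeps the classifiers' confidence information, which is consistent with its stronger results on both datasets.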
In summary, both the SEC(h) and SEC(s) models achieved satisfactory performance on each material dataset.
This also validates the robustness of the proposed progressive feature fusion method from another perspective. Hence, the voting-based late feature fusion method is another useful substitute for the proposed middle feature fusion method. Unlike HMF2 [17], the SEC model only adopts the SENet features, which helps reduce its complexity.
This is an important step in narrowing the gap between theoretical research and practical applications.
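The hard and soft voting strategies evaluated above can be sketched as follows. This is an illustrative helper, not the paper's implementation; it assumes each base classifier supplies either label predictions (for SEC(h)) or class-probability estimates (for SEC(s)):

```python
import numpy as np

def hard_vote(pred_labels):
    """SEC(h)-style majority vote over per-classifier labels, shape (n_clf, n_samples)."""
    pred_labels = np.asarray(pred_labels)
    voted = []
    for column in pred_labels.T:                   # one column per sample
        values, counts = np.unique(column, return_counts=True)
        voted.append(values[np.argmax(counts)])    # most frequent label wins
    return np.array(voted)

def soft_vote(pred_probas):
    """SEC(s)-style vote: average class probabilities, shape (n_clf, n_samples, n_classes)."""
    return np.asarray(pred_probas).mean(axis=0).argmax(axis=1)
```

Soft voting retains each classifier's confidence rather than only its decision, which is consistent with its stronger performance on the noisier MattrSet samples.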

Evaluation of the SECGS Model.
In this section, we perform a more comprehensive evaluation of the proposed SECGS model. Table 7 shows the corresponding experimental results. We show the performance variations with the increase in the number of DVSs on each dataset. For example, "2-SVM" means that the SVM classifier uses two DVSs to implement the middle feature fusion. "Best Acc." represents the best accuracy among all the DVS combinations. "Δmax" represents the corresponding performance improvement of the SECGS model compared with the best SEC model described above. Some important conclusions can be drawn from Table 7.
(1) On the MattrSet dataset, the corresponding performance rank of the different classifiers in descending order was "SVM > XGBoost > LR > AdaBoost > KNN." On the Fabric dataset, it was "LR > KNN > XGBoost > AdaBoost > SVM." These rankings can help us select the optimal classifier for the SECGS model. The SECGS model prefers SVM on the coarse-grained MattrSet dataset, whereas it prefers LR on the fine-grained Fabric dataset. A similar result can be seen in Table 5. Thus, owing to the progressive feature fusion method, sufficient and effective knowledge was mined, enabling even the traditional classifiers to achieve satisfactory recognition performance.
(2) We found that, regardless of the number of DVSs, the SVM classifier always achieved the best performance on the MattrSet dataset, and the LR classifier on the Fabric dataset. These findings are further illustrated in Figure 4. They indicate that the most suitable classifier should be chosen carefully for each material dataset to achieve the largest performance improvement.
(3) According to Δmax, the proposed middle feature fusion approach outperformed the voting-based late feature fusion approach. The feature-shared knowledge was fully mined using our progressive feature fusion model. Evidently, the proposed "GS" method is more effective than the voting-based late feature fusion method. Unlike the SEC model, our model adopts only one classifier. In addition, larger performance improvements were obtained on the fine-grained Fabric dataset. A similar conclusion can be drawn from the above subsection "Evaluation of the SEC Model." Interestingly, owing to its different implementation mechanisms, the proposed SECGS model will be a powerful supplement to the well-known HMF2 model (Table 8). As Figure 4 further illustrates, the proposed middle feature fusion strategy only requires careful tuning of the number of DVSs to capture sufficient feature-shared knowledge for more effective material recognition.
In conclusion, the proposed middle feature fusion approach outperformed the voting-based late feature fusion approach. Moreover, our progressive feature fusion-based SECGS model is effective and robust in recognizing both fine-grained and coarse-grained materials.
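The GS-style middle feature fusion described above can be sketched minimally as concatenating several DVSs and handing the fused vector to one general classifier. The features below are synthetic stand-ins (not the paper's data or code), and the classifier is interchangeable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def middle_fuse(dvs_list):
    """Fuse k DVS matrices (n_samples x d_i each) into one feature matrix."""
    return np.concatenate(dvs_list, axis=1)

# Synthetic stand-ins for two DVSs that both carry the class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
dvs1 = y[:, None] + rng.normal(0.0, 0.5, (100, 4))
dvs2 = y[:, None] + rng.normal(0.0, 0.5, (100, 6))

fused = middle_fuse([dvs1, dvs2])                        # shape (100, 10)
clf = LogisticRegression(max_iter=1000).fit(fused, y)    # the single general classifier
```

Since the best classifier differs per dataset (SVM for MattrSet, LR for Fabric), the `LogisticRegression` here could equally be an SVM, KNN, or boosting model.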

Comparisons with SOTA Baseline Models.
In this section, we compare the SECGS model with a group of SOTA baseline models. Detailed accuracy comparisons are presented in Table 8. On the MattrSet dataset, the SECGS 3-SVM model achieved a very competitive performance compared with that of the best baseline (HMF2). Certainly, material recognition remains a challenge on this dataset. On the Fabric dataset, the SECGS 5-LR model outperformed all the baseline models, obtaining an approximately 10.37% performance improvement over the most competitive recently proposed material recognition model (HMF2). Hence, unlike the baseline models, the SECGS model achieved a more balanced recognition performance.
This is mainly attributed to the diverse knowledge learned by the progressive feature fusion method. Moreover, unlike GS-AdaBoost and GS-XGBoost, our model adopts only the SENet features, and unlike HMF2, it performs early feature fusion only once. All this helps improve its real-time efficiency (Table 9).
In addition, on the Fabric dataset, the SECGS model achieved an approximately 1.11% performance improvement compared with that of the most competitive fine-grained classification baseline (DCL).
This means that the SECGS model can mine sufficient knowledge to obtain better performance. On the MattrSet dataset, it performed comparably to the most competitive fine-grained classification model (PMG). This also demonstrates the effectiveness of the knowledge mined by our method.
We also found that our variants, SEC LR and SEC(s), performed well (e.g., on the Fabric dataset, SEC(s) outperformed the HMF2 model, with an improvement of 8.85%). The experimental results also illustrate an aspect of the ablation analysis; the corresponding accuracy in descending order was "SECGS > SEC(s) > SEC > general classifier." This further confirms the effectiveness of our model, which can learn diverse valuable knowledge through progressive feature fusion. Notably, Figures 4 and 5 show another perspective of the ablation analysis. The corresponding comparisons of the real-time efficiency are presented in Table 9. We performed our experiments using the recognition system we deployed on a laptop (Figure 6). For instance, on the MattrSet dataset, our model required only 13.26 s to test a material image, which is 5.43 s faster than the HMF2 model. In summary, the results given in Table 9 indicate that, in addition to its effectiveness and robustness, the SECGS model is more efficient than the SOTA baseline models.
In summary, the proposed SECGS model is effective, efficient, and robust for material recognition on both the Fabric and MattrSet datasets.

Cross-Validation Analysis.
Owing to the scarcity of material samples, cross-validation is required to objectively evaluate the proposed material recognition model. In this section, we evaluate two-fold and four-fold cross-validations to show the robustness of our method. Table 10 shows the corresponding performance comparisons. Here, we only compared three features, namely, ResNet, SENet, and DVS, which is sufficient to validate our choice.
As shown in Table 10, the SENet feature outperformed the ResNet feature, whereas the DVS outperformed both the SENet and ResNet features by a large margin. It is therefore reasonable to mine the implicit cross-modal knowledge among the heterogeneous SENet features. In addition, the performance of the four-fold cross-validation was better than that of the two-fold cross-validation, which further indicates that the proposed DVSs are robust for material recognition and help enhance the practicability of the SECGS model. Certainly, material recognition remains a challenge on the MattrSet dataset. Accordingly, we used four-fold cross-validation to complete all experiments.
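The two-fold vs. four-fold comparison can be reproduced in miniature as below. The features are synthetic placeholders standing in for the actual ResNet/SENet/DVS vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mean_cv_accuracy(features, labels, k):
    """Mean accuracy of a k-fold (stratified) cross-validation."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=k).mean()

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, 120)
# Stand-in feature: one-hot class signal plus Gaussian noise.
features = np.eye(3)[labels] + rng.normal(0.0, 0.4, (120, 3))

acc_2fold = mean_cv_accuracy(features, labels, 2)
acc_4fold = mean_cv_accuracy(features, labels, 4)
```

A larger k leaves more training data per fold, which is one reason four-fold scores tend to exceed two-fold scores, as observed in Table 10.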

Comparisons of the Summation and Concatenation Modes in Early Feature Fusion.
The proposed early feature fusion method uses two fusion modes, namely, summation (SUM) and concatenation (CON), to create the DVSs. Ten DVSs were chosen to evaluate the effectiveness of these modes, and the corresponding results are shown in Figure 7. In the evaluation, we used five classifiers and four-fold cross-validation to perform material recognition, and the values of the AAC metric were used to draw the line graphs.
As shown in Figure 7, the CON mode was superior to the SUM mode on each benchmark dataset. The SUM mode reduced the dimension of the fused features (DVSs), thus diluting the effective cross-modal knowledge. In contrast, the dimension generated by the CON mode was more appropriate, as it retained the maximum clustering canonical correlations among the heterogeneous SENet features. Therefore, our model uses the CON mode to complete the early feature fusion.
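The two fusion modes differ only in how two feature arrays are combined. A minimal sketch (the feature vectors below are hypothetical stand-ins for two SENet features):

```python
import numpy as np

def early_fuse(feat_a, feat_b, mode="CON"):
    """Combine two feature arrays of equal shape."""
    if mode == "SUM":   # same dimensionality; values are added, detail is diluted
        return feat_a + feat_b
    if mode == "CON":   # dimensionality doubles; every value is kept
        return np.concatenate([feat_a, feat_b], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

sr50 = np.array([1.0, 2.0])       # stand-in for one SENet feature vector
srxt101 = np.array([3.0, 4.0])    # stand-in for another
```

CON preserves every value at the cost of a longer vector, which matches the observation that it retains more of the canonical-correlation structure than SUM.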

Evaluation of the DVSs.
In this section, we further make a comprehensive validation of the effectiveness of the proposed DVSs. Table 4 shows the performance of the corresponding DVSs generated using different features. DVS (re) represents the DVS generated by the ResNet features, including ResNet 50, ResNet 101, ResNet 152, ResNeXt 50, and ResNeXt 101. DVS (re-se) represents the DVS generated by the ResNet and SENet features. DVS (se) represents the DVS generated by the SENet features. ΔImprove 1 represents the performance improvement of the DVS (re) compared with that of the corresponding ResNet feature. ΔImprove 2 represents the performance improvement of the DVS (re-se) compared with that of the corresponding SENet feature. ΔImprove 3 represents the performance improvement of the DVS (se) compared with that of the corresponding SENet feature. Several valuable conclusions can be drawn as follows: (1) According to the MAA, there was an obvious upward trend with the adoption of the SENet features. This indicates that the SENet features, rather than the ResNet features, can provide more valuable knowledge for material recognition. This finding validates our choice of the SENet features as the basis for our progressive feature fusion method. (2) According to ΔImprove 1, the corresponding DVSs achieved satisfactory performance even when we only used the ResNet features, which shows the robustness of the proposed early feature fusion method. Moreover, a larger MAA improvement was obtained on the MattrSet dataset, which confirms that a significant amount of cross-modal knowledge among the ResNet features was mined from the MattrSet dataset. (3) According to ΔImprove 2, the performance improvements were not always positive when only one SENet feature was used to complete the early feature fusion. There must be some noise between the ResNet and SENet features, which significantly affected the quality of the acquired cross-modal knowledge (DVS) and led to a performance decline.
This suggests that we should use CNN-based features from a homogeneous network to complete the proposed early feature fusion, which helps obtain high-quality cross-modal knowledge for improving the final recognition performance. (4) According to ΔImprove 3, the performance improvements were always positive when only the SENet features were used for the early feature fusion. High-quality cross-modal knowledge was fully mined to generate the proposed DVSs. This also validates our original selection and serves as a basis for the subsequent middle feature fusion.
In conclusion, the proposed DVSs mined from the heterogeneous SENet features are effective and robust for high-quality material recognition.

Visualization of the Proposed DVSs.
To further illustrate the effectiveness of the proposed DVSs, we employed the well-known t-SNE [63] algorithm to visualize some relevant features. Taking the Fabric dataset as an example, we chose several features, including ResNet 50 (R50) (similar to ResNeXt 101 (RXt101)), SR50, SRXt101, R50-RXt101 (DVS (re)), SR50-RXt101 (DVS (re-se)), and SR50-SRXt101 (DVS (se)), for the visualization. The corresponding experimental results are shown in Figure 5. Many valuable conclusions can be drawn as follows: (1) The original R50 feature represented the material categories in a very messy manner. Hence, the traditional deep learning features are insufficient for describing different material categories (see Tables 6 and 10 for the quantitative results).
(2) The original SR50 represented the material categories better than the R50 feature. However, confusion still occurred, as shown in Figure 5(b). A similar result can be seen in Figure 5(c). Therefore, the original deep learning-based features are insufficient for depicting material images.
(3) DVS (se) can represent each material category more accurately than the single features (R50, SR50, and SRXt101). The same color blocks (samples in the same category) started to aggregate, whereas different color blocks (samples in different categories) began to separate from each other. This means that the semantic distance between samples of the same category shrank, whereas that between samples of different categories expanded gradually, which helped the classification model train a more appropriate decision function (Table 5). When the proposed DVSs were used, even the traditional classification models, including SVM, KNN, and LR, were able to obtain satisfactory recognition performance.
(4) Comparing the three types of DVS (DVS (re), DVS (re-se), and DVS (se)), we obtained the performance rank in descending order as follows: "DVS (se) > DVS (re-se) > DVS (re)." According to Figures 5(d)-5(f), DVS (se) is better than DVS (re) at representing different material categories. The SENet features, rather than the ResNet features, provide more effective knowledge for early feature fusion. This is an intuitive explanation of the results presented in Table 4.
This also validates our original selection. From Figures 5(a)-5(d), we can see that DVS (re) had a better ability to represent different material categories than the original R50, which shows the effectiveness of the proposed early feature fusion method from another perspective. According to the comparisons between Figures 5(e)-5(f), the material categories depicted by DVS (re-se) are messier than those depicted by DVS (se). This indicates that there is a certain amount of noise between the ResNet and SENet features, which might have affected the final quality of the acquired cross-modal knowledge. Moreover, this also reduced the ability of DVS (re-se) to characterize different material categories. All these findings confirm the conclusions we mentioned in the above subsection "Evaluation of the DVSs."

Evaluation of All the DVS Combinations.
Figure 4 shows the corresponding recognition performance of all the DVS combinations in the proposed middle feature fusion procedure. As can be seen from the median line, the accuracy of each classifier tends to improve with the increase in the number of DVSs. This means that, as the number of DVSs increases, the SECGS model can acquire sufficient feature-shared knowledge to become more stable (similar results can also be seen in Table 7). For the MattrSet dataset, an accuracy peak was observed when the SVM classifier used three DVSs. For the Fabric dataset, an accuracy peak was observed when the LR classifier used five DVSs. These results are consistent with the best values given in Table 8. Moreover, the SVM classifier achieved the best overall performance on the MattrSet dataset, whereas the LR classifier achieved it on the Fabric dataset. Hence, even the traditional classifiers were able to obtain satisfactory performance using the proposed progressive feature fusion method owing to the diverse knowledge obtained by our model. This finding is more evident on the MattrSet dataset.
In addition, the performance variations of our approach showed a positive effect of the number of DVSs, which indicates that more DVSs yield more stable performance (the lower edge (minimum) of each box increases with the number of DVSs, and the smaller space between the upper and lower edges indicates that the corresponding performance fluctuates less). These results show that feature-shared knowledge can be fully obtained using the proposed approach, thus improving the final recognition performance. In summary, like the cross-modal knowledge contained in the DVSs, the feature-shared knowledge among different DVSs can also be mined by the progressive feature fusion approach, and this knowledge is effective for high-quality material recognition.

Evaluation of Each Material Category.
We further evaluated the recognition performance on each material category. The corresponding results are shown in Figure 8. The evaluation was done as per the ablation analysis procedure, which helped us to objectively evaluate each component of the progressive feature fusion method. First, gradual performance improvements were observed on each dataset using different components. This observation was more evident on the Fabric dataset (e.g., fleece, silk, and wool). Second, the proposed DVSs easily outperformed the original SENet feature, which indicates that they, unlike the single image feature (SENet), obtained sufficient cross-modal knowledge from the heterogeneous SENet features to better depict material images (Tables 6 and 10 and Figure 5).
Third, the SECGS model outperformed the proposed DVSs, which means that implicit feature-shared knowledge was fully obtained through middle feature fusion to boost the final recognition performance. Moreover, our model performed better on the Fabric dataset (similar results can be seen in Tables 7 and 8). Owing to its different implementation mechanisms, our model will be a good complement to the SOTA HMF2 model, which did not do well on the Fabric dataset. However, the recognition of some materials, such as bag_pu and bag_polyester, remains challenging. We also found that better and more stable performance was achieved by using the proposed middle feature fusion method than by using ensemble learning (Table 7). The proposed progressive feature fusion method (early feature fusion + middle feature fusion) is more effective and robust than the SEC(s) model (early feature fusion + late feature fusion). Although ensemble learning performed well on some materials, the proposed middle feature fusion method achieved a more balanced performance improvement on most material categories. All this confirms the effectiveness and robustness of the proposed progressive feature fusion method from another perspective.
To further show the real contribution of each component of our approach, we performed another ablation analysis experiment and plotted the results in Figure 9. This figure shows the corresponding performance improvement after each ablation operation. "SE-SF" represents the largest performance improvement of the SENet feature relative to that of the best single feature. "DVS-SE" represents the largest performance improvement of the proposed DVSs relative to that of the SENet feature. "SECGS-DVS" represents the largest performance improvement of the proposed SECGS model relative to that of the generated DVS. "SEC(s)-DVS" represents the largest performance improvement of the proposed SEC(s) model relative to that of the generated DVS.
As shown in Figure 9, each component of the SECGS model contributed to boosting the final recognition performance. However, the contribution rank of the components in descending order was "SENet > GS > DVS." Hence, the SOTA SENet feature is the most important component of the proposed SECGS model, whereas the effect of the proposed DVSs is relatively small. This conclusion is more evident on the Fabric dataset.
This also motivates us to design a more effective early feature fusion method to create novel DVSs in the future. We also found that our variant model (SEC(s)) obtained satisfactory recognition performance, which further confirms the robustness of the proposed progressive feature fusion method. Clearly, the proposed middle feature fusion method outperformed the voting-based late feature fusion method, which confirms the effectiveness of the implicit feature-shared knowledge mined by our GS-based middle feature fusion method.
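The ablation quantities plotted in Figure 9 are simply stage-to-stage accuracy differences. A sketch with hypothetical accuracy values (the numbers below are illustrative, not taken from the paper's tables):

```python
def ablation_improvements(acc):
    """Stage-to-stage improvements, mirroring the labels used in Figure 9.
    `acc` maps a pipeline stage to its best accuracy (%)."""
    return {
        "SE-SF": acc["SENet"] - acc["best_single"],   # SENet over best single feature
        "DVS-SE": acc["DVS"] - acc["SENet"],          # DVS over SENet
        "SECGS-DVS": acc["SECGS"] - acc["DVS"],       # GS middle fusion over DVS
        "SEC(s)-DVS": acc["SEC(s)"] - acc["DVS"],     # soft-voting late fusion over DVS
    }

# Hypothetical accuracies, for illustration only.
example = {"best_single": 80.0, "SENet": 85.0, "DVS": 87.0,
           "SECGS": 90.0, "SEC(s)": 88.5}
deltas = ablation_improvements(example)
```

Under these illustrative numbers the ranking "SENet > GS > DVS" would correspond to SE-SF being the largest delta, followed by SECGS-DVS, then DVS-SE.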

Real-Time Material Recognition System
To verify the practicability of the proposed SECGS model, we designed a real-time material recognition system based on the SECGS model with two recognition modes: local uploading and instant shooting. The workflow of this system is shown on the left-hand side of Figure 6. First, users can upload local images or instantly shoot a photo using a mobile IP camera. Second, the system performs feature extraction and progressive feature fusion to complete the material recognition (see the right-hand side of Figure 6). The entire recognition operation is performed on the front page of the cloud application without any manual intervention. This can improve users' interaction experience and serve as an important basis for subsequent configurations on a cloud server.
As Table 9 shows, our SECGS model-based real-time material recognition system achieves satisfactory efficiency compared with the SOTA baseline models. Hence, the SECGS model is efficient for real-time material recognition. If an advanced computer or server with a higher configuration is adopted, the corresponding real-time efficiency will be further improved. Unlike the HMF2 and GS-XGBoost models, the SECGS model adopts only the SENet features to complete the progressive feature fusion. This improves the final real-time efficiency from another perspective. Hence, the real-time material recognition system narrows the gap between theoretical research and practical applications (scene recognition, industrial detection, robot vision, instant shopping, etc.). For example, in the field of industrial detection, the system can assist textile enterprises in completing fabric defect classification more efficiently. Workers only need a mobile phone to perform this task, thus improving work efficiency and reducing labor costs. Customers who are shopping need only a mobile phone to accurately identify the materials of the clothing goods they want to purchase, assisting them in their final purchasing decisions. The system can also be combined with an intelligent robot such as NAO [64] to complete real-time clothing recognition and grasp estimation, which can better assist the elderly. In summary, our SECGS model-based real-time material recognition system can be seamlessly used with many daily applications to meet practical demands.
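The per-image test times reported in Table 9 can be measured with a simple wall-clock helper like the one below; `predict` is a placeholder for the deployed pipeline's inference call, not the paper's actual code:

```python
import time

def seconds_per_image(predict, images):
    """Average wall-clock seconds taken by `predict` over a batch of images."""
    start = time.perf_counter()
    for image in images:
        predict(image)
    return (time.perf_counter() - start) / len(images)

# Placeholder predictor standing in for the SECGS recognition pipeline.
dummy_images = list(range(50))
avg = seconds_per_image(lambda img: img * 2, dummy_images)
```

Averaging over a batch smooths out per-call jitter, which matters when comparing models whose per-image times differ by only a few seconds.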

Conclusions and Future Work
We presented a model called SECGS that uses progressive feature fusion for material recognition. We demonstrated its effectiveness, efficiency, and robustness on two benchmark datasets. By considering progressive feature fusion among the SOTA heterogeneous SENet features, the proposed SECGS model can gather sufficient material knowledge, including cross-modal and feature-shared knowledge, to improve the final recognition performance. Moreover, it can be deployed on a laptop for online real-time material recognition, thereby demonstrating its high practicality. In the future, we plan to advance our research in the following aspects: (1) We intend to use some SOTA data augmentation methods, such as AutoAugment [65] and Adversarial AutoAugment [66], to solve the problem of unbalanced material images and establish a firm data foundation for high-quality material recognition. (2) We plan to apply the SECGS model to other research fields, including tumor recognition [67], COVID-19 detection [68], and scene recognition [69]. This can help validate the versatility of our model. (3) We will try to develop a material recognition mobile application, which can really help narrow the gap between theoretical research and practical applications.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.