A Deep Learning-based Global and Segmentation-based Semantic Feature Fusion Approach for Indoor Scene Classification

This work proposes a novel approach that uses a semantic segmentation mask to obtain a 2D spatial layout of the segmentation categories across the scene, designated segmentation-based semantic features (SSFs). These features represent, per segmentation category, the pixel count, the 2D average position, and the respective standard deviation values. Moreover, a two-branch network, GS2F2App, that exploits CNN-based global features extracted from RGB images and the segmentation-based features extracted from the proposed SSFs, is also proposed. The GS2F2App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both.


I. INTRODUCTION
Scene classification is a computer vision task that, given the input data, predicts its scene category by analyzing the background and target objects [1]. It is widely used in applications [2] such as video surveillance, human-computer interaction, and mobile robotics, which keeps scene classification an open and challenging research topic.
The same scene category can exhibit multiple furniture configurations, points of view, and scene areas, so obtaining a feature space that covers all of these conditions may be difficult [3]. This intra-category variation issue needs to be taken into account, especially when the same category varies so much that only a few common patterns can be extracted. Moreover, as the number of scene categories increases, more appearance similarities are shared between categories, which negatively impacts the inter-scene boundaries, making it difficult to obtain a correct scene category prediction [4].
Convolutional Neural Networks (CNNs) have significantly enhanced the performance of computer vision tasks. However, despite the performance boost in the scene classification task, their performance still falls far below that achieved in object recognition and classification tasks [4]. This may occur due to the intra-category variation and inter-category ambiguity issues [5], which do not allow CNNs to obtain a clear pattern within the same scene category, nor between different categories [5], leading to the need for new sources of information about the scene that, when mapped into a feature space, can disentangle the various categories.
To identify an indoor scene, people focus on the objects available in it, as well as on how the objects are distributed across the scene [7]. Therefore, inspired by how people identify an indoor scene, recent works [3]-[5], [8] have been trying to overcome the intra-category variation and inter-category ambiguity issues by exploiting semantic information, such as object-related information and object correlations across the scene. To obtain semantic information, such works have relied on object detection techniques to locate the objects in the scene. Despite the performance improvements, which validate the importance of semantic information in scene classification, most object detectors predict bounding boxes, which fit objects with square or rectangular shapes well, but are less adequate for objects with more irregular bodies. Most objects do not have geometrically well-defined boundaries, which makes object detectors ill-suited to provide a well-defined object shape estimation. Semantic segmentation approaches, on the other hand, offer a solution to this problem, as they identify objects at the pixel level without any geometrical constraints. With semantic segmentation, arbitrary boundaries can be predicted, which is an advantage over object detectors. Recent works have leveraged this advantage, using segmentation masks to extract more precise and meaningful object-related features [4].
This paper presents a novel approach, segmentation-based semantic features (SSFs), that, given a semantic segmentation mask, yields a semantic feature representation depicting how object categories are spread across the scene. These features represent, per object category, the pixel count, the 2D average position, and their respective standard deviation values. Thus, the proposed SSFs provide a 2D spatial layout of the object categories across the scene, as shown in Fig. 1, leading to a more meaningful semantic feature representation. Moreover, a two-branch CNN-based Global and Segmentation-based Semantic Feature Fusion Approach (GS2F2App), that exploits CNN-based global features and segmentation-based features, as shown in Fig. 2, is also proposed. Both branches' output features are concatenated for the scene prediction. The proposed GS2F2App was evaluated on two indoor scene benchmark datasets: SUN RGB-D [9] and NYU Depth V2 [6]. The main contributions of this work can be summarized as follows:
• A novel segmentation-based approach, SSFs, that represents how object categories are distributed in the scene;
• GS2F2App: a two-branch network that exploits global features extracted from RGB images and segmentation-based features;
• An evaluation of the GS2F2App on the SUN RGB-D [9] and NYU Depth V2 [6] datasets, on both of which, to the best of our knowledge, state-of-the-art results were achieved.
The results of this work suggest that the proposed SSFs have a significant impact on the classification performance, achieving state-of-the-art performance when segmentation masks of sufficient quality are provided. As expected, the performance of the segmentation network has a significant impact on the proposed SSFs, since good segmentation masks, with well-defined object boundaries, lead to higher-quality SSFs, while poor mask predictions deteriorate the quality of the extracted SSFs. However, the generalization capability of the segmentation network is not a goal of this work; instead, the goal is to assess the relevance of the segmentation mask as a semantic information source for scene classification.

II. RELATED WORK
Most indoor scene classification pipelines in the literature are composed of three main stages [2]: feature extraction to acquire visual characteristics of the scene, feature transformation to obtain a scene representation, and classification based on the generated scene representation. However, due to a paradigm shift imposed by the promising performances achieved by CNNs, which can extract complex and rich features [10], [11], these stages have been compressed into single end-to-end networks [7], [12], [13].
In an attempt to obtain more meaningful information about the scene, which can lead to a better scene representation, different types of feature transformations and new sources of information, such as object correlations [3], [8], spatial layout patterns [12], and discriminative region detection [5], [14], have been proposed in the literature [2]. In recent works, approaches that, directly or indirectly, leverage the objects recognized in the scene have been widely exploited [3], [8], [14]-[19]. Pereira et al. [15] exploited CNN-based global features together with the occurrence of objects recognized in the scene. Later, an improvement was proposed in [3], where inter-object distance relationships were also exploited. Song et al. [8] proposed two object-based image representations that exploit the relationships between co-occurring objects and from sequential occurrences of objects. On the other hand, Mosella-Montoro & Ruiz-Hidalgo [20] simultaneously exploited CNN-based global features and 3D geometric features, the latter extracted from a 3D point cloud by a multi-neighborhood graph convolutional network. Other works, such as [7], [8], [12], [13], [20], improved their feature quality by extracting multi-modal features from RGB and depth images.
Recent works used semantic segmentation masks, instead of object bounding boxes, to also exploit object correlations across the scene [4], [21], [22]. Ahmed et al. [22] proposed a multiple object categorization method that applies learning kernels to segmentation masks to obtain representations of the object categories over the scene, which are later used for scene classification. Herranz-Perdiguero et al. [21] used the segmentation mask to locate the recognized objects in the scene and also to generate pixel distributions of the objects in a histogram-like approach. On the other hand, López-Cifuentes et al. [4] proposed a two-branch network that simultaneously exploits global features extracted from an RGB image and local semantic features extracted from a semantic segmentation mask. The proposed work, GS2F2App, shares some similarities with [4]: both works exploit CNN-based global features and semantic features obtained through a semantic segmentation mask. However, while the GS2F2App extracts the proposed SSFs from the segmentation mask and combines global features with semantic features through a concatenation operation, the work presented in [4] uses the segmentation mask as a CNN input and combines global and semantic features through an attention module.

III. METHODOLOGY
An overview of the proposed GS2F2App is presented in Fig. 2. It is composed of a global branch and a semantic branch, whose output features are concatenated. Then, additional feature correlations are learned and a scene class prediction is obtained.

A. Semantic Branch
A scene can be defined by the unique semantic information available and recognized in it, such as the objects and their spatial distribution. Hence, to obtain more meaningful information about the objects and their spatial distribution in a given scene, SSFs, which are extracted from semantic segmentation masks, are proposed.
To predict the semantic segmentation masks from RGB images, the DeepLabv3+ [23] encoder-decoder-based semantic segmentation network is used. The advantage of this network lies in its mask prediction approach, which combines a decoder module, gradually recovering spatial information and leading to sharper object boundaries, with spatial pyramid pooling, which encodes multi-scale contextual information by applying pooling operations at multiple rates.
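As a concrete reference, the following minimal sketch produces a per-pixel category mask from an RGB image. Note that torchvision ships DeepLabV3 rather than the DeepLabv3+ used in this work, so the pre-trained model below is only an illustrative stand-in; in this work, the segmentation network is fine-tuned on the target datasets.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Illustrative stand-in for DeepLabv3+ (older torchvision versions take
# pretrained=True instead of the weights argument).
model = deeplabv3_resnet50(weights="DEFAULT").eval()

@torch.no_grad()
def predict_mask(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: normalized (3, H, W) image tensor -> (H, W) mask of category ids."""
    logits = model(rgb.unsqueeze(0))["out"]  # (1, num_classes, H, W)
    return logits.argmax(dim=1).squeeze(0)   # per-pixel class prediction
```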

Segmentation-based Semantic Features (SSFs)
To leverage the object category information available in a segmentation mask, the following SSFs are proposed per object category: i) the pixel count ($P_C$); ii) the 2D average position ($I_{\mu_x}$, $I_{\mu_y}$); iii) the 2D standard deviation values based on each 2D position across the scene ($I_{\sigma_x}$, $I_{\sigma_y}$).
Given a set of object categories $L = \{0, 1, \ldots, l\}$, a segmentation mask $S_m = \{p_{i,j} \mid p_{i,j} \in L\}$, with size $w \times h$, is defined, where $i = 1, \ldots, h$, $j = 1, \ldots, w$, and each pixel value $p_{i,j}$ represents an object category. Hence, the pixel count per object category ($P_C$), which represents how much a specific object category is present in the scene, is calculated as follows:

$$P_C(n) = \sum_{i=1}^{h} \sum_{j=1}^{w} \delta(p_{i,j}, n), \quad (1)$$

$$\delta(p_{i,j}, n) = \begin{cases} 1, & \text{if } p_{i,j} = n \\ 0, & \text{otherwise,} \end{cases} \quad (2)$$

where $n$ represents an arbitrary object category. Also, $\bar{P}_C$, a normalized value of $P_C$, is obtained by dividing the latter by the total number of pixels:

$$\bar{P}_C(n) = \frac{P_C(n)}{w \cdot h}. \quad (3)$$

The 2D average position ($I_{\mu_x}$, $I_{\mu_y}$) of each object category provides spatial information representing the average location (in the image space) where each object category is mainly present in the scene. It allows obtaining a meaningful 2D spatial layout of the object categories across the scene. The 2D average position is calculated using the segmentation mask's indexes $(i, j)$ that correspond to a specific object category $n$, as expressed in (4) and (5):

$$I_{\mu_x}(n) = \frac{1}{P_C(n)} \sum_{(i,j):\, p_{i,j} = n} j, \quad (4)$$

$$I_{\mu_y}(n) = \frac{1}{P_C(n)} \sum_{(i,j):\, p_{i,j} = n} i. \quad (5)$$

Both values are normalized by dividing by $w$ and $h$, respectively, as expressed in (6) and (7):

$$\bar{I}_{\mu_x}(n) = \frac{I_{\mu_x}(n)}{w}, \quad (6) \qquad \bar{I}_{\mu_y}(n) = \frac{I_{\mu_y}(n)}{h}. \quad (7)$$
The 2D standard deviation values ($I_{\sigma_x}$, $I_{\sigma_y}$), per object category, also provide spatial information, representing how close or apart pixels of the same object category are. A high standard deviation value means that the object category is spread out across the scene, while a low value means that it is concentrated in a small region of the scene, as shown in Fig. 1. The 2D standard deviation values, along with the 2D average position, provide relevant spatial information about the object categories, allowing a more meaningful 2D spatial layout of the object categories over the scene to be obtained. The 2D standard deviation values are also calculated using the segmentation mask's indexes $(i, j)$ that correspond to a specific object category $n$, as expressed in (8) and (9):

$$I_{\sigma_x}(n) = \sqrt{\frac{1}{P_C(n)} \sum_{(i,j):\, p_{i,j} = n} \left(j - I_{\mu_x}(n)\right)^2}, \quad (8)$$

$$I_{\sigma_y}(n) = \sqrt{\frac{1}{P_C(n)} \sum_{(i,j):\, p_{i,j} = n} \left(i - I_{\mu_y}(n)\right)^2}. \quad (9)$$

Both values are also normalized by dividing by $w$ and $h$, respectively, as expressed in (10) and (11):

$$\bar{I}_{\sigma_x}(n) = \frac{I_{\sigma_x}(n)}{w}, \quad (10) \qquad \bar{I}_{\sigma_y}(n) = \frac{I_{\sigma_y}(n)}{h}. \quad (11)$$
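The computation in (1)-(11) can be summarized in a short sketch. The output layout, one row of five normalized values per object category, is an assumption made here for illustration; the exact input arrangement fed to the SSFs-CNN is not detailed in the text.

```python
import numpy as np

def compute_ssfs(mask: np.ndarray, num_categories: int) -> np.ndarray:
    """Compute the five normalized SSFs per category from an (h, w) mask."""
    h, w = mask.shape
    feats = np.zeros((num_categories, 5), dtype=np.float32)
    for n in range(num_categories):
        ys, xs = np.nonzero(mask == n)   # row (y) and column (x) indexes
        if xs.size == 0:                 # category absent from the scene
            continue
        feats[n, 0] = xs.size / (w * h)  # normalized pixel count, eq. (3)
        feats[n, 1] = xs.mean() / w      # normalized average x, eqs. (4), (6)
        feats[n, 2] = ys.mean() / h      # normalized average y, eqs. (5), (7)
        feats[n, 3] = xs.std() / w       # normalized x std, eqs. (8), (10)
        feats[n, 4] = ys.std() / h       # normalized y std, eqs. (9), (11)
    return feats
```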

SSFs-CNN Architecture
To further exploit the correlations and non-linear patterns among the proposed SSFs, a small CNN architecture, the SSFs-CNN, is proposed, as shown in Fig. 2. It is composed of three 2D convolutional layers with 64, 128, and 64 output channels, respectively, all with a kernel size of 3, a stride of 1, and a padding of 1. A fully-connected (FC) layer, FC2, follows. All convolutional and FC layers use ReLU as the activation function. A sketch of this architecture is given below.
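In the following minimal PyTorch sketch, the single input channel, the (num_categories, 5) input grid, and the FC2 output size are assumptions; the paper specifies only the convolutional configuration.

```python
import torch
import torch.nn as nn

class SSFsCNN(nn.Module):
    """Sketch of the SSFs-CNN: three 3x3 convolutions (64, 128, 64 channels,
    stride 1, padding 1), each followed by ReLU, then an FC layer (FC2)."""

    def __init__(self, num_categories: int, fc_dim: int = 512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        # FC2: flattened conv output (padding preserves the spatial size)
        self.fc2 = nn.Sequential(nn.Linear(64 * num_categories * 5, fc_dim),
                                 nn.ReLU())

    def forward(self, ssfs: torch.Tensor) -> torch.Tensor:
        # ssfs: (B, 1, num_categories, 5) grid of the five SSFs per category
        return self.fc2(self.convs(ssfs).flatten(1))
```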

B. Global Branch
To learn and extract global features from RGB images, the convolutional layers of a state-of-the-art CNN are used. Moreover, the top branch of the proposed GS2F2App also includes an FC layer (see FC1 in Fig. 2). To assess the effectiveness of different deep learning-based feature extraction techniques in indoor scene classification, and also to show that the proposed SSFs do not depend on a specific global feature extractor, the feature extraction layers of the following six state-of-the-art CNNs, also designated backbone networks, were used: VGG16 [24], ResNet18/50/101 [10], DenseNet [11], and MobileNetV2 [25].
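As an illustration of how such backbones can be wrapped, the sketch below builds the global branch (feature extraction layers plus FC1) for two of the networks; the FC1 size is an assumed value, and the remaining backbones follow the same pattern.

```python
import torch.nn as nn
from torchvision import models

def make_global_branch(backbone: str = "resnet50", fc_dim: int = 512) -> nn.Module:
    """Backbone convolutional layers followed by FC1 (fc_dim is assumed)."""
    if backbone == "resnet50":
        net = models.resnet50(weights="DEFAULT")
        features = nn.Sequential(*list(net.children())[:-1])  # drop classifier
        out_dim = net.fc.in_features                          # 2048
    elif backbone == "mobilenet_v2":
        net = models.mobilenet_v2(weights="DEFAULT")
        features = nn.Sequential(net.features, nn.AdaptiveAvgPool2d(1))
        out_dim = net.last_channel                            # 1280
    else:
        raise ValueError(f"unsupported backbone: {backbone}")
    return nn.Sequential(features, nn.Flatten(1),
                         nn.Linear(out_dim, fc_dim), nn.ReLU())  # FC1
```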

C. Feature Fusion
To further demonstrate the effectiveness and the positive impact that the proposed SSFs may have in obtaining a more complete feature representation when combined with CNN-based global features, a two-step learning feature fusion technique, similar to the one used in [3], is employed. In the first learning step, an RGB-only base model is obtained, i.e., only the CNN used to learn and extract global features, as well as the FC1 layer, are trained. In the second learning step, the weights trained in the first step are frozen and the remaining weights are trained: the model learns to extract the best features from the generated SSFs and how to combine global and segmentation-based features. Global and segmentation-based features are combined by concatenating both feature vectors, FC1 and FC2, respectively. Then, for the scene class prediction, the resulting feature vector goes through two FC layers, FC3 and FC4, with the ReLU activation function applied in FC3.
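The sketch below illustrates the fusion head and the weight freezing of the second learning step; the layer sizes are assumptions, as the paper does not list them.

```python
import torch
import torch.nn as nn

class GS2F2App(nn.Module):
    """Concatenate FC1 (global) and FC2 (semantic) features, then FC3 + FC4."""

    def __init__(self, global_branch: nn.Module, semantic_branch: nn.Module,
                 fc_dim: int = 512, hidden: int = 512, num_scenes: int = 19):
        super().__init__()
        self.global_branch = global_branch      # backbone + FC1
        self.semantic_branch = semantic_branch  # SSFs-CNN + FC2
        self.fc3 = nn.Sequential(nn.Linear(2 * fc_dim, hidden), nn.ReLU())
        self.fc4 = nn.Linear(hidden, num_scenes)

    def forward(self, rgb: torch.Tensor, ssfs: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.global_branch(rgb),
                           self.semantic_branch(ssfs)], dim=1)
        return self.fc4(self.fc3(fused))

# Second learning step: freeze the RGB-only base model trained in step one.
def freeze_global_branch(model: GS2F2App) -> None:
    for p in model.global_branch.parameters():
        p.requires_grad = False
```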

IV. EXPERIMENTS
A. Datasets
The proposed approach was evaluated on two scene classification datasets: SUN RGB-D [9] and NYU Depth Dataset v2 [6].

1) SUN RGB-D Dataset: It provides 10,335 RGB-D image pairs captured with different cameras: Kinect v1, Kinect v2, RealSense, and Asus Xtion. Following the public split in [9], there are 4,845 training images and 4,659 testing images, distributed across 19 indoor scene categories.
2) NYU Depth Dataset v2: It contains 1,449 RGB-D images distributed into 27 scene categories. However, following the benchmark split [26], the original scene categories were reorganized into 10 categories (the 9 most common plus "other"). The split has 795 training and 654 testing images.

B. Implementation Details
To train the segmentation network, both the training and testing sets of the datasets were used in the network's training phase. The encoder network's weights were initialized with an ImageNet pre-trained model and fine-tuned over 100 epochs using the AdamW [27] optimizer with a learning rate of 0.001 and a weight decay of 0.05. Also, a cosine annealing schedule [28] was used to perform partial warm restarts of the learning rate. The proposed approach was trained in two steps. In the first step, the weights of the CNN used in the global feature branch were initialized with an ImageNet pre-trained model and fine-tuned to the target data over 100 epochs. The second learning step combines the CNN-based global features with the SSFs over 100 epochs. In both learning steps, the Adam optimizer with a learning rate of 10^-4, a weight decay of 0.0005, and a mini-batch size of 32 was used. All experiments were implemented in Python 3.10.4 with the PyTorch framework (version 1.11.0), and performed using an Nvidia RTX 3060 GPU, 64GB of RAM, and an AMD Ryzen 7 3800X @ 3.9 GHz CPU. All RGB images and segmentation masks were resized to 224 × 224.
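A sketch of the optimizer configuration described above follows; the placeholder modules and the warm-restart period T_0 are assumptions for illustration.

```python
import torch
import torch.nn as nn

seg_model = nn.Conv2d(3, 38, kernel_size=1)  # placeholder for DeepLabv3+
model = nn.Linear(1024, 19)                  # placeholder for the GS2F2App

# Segmentation network: AdamW (lr 0.001, weight decay 0.05) with cosine
# annealing warm restarts of the learning rate.
opt_seg = torch.optim.AdamW(seg_model.parameters(), lr=1e-3, weight_decay=0.05)
sched_seg = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt_seg, T_0=10)

# Both learning steps of the proposed approach: Adam, lr 1e-4, weight decay
# 5e-4; in the second step, only the non-frozen parameters are optimized.
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-4, weight_decay=5e-4)
```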

C. Results: Comparison with SOTA Methods
To show the effectiveness of the proposed approach, the obtained results are compared with previous state-of-the-art results, as shown in Table I. The proposed GS2F2App achieved 59.8% and 75.8% accuracy on the SUN RGB-D and NYU Depth V2 datasets, respectively, which are, to the best of our knowledge, the highest reported state-of-the-art results using only the RGB modality. Similar overall performances, using the RGB-only or RGB-D modalities, were reached by [12], [20]. Caglayan et al. [12] used a CNN to extract visual features at multiple levels, which were later mapped into high-level representations using a recursive neural network. On the other hand, Mosella-Montoro & Ruiz-Hidalgo [20] exploited 2D visual texture features and 3D object-related geometric features. Close performances were also reached by [5], [8], [13] by exploiting global and local semantic features available in the scenes. Moreover, the proposed approach also presents a significant improvement over the GSF2AppV2 [3], which exploited CNN-based global features together with inter-object distance relationships.

D. Results: Ablation Study
A more in-depth evaluation of the proposed approach was conducted by using segmentation masks generated by three segmentation models with different overall accuracies: S20 represents a segmentation model with an overall accuracy of around 20%, S50 a model with approximately 50% overall accuracy, and S75 a model with an overall accuracy of around 75%. Based on the current state-of-the-art segmentation results [31], [32], S50 was considered for the benchmark. Moreover, the following intermediate evaluations were also performed: a CNN using the RGB images only (CNN-based global features); segmentation-only (semantic features); and the GS2F2App using different backbone networks. To further evaluate the effectiveness of the proposed SSFs, intermediate evaluations in which the SSFs were replaced by the segmentation masks in the semantic branch of the GS2F2App were also conducted. For such evaluations, the segmentation mask was resized to 112 × 112 and a max-pooling operation was added to each convolutional layer of the SSFs-CNN, the resulting network being designated SegMask-CNN. Moreover, a direct comparison with the GSF2AppV2 [3] method was also carried out.
1) Results on the SUN RGB-D Dataset: Table II shows the accuracy values achieved on the SUN RGB-D dataset for the segmentation-only evaluation. The reported results show that the proposed SSFs (SSFs-CNN) achieved better performance than the segmentation masks (SegMask-CNN), indicating that the SSFs contain more descriptive scene information than the whole segmentation mask. However, as better segmentation models were used, higher overall performances using segmentation-based information were achieved. Also, the significant improvement obtained by using segmentation masks generated by S50 instead of S20 shows that the performance of the segmentation model can be highly relevant to the overall indoor scene classification performance.
Table III shows the accuracy values reached on the SUN RGB-D dataset using different CNN-based global feature extraction approaches as baseline/backbone. Overall, promising results were achieved, with the proposed GS2F2App significantly improving the baseline (CNN) results, except for the MobileNetV2 network. Also, the proposed approach, using the same backbone network as in [3], presents a significant improvement over the results reported for the GSF2AppV2 [3]. The proposed approach also reached better results than those attained by using segmentation masks instead of the SSFs (+SegMask-CNN), which shows the effectiveness of the proposed approach. Despite the significant improvement of the GS2F2App over the baseline results, the accuracy of the segmentation model does not have a significant influence on the overall performance. This may happen because the dataset has a high number of scene categories with inter-category ambiguity challenges, as reported in [15], and the available segmentation object categories may not be enough to uniquely represent all scene categories. Nevertheless, the GS2F2App using ResNet50 as the backbone and SSFs extracted from segmentation masks obtained from S50 achieved the highest accuracy value, 59.8%, whose confusion matrix is shown in Fig. 3. It can be seen that the GS2F2App was able to successfully classify a subset of categories: bathroom, bedroom, classroom, corridor, furniture store, kitchen, and rest space.
On the other hand, the GS2F2App was not able to recognize the study space and lecture theater categories, which were classified as classroom, and the discussion area was misclassified as classroom or rest space. These misclassifications result from a high degree of similarity between the scene categories.

2) Results on the NYU Depth V2 Dataset: Table IV shows the accuracy values achieved on the NYU Depth V2 dataset for the segmentation-only evaluation. The reported results also show that the proposed SSFs achieved a much better performance than the segmentation masks. Furthermore, using SSFs extracted from segmentation masks generated by a more accurate segmentation model improved the indoor scene classification accuracy, which shows, as expected, that the segmentation model performance may influence the overall indoor scene results.
Table V presents the results attained using different CNN-based global feature extraction approaches as baseline/backbone. Overall, the GS2F2App shows a significant improvement over the baseline results, reaching 75.8% and 76.0% accuracy using DenseNet as the backbone and SSFs extracted from segmentation masks obtained by S50 and S75, respectively. Moreover, regardless of the backbone network in use, the GS2F2App outperformed the baseline results, as well as the results reported in [3], showing the positive impact of the proposed approach. Also, as expected, the use of SSFs extracted from segmentation masks generated by a more accurate segmentation model led to better overall performance, especially when comparing the S20 results with the S50 or S75 results. On the other hand, replacing the SSFs with segmentation masks resulted in poor performance, failing to significantly improve the baseline results, which further shows the effectiveness of the proposed SSFs. Figure 4 presents the confusion matrix obtained using DenseNet as the backbone with SSFs extracted from S50 segmentation masks. It can be seen that the majority of categories were well classified, except for the home office category, which was misclassified as living room.

V. CONCLUSION
In this paper, the SSFs, a new segmentation-based approach to obtain a more descriptive and meaningful feature representation of the scene, representing how object categories are spread across it, were proposed. Moreover, GS2F2App, a two-branch CNN architecture that exploits global features extracted from RGB images and segmentation-based features extracted from the proposed SSFs, was also presented. The GS2F2App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both, which shows the effectiveness of the proposed approach. The reported results showed that the proposed SSFs, combined with CNN-based global features, can significantly improve the CNN-based baseline results, indicating that the SSFs have a significant positive impact on indoor scene classification tasks. The results also showed that the performance of the segmentation model used to generate the semantic segmentation masks may have a significant influence on the overall performance achieved by the proposed approach. The best results were achieved by combining the SSFs with global features extracted by ResNet50 or DenseNet. Furthermore, the proposed GS2F2App achieved 75.8% accuracy on the NYU Depth V2 dataset, which is, as far as we know, the best result reported in the literature.