Research on Indoor Scene Classification Mechanism Based on Multiple Descriptors Fusion

This study addresses the limitations that non-ROI (region of interest) information interference imposes on traditional scene classification algorithms, including changes in scale and viewing angle and high similarity between classes. An effective indoor scene classification mechanism based on multiple descriptors fusion is proposed, which introduces depth images to improve descriptor efficiency. The greedy descriptor filter algorithm (GDFA) is proposed to obtain valuable descriptors, and a multiple descriptor combination method is also given to further improve descriptor performance. Performance analysis and simulation results show that multiple descriptors fusion not only achieves higher classification accuracy than principal components analysis (PCA) for medium and large descriptor sizes but also improves classification accuracy over other existing algorithms.


Introduction
With the rapid development of the Internet and the increasing demand for applications based on location awareness, location-based services are receiving extensive attention. Most people rely on location services and GPS (Global Positioning System) navigation in their daily life. Outdoor localization technology is relatively mature, and many mobile devices make use of it [1,2,3,4]. Because of the particularity of the indoor environment, however, the GPS signal cannot directly meet the requirements of indoor localization services. At present, there are many indoor localization methods [4][5][6], mainly based on WiFi, RFID, Bluetooth, ultrawide band, and so on. The visual indoor localization system [7][8][9] is attracting more and more attention from researchers all over the world because of its low deployment cost, strong autonomy, and high localization accuracy.
To achieve accurate indoor visual localization, a large visual database, namely, the Visual Map, is established at the offline stage. The Visual Map may contain a large number of images or image features of different scenes together with the corresponding location information, and it is the foundation of visual indoor localization. When the user performs a location query online, the query image is retrieved in the Visual Map. Traditional image retrieval algorithms rely on pixel point matching [10,11], which gives only the image matching result and carries no location information for the visual image. In addition, existing image retrieval algorithms often carry out a global traversal search, which leads to excessive time overhead and is not conducive to real-time localization of mobile users. Therefore, an effective indoor scene classification mechanism based on multiple descriptors fusion is proposed in this paper. The images in the Visual Map are classified according to their scenes, so as to reduce the time overhead of visual image retrieval at the online stage and improve the efficiency and accuracy of indoor scene classification. In this paper, both the visual information and the depth information of an image are fused. The visual image mainly contains color information, and each point on the depth image corresponds to a point in the visual image and contains position information. Both types of images are captured by Microsoft Kinect 2.0.
In the indoor scene classification mechanism, the initial descriptor set containing two kinds of image descriptors is generated by the existing spatial pyramid model (SPM) [12,13]. Then, the greedy descriptor filter algorithm (GDFA) is proposed to find the valuable descriptors. Multiple fusion descriptors are generated by homologous and nonhomologous combination to further enhance the effectiveness of the descriptors. Finally, a support vector machine (SVM) is adopted for classification. The overall framework of the indoor scene classification mechanism is shown in Figure 1. The remainder of the paper is arranged as follows: Section 2 reviews the research progress of scene classification techniques and their applications in indoor scenes. Section 3 describes the generation of the initial descriptor set and the descriptor filtering in detail. Section 4 introduces the experimental database of this paper and shows the descriptor evaluation results. In Section 5, the homologous and nonhomologous combinations are realized and the combination results are evaluated. Section 6 concludes the article.

Motivation
At the Scene Understanding Symposium held at MIT in 2006, an important point was clearly stated for the first time, namely, that scene classification is a promising new research direction for image understanding. Although existing classification methods claim to be able to solve any scene classification problem [14,15], experimental results show that only outdoor scene classification is solved effectively by these methods, while indoor scene classification remains a challenging task. In addition, [16] shows that the classification accuracy of indoor scenes is far lower than that of outdoor scenes when the same feature extraction and classification methods are adopted. Therefore, it is important to improve the classification accuracy of indoor scenes.
In early studies, low-level features of images, such as color, texture, and shape, were usually extracted to classify scenes [17][18][19]. However, methods based on low-level features have not become a hot topic in the field of scene classification because of their unsatisfactory classification results. To overcome this problem, methods based on middle-level image features were proposed. The global feature Gist is adopted and improved in [20]. Because of its good identification ability, the scale invariant feature transform (SIFT) is adopted as the highest-priority local feature in many scene recognition algorithms [21]. Shi et al. [22] proposed an indoor scene classification algorithm based on the enhancement of visual sensitive area information, in which local features and global features are integrated through the visual sensitive area information.
With the rise of Kinect, scene classification algorithms based on depth information [24,25] have received more and more attention. The histogram of oriented gradient (HOG) algorithm [26] is adopted to classify depth images and visual images, respectively [28]. SIFT is adopted to extract features of depth images and color images, and SPM coding is adopted to classify images after feature fusion [29]. SIFT of visual images and speeded up robust features (SURF) [27] of depth images are fused to classify images [30]. Five deep core feature extraction algorithms are designed in [31] to extract the size, edge, and shape information of visual images, respectively, and the extracted information is fused for classification.
As research continues, models based on the convolutional neural network (CNN) [16,23] have attracted researchers' attention. However, a CNN requires massive training sets, which may result in relatively long training time. In addition, a CNN usually places high computing requirements on the platform, so it is difficult to realize indoor scene classification on a platform with limited computing resources.

Multiple Image Descriptor Generation and Filtering
Inspired by [28][29][30][31], visual information and depth information are fused in this paper. Higher-accuracy indoor scene classification is achieved through the spatial 3D information contained in the depth image, which is insensitive to light and reflects the position relationship between objects. Features of the original images are extracted by D-SIFT (Dense SIFT) [32], and similar features are clustered by K-means [36,37] to form the BoW (Bag-of-Words) [33][34][35]. Based on the BoW, the initial descriptor set, including visual image descriptors and depth image descriptors, is generated with the construction of the SPM. The initial descriptors are large in number and uneven in quality. In addition, directly combining unfiltered initial descriptors would lead to an explosion in the number of combined results. Therefore, a simple and effective descriptor filtering algorithm is needed to obtain the valuable descriptors.

Initial Descriptors Generation.
The descriptor generation expression can be derived as follows. Let $I$ be any input image and $x$ be a descriptor generated from that image. $L$ is a set of predefined class tags, and $l$ is one of them. The function that generates descriptor $x$ from image $I$ can be expressed as $g(I) = x$, and the probability of successfully matching descriptor $x$ to class tag $l$ is $P(l \mid x)$. Therefore, the most appropriate class tag $\hat{l}$ is
$$\hat{l} = \arg\max_{l \in L} P(l \mid g(I)). \quad (1)$$
The key to the research is turning the initial descriptors into valuable descriptors with high classification accuracy. To find such descriptors, equation (1) is further optimized. On the premise of the best descriptor filtering and combination methods, let $l^{*}$ denote the correct class label assigned to input image $I$ (distinguished from a generic tag $l$), and let $X$ denote a set of multiple image descriptors. Then, the optimized descriptor generation expression is
$$g^{*}(I) = \arg\max_{g(I) \in X} P(l^{*} \mid g(I)). \quad (2)$$
According to equation (2), the initial descriptors generated from the input image can reach the desired classification effect only through filtering and combination. Initial descriptors are large in number and poor in quality, while descriptor filtering can discard worthless descriptors and descriptor combination can improve the effectiveness of descriptors.
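Equations (1) and (2) can be illustrated with a minimal Python sketch. The posterior values, the scene names, and the descriptor names `g1` and `g2` are hypothetical and chosen only for illustration; in practice the posteriors would come from a trained classifier.

```python
def most_likely_tag(posteriors):
    """Equation (1): return the class tag l that maximizes P(l | g(I)).

    `posteriors` maps each tag in L to a (hypothetical) probability
    produced by some classifier for the descriptor x = g(I).
    """
    return max(posteriors, key=posteriors.get)


def best_descriptor(candidates, correct_tag):
    """Equation (2): return the descriptor in X that maximizes P(l* | g(I))."""
    return max(candidates, key=lambda name: candidates[name][correct_tag])


# Illustrative numbers only; they are not taken from the paper.
posteriors_by_descriptor = {
    "g1": {"corridor": 0.5, "office": 0.3, "lab": 0.2},
    "g2": {"corridor": 0.7, "office": 0.2, "lab": 0.1},
}
predicted = most_likely_tag(posteriors_by_descriptor["g1"])
chosen = best_descriptor(posteriors_by_descriptor, "corridor")
```

Here both descriptors predict the same tag, but `g2` assigns it a higher probability, so it is the one equation (2) would prefer.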
The descriptor generation process based on SPM is described as follows.

Spatial Pyramid Model.
In recent years, the BoW model has been widely adopted in computer vision. It takes image features as visual words and classifies images by counting the number of visual words in each image. However, the traditional BoW lacks spatial position information [29]. In this research, the SPM is established to cut the image into cells at several scales; the number of visual words is then counted in each cell and the histograms are drawn. Finally, the histogram features at all scales are linked together to form a feature vector. We assume that a subset of visual words has been selected as basic features. The steps of descriptor generation based on SPM are described in detail as follows: (i) Extracting the D-SIFT feature. The SPM-based descriptor generation process is shown in Figures 2(a)-2(c), and each cutting type is divided into three columns for clear explanation. As shown in Figure 2(a), the first column shows the cutting type of the initial image, the second column represents the statistical results of visual words for each cell, and the third column shows the initial descriptors formed by connecting the histograms of the second column. The example uses 5 visual words, three pyramid hierarchies, and three cutting methods: vertical, horizontal, and grid.
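The cell-wise counting described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that every cut halves each existing cell, that horizontal cutting produces horizontal strips, and that each image location has already been assigned a visual-word index. The function name `spm_descriptor` and the toy `word_map` are hypothetical.

```python
def spm_descriptor(word_map, n_words, hierarchy, cutting):
    """Concatenate per-cell visual-word histograms for one pyramid hierarchy.

    word_map  : 2D list; word_map[r][c] is the visual-word index at that location
    n_words   : BoW size S
    hierarchy : H, the number of times the image is cut
    cutting   : "horizontal", "vertical" or "grid"
    """
    rows, cols = len(word_map), len(word_map[0])
    parts = 2 ** hierarchy
    # Assumption: horizontal cuts split the rows, vertical cuts split the
    # columns, and grid cutting does both.
    row_parts = parts if cutting in ("horizontal", "grid") else 1
    col_parts = parts if cutting in ("vertical", "grid") else 1

    descriptor = []
    for i in range(row_parts):
        r0, r1 = i * rows // row_parts, (i + 1) * rows // row_parts
        for j in range(col_parts):
            c0, c1 = j * cols // col_parts, (j + 1) * cols // col_parts
            hist = [0] * n_words
            for r in range(r0, r1):
                for c in range(c0, c1):
                    hist[word_map[r][c]] += 1
            descriptor.extend(hist)  # link the cell histograms together
    return descriptor


# Toy 4 x 4 map with 5 visual words (values are illustrative only).
word_map = [
    [0, 1, 1, 2],
    [0, 0, 3, 2],
    [4, 4, 3, 3],
    [4, 1, 2, 0],
]
whole_image = spm_descriptor(word_map, 5, 0, "grid")
two_strips = spm_descriptor(word_map, 5, 1, "horizontal")
```

With no cut, the descriptor is the plain BoW histogram of the whole image; with one horizontal cut it is the histogram of the top half followed by that of the bottom half.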
Descriptor generation based on SPM mainly depends on three important parameters: BoW size ($S$), pyramid hierarchy ($H$), and cutting method ($C$). $H = 0$ represents the first hierarchy, in which the image is cut 0 times; $H = 1$ represents the second hierarchy, in which the image is cut 1 time; $H = 2$ represents the third hierarchy, in which the image is cut 2 times. Therefore, the number of cuts depends on $H$. In other words, when $H = h$, the image is cut $h$ times, and the number of cells generated after cutting is $2^{HC}$. Finally, seven different descriptors are obtained in Figure 2, whose size increases exponentially with $H$ and $C$ and has a linear relationship with the dictionary size $S$. The descriptor size $\eta$ is calculated as
$$\eta = S \cdot 2^{HC}. \quad (3)$$
Image descriptors contain semantic and spatial distribution information of the scene. $S$ determines the semantic meaning of descriptors, while $H$ and $C$ govern the spatial distribution of descriptors, ensuring that more detailed information can be provided. A larger $S$ provides more detailed semantic information, making features more distinctive and more representative. However, if there are many visual words, the histogram becomes longer, which slows the subsequent image retrieval and matching process. Analogously, a higher pyramid hierarchy contains more detail, while a lower hierarchy is more general.
As can be seen from [12,13,38], the standard values of the three parameters are $S = 20$, 50, and 100; $H = 0$, 1, and 2; and $C = 1$ (horizontal and vertical segmentation) and 2 (grid segmentation), respectively. 21 different visual image descriptors and 21 depth image descriptors can be obtained by combining these standard values. The number of descriptors is 21 instead of 27 ($3^3$) because $H = 0$ in the pyramid model does not cut the image, so the cutting method makes no difference at that hierarchy. In other words, for any $S$, the first pyramid hierarchy yields only one descriptor, while the second and third pyramid hierarchies yield three descriptors each.
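The count of 21 descriptors per image type can be checked with a short Python enumeration. The sketch assumes that the cell count is $2^{HC}$ and that each cell contributes one $S$-bin histogram, so that the descriptor size is $S$ times the number of cells; the names below are illustrative.

```python
S_VALUES = (20, 50, 100)
H_VALUES = (0, 1, 2)
# C = 1 for horizontal or vertical segmentation, C = 2 for grid segmentation.
CUTTINGS = {"horizontal": 1, "vertical": 1, "grid": 2}


def enumerate_descriptors():
    """List (S, H, cutting, size) for every distinct initial descriptor."""
    configs = []
    for s in S_VALUES:
        for h in H_VALUES:
            if h == 0:
                # No cut is made, so all cutting methods give the same descriptor.
                configs.append((s, h, "none", s))
                continue
            for name, c in CUTTINGS.items():
                cells = 2 ** (h * c)  # assumed reading of the cell count
                configs.append((s, h, name, s * cells))
    return configs


configs = enumerate_descriptors()
print(len(configs))
```

Under these assumptions the configuration $S = 100$, $H = 2$, $C$ = horizontal has size 400, which matches the descriptor sizes quoted later in the paper.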

Descriptors Filtering.
In this section, the greedy descriptor filter algorithm (GDFA) is proposed to find the most valuable descriptors in the initial descriptor set. Since $\eta$ of the initial descriptors is mainly concentrated in (0, 400] (as shown in Figure 3), $\eta$ is divided into three continuous intervals, (0, 150], (150, 350], and (350, ∞), for the convenience of descriptor filtering. We assume that the small, medium, and large intervals are suitable for data-gathering platforms with low, medium, and high computing power configurations, respectively. The descriptor weight $\alpha$ is related to the descriptor classification accuracy $\zeta$ and the descriptor size $\eta$. To obtain descriptors with smaller size and higher accuracy, the weight $\alpha$ is defined by equation (4). The flow of the greedy descriptor filtering algorithm (GDFA) is given in Algorithm 1.
At first, the weight of every descriptor is calculated according to equation (4). Next, the descriptors are divided by size into the three continuous intervals (0, 150], (150, 350], and (350, ∞), and within each interval they are sorted by weight from largest to smallest. The descriptor with the largest weight in $N_i$ is selected and added to the first position in $\Phi_i$. Each following descriptor is selected if its predecessor was selected and its weight is greater than 95% of the predecessor's weight, that is, $\alpha_i > 0.95\,\alpha_{i-1}$; otherwise, the next descriptor is compared. In this way, GDFA finds the most valuable descriptor in each interval together with the descriptors whose weights are similar to it.
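The selection rule described above can be sketched in Python. This is one reading of the procedure under stated assumptions, not the authors' code: the weights are taken as precomputed inputs because the weight formula is not reproduced here, the interval boundaries are treated as half-open, and the function names and sample descriptors are hypothetical.

```python
def interval_index(size):
    """Map a descriptor size to the small (0), medium (1) or large (2) interval."""
    if size <= 150:
        return 0
    if size <= 350:
        return 1
    return 2


def gdfa(descriptors):
    """Greedy selection per size interval.

    descriptors: list of (name, size, weight) tuples, where the weight is
    assumed to have been computed beforehand from accuracy and size.
    Returns one list of selected names per interval.
    """
    buckets = [[], [], []]
    for d in descriptors:
        buckets[interval_index(d[1])].append(d)

    selected = [[], [], []]
    for i, bucket in enumerate(buckets):
        ranked = sorted(bucket, key=lambda d: d[2], reverse=True)
        if not ranked:
            continue
        selected[i].append(ranked[0][0])  # the largest weight is always kept
        previous_selected = True
        for j in range(1, len(ranked)):
            close = ranked[j][2] > 0.95 * ranked[j - 1][2]
            # A descriptor is kept only if its predecessor was kept as well.
            previous_selected = previous_selected and close
            if previous_selected:
                selected[i].append(ranked[j][0])
    return selected


# Illustrative descriptors only; sizes and weights are not from the paper.
sample = [
    ("a", 20, 1.00), ("b", 40, 0.97), ("c", 100, 0.80),
    ("d", 200, 0.50), ("e", 400, 0.30), ("f", 800, 0.29),
]
result = gdfa(sample)
```

In the sample, `c` is rejected because its weight falls below 95% of the weight of `b`, while `f` is kept because it stays within 5% of `e`.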

Experimental Database.
To study the indoor scene classification mechanism, the indoor image data gathering platform with Microsoft Kinect 2.0 shown in Figure 4(a), independently developed by the laboratory, is adopted to gather image data in the Heilongjiang University physical laboratory building. The database contains visual and depth images captured in 9 indoor scenes under different lighting conditions. As examples, Figure 4(b) shows part of the database images.
The database images are randomly divided into 5 sequences, namely, Training 1, 2, and 3 and Test 1 and 2. The number of images for the 9 scenes in the 5 sequences is listed in Table 1.

Evaluation Results and Analysis.
K-fold cross-validation is a common accuracy test method, which can effectively avoid overfitting and underfitting. 10-CV (10-fold cross-validation) is adopted to evaluate the classifier model in this section. To keep the folds comparable, subsets of 30 consecutive images are randomly assigned to Fold1-Fold10 (the 10 folds, each made up of such 30-image subsets), which effectively prevents any deviation caused by the time continuity in the data set. Figure 5 shows the distribution of each scene in the data set in each fold of 10-CV, together with the global distribution. It is worth noting that the scenes in the data set are not evenly distributed across Fold1-Fold10. Table 2 shows the classification accuracy of the 42 initial descriptors (visual image descriptors and depth image descriptors) under 10-CV. In SPM, when $H = 0$, no image cutting takes place for any segmentation type and the generated descriptors are identical, so the evaluation results are identical too. Comparing the results of visual images and depth images, we find that the classification accuracy of depth images is significantly lower than that of visual images. The reason may be that the visual coding technology of the depth image (visual coding is the mapping between data and visual results) is not accurate enough to obtain fine-grained data.
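One possible reading of the fold assignment is sketched below in Python: the image sequence is split into blocks of 30 consecutive frames and whole blocks are placed into folds, so that near-duplicate neighbouring frames never straddle a train/test boundary. Shuffling the blocks and dealing them round-robin is an assumption made here to keep the folds the same size; the function names and the image count of 600 are hypothetical.

```python
import random


def assign_blocks_to_folds(n_images, block_size=30, n_folds=10, seed=0):
    """Return a list giving the fold index (0..n_folds-1) of every image.

    Images are grouped into blocks of `block_size` consecutive frames; the
    blocks are shuffled and then dealt round-robin to the folds.
    """
    n_blocks = (n_images + block_size - 1) // block_size
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)
    fold_of_block = {block: k % n_folds for k, block in enumerate(order)}
    return [fold_of_block[i // block_size] for i in range(n_images)]


def cross_validation_splits(fold_of_image, n_folds=10):
    """Yield (train_indices, test_indices) for each fold in turn."""
    for fold in range(n_folds):
        test = [i for i, f in enumerate(fold_of_image) if f == fold]
        train = [i for i, f in enumerate(fold_of_image) if f != fold]
        yield train, test


folds = assign_blocks_to_folds(n_images=600)
splits = list(cross_validation_splits(folds))
```

Because blocks rather than single images are assigned, the scene distribution inside each fold can differ from the global distribution, which is consistent with the uneven distribution noted in the text.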
GDFA can find the valuable descriptors in the initial descriptor set, which facilitates the descriptor combination work in Section 5. Table 3 shows the internal parameters and classification accuracy of the 4 visual image descriptors and 7 depth image descriptors filtered by GDFA; as before, the evaluation data are from 10-CV. In other words, the 42 initial descriptors given in Table 2 …

ALGORITHM 1: Greedy descriptor filtering algorithm.
(9) Filter the descriptor $N_i[1]$ with the largest weight in $N_i$ and add $N_i[1]$ to $\Phi_i$
(10) if $N_i[j-1]$ is filtered and $\alpha[j] > 0.95 \cdot \alpha[j-1]$ then
(11) Add $N_i[j]$ to $\Phi_i$
(12) else
(13) end
(14) end
(15) end
Output: filtered descriptor list $\Phi$

Dimensionality reduction with PCA can preserve the most important features in high-dimensional data and remove noise and worthless features, which improves data quality and data processing speed. Figure 3 shows the comparison between the filtering result of GDFA and the dimensionality reduction result of PCA (the solid point in Figure 3 …).

Descriptor Combination
The most valuable descriptors have been selected by GDFA in Section 4. To further obtain a high-quality and highly efficient final descriptor, this section proposes a multiple descriptor combination algorithm (this section only combines two descriptors), although this step might increase the running time of scene classification. There are two descriptor combination levels, as shown in Figure 6.
One is the descriptor level (DL), in which the descriptors of Image1 and Image2 are connected into one combination descriptor that is then input to SVM1, as shown in Figure 6(a). The other is the classifier level (CL), in which Image1 and Image2 are input to SVM1 and SVM2 separately and the different response results are then weighted, as shown in Figure 6(b). This section also discusses homologous combinations (V + V or D + D) and nonhomologous combinations (V + D).
Table 1: The number of images of 9 scenes in 5 sequences.
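The difference between the two levels can be illustrated with a small Python sketch. The classifiers themselves are left out: DL only needs the concatenated vector, and CL is shown on hypothetical per-class response scores. The mixing weight `weight_a`, the scene names, and the function names are assumptions, since the text does not specify how the responses are weighted.

```python
def combine_descriptor_level(desc_a, desc_b):
    """DL: concatenate two descriptors into one vector for a single classifier."""
    return list(desc_a) + list(desc_b)


def combine_classifier_level(scores_a, scores_b, weight_a=0.5):
    """CL: weight the per-class responses of two separately trained classifiers.

    scores_a / scores_b map class tags to classifier responses; `weight_a`
    is a hypothetical mixing weight.
    """
    weight_b = 1.0 - weight_a
    fused = {
        tag: weight_a * scores_a[tag] + weight_b * scores_b[tag]
        for tag in scores_a
    }
    return max(fused, key=fused.get), fused


# DL: a 20-bin and a 200-bin descriptor give a 220-bin combination.
dl_vector = combine_descriptor_level([0] * 20, [0] * 200)

# CL: illustrative responses from two classifiers for the same query.
label, fused = combine_classifier_level(
    {"corridor": 0.6, "office": 0.4},
    {"corridor": 0.3, "office": 0.7},
)
```

DL lets a single classifier see both descriptors in full, whereas CL reduces each descriptor to a handful of scores before fusing them.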

Homologous Combinations.
This section combines two descriptors extracted from the same image type, namely, V + V or D + D, which is called homologous combination. The combination is carried out at DL and CL, respectively. The test set of the SVM is composed of two groups of sequences with obvious lighting differences, Test 1 and Test 2.

V + V.
There are 6 different combinations of the 4 visual image descriptors V1, V2, V3, and V4 given in Table 3, which are applied to DL and CL, respectively. The classification accuracy obtained in Test 1 and Test 2 is shown in Figures 7(a) and 7(b), respectively.

D + D.
There are 21 different combinations of the 7 depth image descriptors D1, D2, D3, . . ., D7 given in Table 3, which are applied to DL and CL, respectively. The classification accuracy obtained in Test 1 and Test 2 is shown in Figures 8(a) and 8(b), respectively. Comparing Figure 7 with Figure 8, we find that the classification accuracy of D + D is generally lower than that of V + V. The highest classification accuracy achieved by the best depth image descriptor D7 is 48.79% in Test 1 and 65.45% in Test 2 (while the best visual image descriptor V4 achieves 74.76% and 85.78%, respectively). When the best initial descriptor D7 acts as the parent descriptor, the highest classification accuracy of DL is 56.07% in Test 1 and 71.86% in Test 2. The classification accuracy in Test 2 is thus still higher than that in Test 1 for D + D.
As with V + V, DL outperforms CL in D + D. The classification accuracy of combination descriptors in DL is almost always higher than that of their parent descriptors (39 out of 42), while only a few combination descriptors have higher classification accuracy than their parent descriptors in CL (16 out of 42). The internal parameters of D7 are $S = 100$, $H = 2$, and $C$ = horizontal. D5 + D7 (56.07%) achieves a favorable effect, and the internal parameters of D5 are $S = 100$, $H = 1$, and $C$ = horizontal. D2 + D7 (71.86%) also achieves a favorable effect, and the internal parameters of D2 are $S = 50$, $H = 2$, and $C$ = horizontal. What the optimal combinations have in common is $C$ = horizontal, which is consistent with the results in Section 4. In addition, the internal parameters of both V4 and D7 are $S = 100$, $H = 2$, and $C$ = horizontal. So, we can speculate that high classification accuracy can be obtained by descriptors with this group of internal parameters, which will be verified in Section 6.

Nonhomologous Combinations.
This section combines two descriptors extracted from different image types, namely, V + D, which is called nonhomologous combination. There are 28 different combinations of V1, V2, V3, and V4 with D1, D2, D3, . . ., D7 in Table 3, which are applied to DL and CL, respectively. The specific evaluation process is the same as for homologous combination, and the evaluation results are shown in Figure 9.
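The combination counts quoted for the three cases follow directly from the sizes of the filtered descriptor sets, as the short Python check below shows. Reading the "out of 42" and "out of 56" totals as the number of combinations evaluated over the two test sequences is an interpretation, not something the text states explicitly.

```python
from itertools import combinations, product

visual = ["V1", "V2", "V3", "V4"]
depth = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

vv_pairs = list(combinations(visual, 2))  # homologous V + V
dd_pairs = list(combinations(depth, 2))   # homologous D + D
vd_pairs = list(product(visual, depth))   # nonhomologous V + D

print(len(vv_pairs), len(dd_pairs), len(vd_pairs))
# Each combination is evaluated in Test 1 and Test 2.
print(2 * len(dd_pairs), 2 * len(vd_pairs))
```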
In Test 2, the highest classification accuracy of CL and DL reaches 80.36% and 92.64%, respectively, while in Test 1, it reaches 72.84% and 81.76%. This is consistent with what we found before: the classification accuracy of Test 2 is always higher than that of Test 1, and DL always outperforms CL. In CL, the combination with the highest classification accuracy in Test 1 is D5 + V4 (72.84%), whereas the classification accuracy of V4, which acts as a parent descriptor, is 74.76%. The combination with the highest classification accuracy in Test 2 is D7 + V4 (80.36%), whereas the classification accuracy of the parent descriptor V4 is 85.78%. As shown in Figures 9(a) and 9(b), only a few combination descriptors have higher classification accuracy than their parent descriptors in CL (18 out of 56), the same as in homologous combinations. This shows that the result of CL is not satisfactory.
In DL, the combination with the highest classification accuracy in Test 1 is D7 + V4 (81.76%), whereas the classification accuracy of V4, which acts as a parent descriptor, is 74.76%. The combination with the highest classification accuracy in Test 2 is D7 + V4 (92.64%), whereas the classification accuracy of the parent descriptor V4 is 85.78%. As shown in Figures 9(a) and 9(b), the classification accuracy of combination descriptors in DL is always higher than that of their parent descriptors (56 out of 56).
We can conclude that DL outperforms CL in nonhomologous combination because the combination descriptors in DL outperform their parent descriptors, while the combination descriptors in CL rarely do so. In addition, at either level, combinations that include a descriptor with excellent performance outperform the other combinations of the same poorly performing descriptor. For example, D1 + V4 ranks ahead of D1 + V1, D1 + V2, and D1 + V3 in Figure 9(b).
Combining Figures 7-9, we can conclude that the overall effect of V + V and D + V is better than that of D + D. Sometimes V + V outperforms D + V, although nonhomologous combinations contain more comprehensive information. DL combines descriptors before they enter a classifier, which may preserve the characteristics of the descriptors completely. This may be the reason why DL is always better than CL. So, we only compare the evaluation results of V + V and V + D in DL. Table 4 lists the best homologous and nonhomologous combinations in DL, as well as the highest classification accuracy (bold data) obtained in Test 1 and Test 2. The best combination is V3 + V4 in Test 1, and the best combination is D2 + V4 in Test 2. We recall that the light variation in Test 1 is stronger than that in Test 2. So V + V can be the best in Test 1, while D + V can be the best in Test 2.
As shown in Table 3, the descriptor size (for a single descriptor or a combination descriptor) has 8 possible values: 20, 40, 200, 220, 400, 420, 600, and 800. The maximum classification accuracy corresponding to each descriptor size value is compared with the PCA results. Figure 10 shows the relationship between classification accuracy and descriptor size in Test 1 and Test 2. The classification accuracy of the multiple descriptors fusion mechanism improves significantly as the descriptor size goes from small to medium, and it gradually stabilizes as the descriptor size goes from medium to large. In Test 1, when the descriptor size equals 400 (large), V2 + V3 (80.94%) gets the highest classification accuracy. In Test 2, when the descriptor size equals 600 (large), D2 + V4 (92.64%) gets the highest classification accuracy. PCA achieves high classification accuracy when the descriptor size is small. The superiority of the multiple descriptors fusion mechanism becomes obvious as the descriptor size increases.

Execution Time.
Indoor scene classification is divided into two stages: offline training and online testing. It is assumed that the construction of the BoW and the classifier training have been completed at the offline stage. Therefore, what affects the running time of the online stage is the generation and classification of descriptors, which comprise 4 steps, as shown in Table 5.
Step 2 is related to the BoW size ($S$), so $S = 20$, 50, and 100 are studied, respectively. Step 3 depends on the size and number of image cells, which are related to the pyramid hierarchy ($H$) and cutting method ($C$). Step 4 is determined by $\eta$.

Algorithm Analysis and Comparison.
Under the same database, the classification accuracy obtained by our mechanism is compared with that of other fusion methods, as shown in Table 6. The classification accuracy obtained by the algorithms with single feature fusion [28][29][30] tends to be low for indoor scenes, largely because these algorithms do not filter descriptors. So it seems that algorithms with single feature fusion are not well suited to indoor scene classification. Higher classification accuracy is obtained by the algorithm with multiple features fusion [31], which extracts five different kernel descriptors from the images. After integration, they are trained and classified by Linear SVM, Kernel SVM, and Random Forest, respectively, and obtain 89.6%, 90.0%, and 90.1% accuracy in this experiment. Our classification mechanism achieves 92.6% accuracy, which is 2.5 percentage points higher than the best result of [31]. Overall, the multiple descriptors fusion mechanism performs well in indoor scene classification.

Conclusion
To meet the actual demands of indoor positioning applications, a multiple descriptors fusion model is established and an image classification strategy is proposed to improve the quality and efficiency of descriptors so as to achieve a better indoor scene classification effect. Firstly, the initial descriptor set is formed based on the established SPM. Then, the greedy descriptor filtering algorithm is adopted to select the descriptors with high weight in each descriptor size interval, and a valuable descriptor set is obtained. Finally, the multiple descriptors combination algorithm is proposed to obtain high-quality and highly efficient multiple descriptors by combining homologous and nonhomologous images at DL and CL, respectively. The generation, filtering, and combination of multiple descriptors proposed in this study improve the performance of the classifier. The evaluation results show that the multiple descriptors fusion mechanism proposed in this study outperforms the well-known PCA dimensionality reduction technique, especially for medium and large descriptor sizes.
This strategy not only achieves better results than other feature fusion algorithms but also addresses the limitations of existing scene classification algorithms when applied to indoor scenes.
Future research will focus on improving the image feature extraction algorithm and on making the construction of visual words in the visual BoW model more efficient by using other clustering algorithms. More attention will be paid to enhancing the effectiveness of descriptors when describing image information. At the same time, improving the quality of the depth image will be taken into account so as to make more efficient use of depth data in the process of descriptor filtering and descriptor combination. In addition, a more complete data set can be adopted.

Data Availability
The data results used to support the findings of this study are presented in this paper.