Building use and mixed-use classification with a transformer-based network fusing satellite images and geospatial textual information



Introduction and related work
Urban land use, as the highest level of human modification (Li et al., 2020; Theobald et al., 2020), reflects socio-economic functions and human activities. It is an important component of urban planning (Srivastava et al., 2019), landscape design, environmental management, health promotion, biodiversity conservation (Chen et al., 2021b) and city digital twins (Akroyd et al., 2022; Xia et al., 2022). Most current research assigns a dominant category to each land use unit and excludes the presence of mixed-use (Chen et al., 2021b; Gong et al., 2020; Häberle et al., 2022; Srivastava et al., 2018b; Zhu et al., 2019). Mixed-use is defined as a blending of multiple uses of a single object in the same space. Mixed-use may occur for different spatial units, in particular individual buildings, street blocks, and neighborhoods (Raman and Roy, 2019). Mixed-use, however, is a critical component in smart growth (Song et al., 2013), public health (McGuire, 2014), quality of life (Urbanism, 2000), compact cities, eco-cities, cycling-friendly cities, and sustainable development (Jiao et al., 2021). For example, the Congress for the New Urbanism's Charter argues that "neighborhoods should be compact, pedestrian-friendly, and mixed-use" (Urbanism, 2000). Acquiring mixed-use information is thus the basis for evaluating existing planning and design as well as for planning future urban development strategies. Recent research on mixed-use focuses on its quantification, e.g. using the Shannon diversity index (He et al., 2021) of administrative units, without considering detailed use information. Therefore, it is important to map land use including its mixed-use.
Due to population increase and high urbanisation rates, urban land use changes rapidly. Collecting land use information, including mixed-use, at fine spatial scales is usually laborious and resource intensive, involving numerous field surveys (Liu et al., 2018; Zhan et al., 2014). This makes it imperative to design models that are capable of automating the generation of accurate and up-to-date land use maps including mixed-use.
Remote sensing (RS) images can provide rich and specific land use information. Methods for land use classification (LUC) from RS images can be categorized into three types based on how image information is used: 1) traditional pixel-based classification methods, such as support vector machines, fuzzy k-means clustering (He et al., 2014), and maximum likelihood classification (He et al., 2014; Khorram et al., 1987; Rozenstein and Karnieli, 2011), which use the spectral information of individual pixels directly; 2) object-based image analysis (OBIA) segmentation and classification methods (Galletti and Myint, 2014), which consider image spectral information and incorporate geometric and texture information of segmented objects; and 3) deep learning based image classification methods (Bergado et al., 2020; Huang et al., 2018a; Li and Stein, 2020; Zhang et al., 2018a, 2019; Zhou et al., 2020), which automatically learn a large number of deep features from images without manual feature extraction. These last methods usually obtain better performance than OBIA.
While the spatial resolution of RS images and the available classification methods have improved, it is still challenging to obtain detailed land use information. Other types of data sources may also provide land use information, such as social media images reflecting building instance classification (Hoffmann et al., 2022; Hoffmann et al., 2019; Zhu et al., 2019), and social text information reflecting land use (Chen et al., 2020; Häberle et al., 2019; Jendryke et al., 2017; Zhu et al., 2019). Multiple data sources have been combined in the past for detailed urban land use classification (Chen et al., 2021a; Gong et al., 2020; Hu and Wang, 2012; Huang et al., 2018b; Song et al., 2018). Here we consider point of interest (POI) data: point data with coordinates and a site name, indicating, for instance, the use of a location. POI data have been employed for building use classification (Deng et al., 2022; Lin et al., 2021), urban mixed-use measurement (Liu et al., 2018; Yue et al., 2017), and urban land use mapping (Barlacchi et al., 2021; Zhong et al., 2020). To achieve land use maps with detailed use information, we leverage multiple data sources, capitalizing on recently developed possibilities to fuse social media and RS data for geo-information retrieval, following Zhu et al. (2022).
Each data source can be seen as a modality. Multimodal integration refers to the process of integrating information from multiple modalities to create a coherent perception or understanding of the world. Multimodal integration methods can be divided into three types. 1) Data fusion at the early or input level (Khorram et al., 1987), where data that share the same form of media are combined, generating a new type of data; an example is the fusion of panchromatic images with multispectral images to produce a new image with high spatial and spectral resolution (pansharpening). 2) Feature fusion at the intermediate level (Antol et al., 2015; Mroueh et al., 2015; Ouyang et al., 2014; Wu et al., 2014), where features are extracted from different forms of media and fused into a shared feature space. For instance, information is extracted from an image and from a text, both are transposed into vectors, and the vectors of the two modalities are concatenated into a new vector. 3) Decision fusion at the late level (Cao et al., 2018; Chen et al., 2021a; Gong et al., 2020; Häberle et al., 2022; Workman et al., 2017), where each modality generates one decision, and the results of the different modalities are combined into an overall decision. Feature fusion and decision fusion are most suitable for research with input data consisting of different forms of media, such as image and text. So far, LUC research has been based primarily upon multimodal decision fusion (Cao et al., 2018; Chen et al., 2021a; Gong et al., 2020; Häberle et al., 2022; Lu et al., 2022; Workman et al., 2017; Zhong et al., 2020), while feature fusion based LUC studies (Srivastava et al., 2019) are rare.
Much effort has been made to develop multimodal land use classification. So far, the following problems have not yet been solved. 1) LUC research is mainly based upon pixels, objects (Häberle et al., 2022; Kang et al., 2018; Srivastava et al., 2019), and scene blocks (Zhang et al., 2018b; Zhou et al., 2020) as basic units. These bigger spatial units, however, may contain several land use categories, while current research usually assigns a single dominant category to each unit, thus neglecting mixed-use within the unit. 2) When fusing imagery information with textual information, most research uses decision fusion, neglecting the relationship between different modality features. For example, Häberle et al. (2022) used a bi-directional LSTM for text classification, several CNN models to classify images, and a single decision fusion method to combine their results. The above research, including Song et al. (2018), integrated the classification of textual data with features from other data or with classification results, but has not effectively used the relations between features extracted from different modalities.
To alleviate the above shortcomings, we recognise that buildings may be devoted to more than one human activity. Considering that building use classification is a subset of land use classification, we propose a multimodal Transformer-based feature fusion for building use classification based on remote sensing images and POI data. In our study, buildings are the smallest non-divisible units for land use classification. Rather than assigning a dominant category to land use units, as current land use studies do, we assign building use categories that account for mixed-use. To do so, we utilise the relationships between different modalities by projecting textual features and image features into the same space, and then use a Transformer network (Vaswani et al., 2017) to classify the fused features. The contributions of this work are as follows. 1) We increase the spatial and attributive grain of LUC by considering mixed-use of object (building)-level land use units. Thus, we diverge from assigning a dominant use category to each land use unit. Instead, we aim to predict the complete set of use categories for each building by treating each combination of uses as a distinct building use category. By doing so, we enrich the semantic information associated with buildings, offering a more comprehensive understanding of their functional attributes. In particular, this allows us to capture the intricate and diverse ways in which buildings are used, and provides a more nuanced representation of urban spaces. 2) We propose a multimodal Transformer-based feature fusion, which simultaneously learns textual features, image spatial features and their relationships, and gives different attention to features of different modalities. 3) We investigate the synergy between RS and POI data for fine attributive grain building use classification, and compare the performance of decision fusion based and feature fusion based multimodal integration for building use classification.

Study area and data
We selected four urban study areas in China: Wuhan, Zhengzhou, Xiamen, and Beijing (Fig. 1). These study areas cover northern, middle, and southern China, and represent a diverse range of geographic characteristics, including coastal and inland cities. In terms of social and economic factors, Beijing is classified as a first-tier city due to its high level of development and significant economic influence. Zhengzhou and Wuhan have recently been designated as first-tier cities, and Xiamen is considered a second-tier city. The choice of these four research areas is deliberate. They serve as test cases to evaluate the transferability of our proposed model. By including cities with different levels of development and economic profiles, we can assess how well our model adapts to varying urban landscapes and verify its effectiveness in diverse contexts.
The Wuhan study area covers most of the Jianghan district, and parts of the Dongxihu, Qiaokou, and Jiangan districts. The Zhengzhou study area covers parts of the Zhongyuan, Huiji, Jinshui, Guangchenghuizu, and Erqi districts. The Xiamen study area mainly lies in the Huli district and includes parts of the Siming district. Finally, the Beijing study area fully covers the Shijingshan district, and parts of the Mentougou, Haidian, Xicheng, and Fengtai districts. Satellite images of the first three cities were obtained from the SuperView-1 satellite, all with a spatial resolution of 0.5 m, while the image of Beijing was acquired from the GF-2 satellite with the same spatial resolution. The images of Wuhan, Zhengzhou, Xiamen, and Beijing are from 2019, 2016, 2020, and 2022, respectively. POI data of the four cities were acquired from "Amap" (https://lbs.amap.com/) for the same years as their satellite images. Building footprints of the four cities were downloaded from "Baidu Map", acquired in the year corresponding to their satellite images. These contain 6566, 40,487, 7179, and 54,445 building footprints, respectively.

Methodology
The building is the basic spatial unit of this research; to determine its use category, we need three key processes (Fig. 2). We first capture the RS image and POI data corresponding to the same building, and use a file to align these two types of data. Second, we generate building use classification data sets by manually labelling the category of each building and specifying its uncertainty. Third, we use this data set to train the multimodal deep learning model, and use the trained model to classify unlabelled buildings.

Data sets generation
We merged adjacent building polygons if they belong to the same building but were divided into several parts. Next, we used the building polygons to capture satellite image patches. As shown in Fig. 2, we generated the centre point of every building polygon, and then captured RS image patches by using this centre point as the centre of the captured patch and setting a suitable patch size. This process guarantees that the corresponding building lies in the centre of the captured RS image patch with its surroundings contained within the patch. In this research, the size of the extracted patches is 224 × 224 pixels. This size is large enough to capture most buildings and their surroundings.
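As an illustration, the patch-capture step can be sketched as follows, assuming a north-up image with a simple affine geotransform; the function name, origin, and coordinates are ours for illustration, not from the paper:

```python
def patch_window(centre_xy, origin_xy, pixel_size=0.5, patch=224):
    """Return the (col, row) pixel offset of a patch x patch window
    centred on a building centroid given in map coordinates."""
    cx, cy = centre_xy
    ox, oy = origin_xy
    col = int(round((cx - ox) / pixel_size)) - patch // 2
    row = int(round((oy - cy) / pixel_size)) - patch // 2  # image rows grow southwards
    return col, row

# Example: centroid 600 m east and 400 m south of the image origin
col, row = patch_window((600.0, -400.0), (0.0, 0.0))
```

The returned offset can then be used to crop the 224 × 224 patch from the image array.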
Matching between POI data and building polygons is done according to their spatial relationship. Most POIs lie within a building's polygon, while some POIs that describe a building's information may fall outside the building footprint, e.g., some of the POIs in the blue circle in Fig. 2. Therefore, we matched every POI with its corresponding building by searching for the nearest building within a 5 m radius of that POI.
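A minimal sketch of this matching rule (a POI inside a footprint is matched directly; otherwise the nearest footprint within a 5 m radius is taken). The helper names and data layout are illustrative, not from the paper:

```python
import math

def _seg_dist(px, py, ax, ay, bx, by):
    # distance from point (px, py) to segment A-B
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def _inside(px, py, poly):
    # even-odd ray casting point-in-polygon test
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > py) != (y2 > py) and px < (x2 - x1) * (py - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def match_poi(poi, buildings, radius=5.0):
    """Return the id of the nearest building footprint within `radius`
    metres of the POI, or None. `buildings` maps id -> vertex list."""
    px, py = poi
    best, best_d = None, radius
    for bid, poly in buildings.items():
        if _inside(px, py, poly):
            return bid
        d = min(_seg_dist(px, py, *poly[i], *poly[(i + 1) % len(poly)])
                for i in range(len(poly)))
        if d <= best_d:
            best, best_d = bid, d
    return best
```

In practice a spatial index (e.g. an R-tree) would replace the linear scan for city-scale footprint sets.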
We labelled the building use category based upon the two modalities according to the land use classification system proposed by the Ministry of Housing and Urban-Rural Development of the People's Republic of China (CAUPD, 2018), see the appendix. According to this classification system, building use is classified into six main types (Table 1). Currently, building use classification considering mixed-use is rare (Srivastava et al., 2018a). In our research we have considered buildings' mixed-use by assigning class labels that combine multiple categories.
We selected the Wuhan study area for generating the labelled data set, containing 6566 building footprints. We labelled building categories by combining a visual inspection of the RS image with a reading analysis of the POI data. For example, some residential and industrial buildings can be interpreted from the RS image only, and POI data can indicate whether other uses occur in these buildings. Some buildings are hard to interpret from the RS image and are labelled according to their POIs. Buildings that lack POIs and are hard to interpret from the RS image were assigned the category "Unknown". We found that 34.8% of the buildings lack POI data, and 45.7% among these are hard to label from their corresponding RS images alone. After labelling, building use was classified into 23 categories. As shown in Table 1, the number of samples for several categories is too low to train the multimodal deep learning model. Therefore, we selected 5451 buildings in 8 categories, i.e., the bold categories in Table 1, for our experimental analysis. We randomly selected 60% of the labelled buildings as training samples, 20% as validation samples, and 20% as test samples.
Table 2 shows the number of labelled samples in each category in the different data sets, and the proportion of samples including POI data. All samples of the categories "RBA", "BA", and "RA" contain POI data, accounting for 9.8% of the total; labelling into these categories relies on POI data. Also, 98.9% of category "RB", 96.5% of category "B", and 84.7% of category "A" contain POI data, together 57.2% of the total. The reason is that most labels are determined according to both types of data, especially the POI data, and only a few samples can be assigned labels according to the labels of surrounding similar buildings. Of the "I" (industrial use) buildings, 88.5% lack POI data, but these can be well recognised from the satellite image. Finally, 46.7% of "R" have no POI data, which means these residential buildings are labelled according to satellite imagery only.

Data augmentation
The maximum length of the input text sequence in the deep learning method is usually fixed. If the length of the input text sequence exceeds the maximum length, it is truncated, while if it is shorter, it is padded up to the maximum length using zero values. In this paper, the maximum length of the input text sequence has been set to 300 characters. To adequately learn the data features, we augmented the training and validation data sets of the Wuhan study area by adjusting the orientation of the satellite images and the sequence of the POI data contents. Fig. 3 shows a sample of a building's captured satellite image and its augmented result.
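The truncate-or-pad rule described above can be sketched as follows (the function name and zero pad id are our illustration):

```python
MAX_LEN = 300  # maximum input text length used in this paper

def fit_length(token_ids, max_len=MAX_LEN, pad_id=0):
    """Truncate a token-id sequence to max_len, or right-pad it
    with pad_id up to max_len, so every sequence has fixed length."""
    return token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))
```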
Among the 5451 selected building samples, 54.5% have more than one corresponding POI, and we gathered each building's POI data in two different orders. We adjusted the sequence of the POI data contents to augment the data set. Each downloaded POI has a unique ID, which we used to adjust the sequence. We combined the captured satellite images without orientation change, as in Fig. 3(b), with the POI contents (P1, P2 … Pn) in increasing order, and the captured satellite images with orientation change, as in Fig. 3(c), with the reversely ordered POI contents (Pn … P2, P1). The POI content for buildings without POI data was set to "Null".
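The pairing of augmented images with forward- and reverse-ordered POI contents can be sketched as below; the record layout and the ";" separator are assumptions for illustration:

```python
def augment(image_id, poi_records):
    """Pair the original image with POIs ordered by ID, and the
    orientation-changed image with the reverse order (Fig. 3 style).
    Buildings without POIs get the text "Null" in both pairs."""
    ordered = sorted(poi_records, key=lambda p: p["id"]) or [{"id": None, "text": "Null"}]
    text_fwd = ";".join(p["text"] for p in ordered)
    text_rev = ";".join(p["text"] for p in reversed(ordered))
    return [(image_id, "orig", text_fwd), (image_id, "flipped", text_rev)]
```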

Data set uncertainty
The use categories of buildings are manually labelled based upon the information reflected by RS and POI. The first uncertainty is associated with the ability of the two modalities to sufficiently reflect the actual use of the building. We carried out field checking to evaluate this uncertainty by comparing the labels given according to the two modalities with the in-situ results. The second uncertainty concerns human errors in manually labelling the training samples. We assessed this uncertainty component for Wuhan by randomly sampling 10% of the data set and relabelling the samples based upon the two modalities, followed by comparing the relabelled results with the original labels.
For buildings with multiple use labels, we used a multi-label evaluation method for the first type of uncertainty, using the accuracy (A, Eq. (1)) and the F1 score (F1, Eq. (2)). We denote the field audit label as Yi and the assigned label as Zi. The second type of uncertainty was evaluated by randomly selecting and relabelling building samples and calculating the correspondence between the relabelled and original results. Fig. 4 shows the field checking and relabelling samples for the first and second types of uncertainty evaluation.
where n is the total number of field checking samples.
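Equations (1) and (2) are not reproduced in this excerpt; the sketch below assumes the common example-based formulation of multi-label accuracy and F1, which matches the description of Yi (field audit label set) and Zi (assigned label set) above:

```python
def multilabel_accuracy(Y, Z):
    """Example-based accuracy: mean of |Yi ∩ Zi| / |Yi ∪ Zi| over samples."""
    return sum(len(y & z) / len(y | z) for y, z in zip(Y, Z)) / len(Y)

def multilabel_f1(Y, Z):
    """Example-based F1: mean of 2|Yi ∩ Zi| / (|Yi| + |Zi|) over samples."""
    return sum(2 * len(y & z) / (len(y) + len(z)) for y, z in zip(Y, Z)) / len(Y)
```

For a building audited as {"R", "B"} but labelled {"R"}, the per-sample accuracy is 1/2 and the per-sample F1 is 2/3, rewarding the partially correct mixed-use label.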
To evaluate the uncertainty of the classification results, 20% of the samples of Wuhan were selected as test data. For the other research areas, we randomly selected and manually labelled 1332, 1040, and 1411 samples as test data for the Zhengzhou, Xiamen, and Beijing study areas, respectively. The detailed numbers per category and the POI-containing ratios are shown in Table 3.

Transformer based multimodal deep learning model
The proposed method involves two modalities, providing image and textual features, respectively. Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP), as they offer better performance in terms of both accuracy and efficiency (Chen et al., 2018a; Vaswani et al., 2017).
Several studies on vision-language representation learning focus on modelling the interactions between image and text features with Transformer-based multimodal encoders (Huang et al., 2021; Lu et al., 2020; Su et al., 2020; Zhang et al., 2021b). The Transformer is therefore an excellent choice for textual and image feature extraction and fusion. The Supervised Multimodal Bitransformer (MMBT) model (Kiela et al., 2019) performs better than state-of-the-art methods. The MMBT model is built on the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019), which only utilises the encoder part of the Transformer.
Unlike BERT, the MMBT model includes a module to extract image features and fuse these with textual features as input to the model. It can employ self-attention over both modalities simultaneously, providing earlier and more fine-grained multimodal fusion. The MMBT model proposed by Kiela et al. (2019) used ResNet (He et al., 2016) to capture image features. In this research, we have replaced ResNet with DenseNet (Huang et al., 2017), because the identity shortcut of ResNet stabilises its training but limits its representation capacity, while DenseNet has a higher capacity through multi-layer feature concatenation, although it requires more GPU memory and training time (Zhang et al., 2021a).
Fig. 5 shows the architecture of the revised MMBT model. The pretrained BERT and DenseNet are used as backbones to extract textual and image features, and these two networks are then fine-tuned within the proposed network. For each building, the text content of its POI data is input into the network according to the JSON file. The pre-trained WordPiece tokeniser used in BERT splits words into subwords and characters, and the pre-trained BERT vocabulary maps the tokens to token-ID sequences. The pretrained BERT model is "bert-base-chinese", which was trained on the Chinese version of Wikipedia. The token sequences are then mapped to token embeddings through the pre-trained token embedding layer.
For each building, its corresponding RS image is input into the model according to its image patch's name and folder. The pre-trained DenseNet (trained on ImageNet) is used to extract image features, capturing the output features after the final pooling operation. Next, these features are projected to vectors with the same dimension as the text token embeddings. Segment embeddings are used to distinguish text token embeddings from image embeddings by assigning a different segment embedding to each modality. Zero-indexed positional coding is used for each segment to record token positions, i.e., counting restarts from 0 for each segment. The token, segment, and position embeddings are then combined and input into the Transformer encoder for the classification task.
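The assembly of the multimodal input sequence (segment ids separating the modalities, positions restarting at 0 per segment) can be sketched as below. The dict layout and toy vectors are illustrative only; in the real model the three embeddings are learned vectors combined before the encoder:

```python
def build_input(image_vecs, text_vecs):
    """Tag each modality vector with a segment id (0: image, 1: text)
    and a 0-indexed position that restarts within each segment."""
    entries = []
    for seg, vecs in ((0, image_vecs), (1, text_vecs)):
        for pos, v in enumerate(vecs):  # positions restart per segment
            entries.append({"seg": seg, "pos": pos, "vec": v})
    return entries

# one projected image feature followed by two text token embeddings
seq = build_input(image_vecs=[[0.1, 0.2]], text_vecs=[[1.0, 0.0], [0.0, 1.0]])
```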

Classification result evaluation
For the multi-class classification task, two F1 scores are commonly used as accuracy evaluation indices: the macro F1 score (MAFS) and the micro F1 score (MIFS) (Santos et al., 2011). Compared to MAFS, MIFS accounts for the uneven number of samples across classes, which is more suitable for our research. In fine-tuning the MMBT model, we used the validation data set to evaluate model performance and adjust parameters.
For each class, precision (Pi) and recall (Ri) were obtained according to Eqs. (3) and (4), respectively. The F1 score (F1i) is the harmonic mean of precision and recall (Eq. (5)), and the MAFS follows Eq. (6), being the average of each category's F1 score.
TPi: true positives; FPi: false positives; FNi: false negatives; n: the number of classes.
To obtain the MIFS, we used the average precision (AP, Eq. (7)) and average recall (AR, Eq. (8)) over all samples of the different classes, resulting in Eq. (9).
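Since Eqs. (3)-(9) are not reproduced in this excerpt, the sketch below assumes the standard definitions: per-class precision and recall from TP/FP/FN counts, macro F1 as the average of per-class F1 scores, and micro F1 computed from the pooled counts:

```python
def macro_micro_f1(tp, fp, fn):
    """tp, fp, fn: per-class counts. Returns (MAFS, MIFS),
    using the harmonic-mean F1 of precision and recall."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    per_class = []
    for t, f_pos, f_neg in zip(tp, fp, fn):
        prec = t / (t + f_pos) if t + f_pos else 0.0
        rec = t / (t + f_neg) if t + f_neg else 0.0
        per_class.append(f1(prec, rec))
    mafs = sum(per_class) / len(per_class)          # macro: average per-class F1
    T, P, N = sum(tp), sum(fp), sum(fn)             # micro: pool the counts
    mifs = f1(T / (T + P), T / (T + N))
    return mafs, mifs
```

Because the micro score pools counts, a large, well-classified class such as "R" pulls MIFS up more than MAFS, which weighs every class equally.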

Building use classification based upon two modalities (Wuhan study area)
Regarding the two types of uncertainty evaluation of the Wuhan data set: after the field audit, the accuracy of our generated data set equals 88.5% and the F1 score 91.9%, indicating that RS and POI can effectively reflect actual building use. Of the 10% relabelled samples, 96.2% were correctly labelled, resulting in a correctness of the generated data set of 96.15%, indicating that the Wuhan data set is reliable.
We used the AdamW optimiser to train the revised MMBT with a learning rate of 0.00003 and a batch size of 16. The default number of training epochs for MMBT is 3, which in this research was set to 8 to ensure the optimal trained model is not missed. Other hyper-parameters were set to their default values. The training progress when using two modalities for classification is shown in Fig. 6. We used the validation data set to evaluate model performance during training. The F1 scores and the confusion matrix of the building use classification results based on two modalities, using the samples shown in Table 2, are given in Table 4 and Fig. 7. Three quarters of the single-use categories achieved an accuracy above 60%, compared with a quarter of the mixed-use categories. For instance, the sample numbers of "A", "BA", and "RBA" are similar, while their accuracy gradually decreases. Categories with a larger number of samples also have a higher accuracy: for example, the categories "R", "B", and "RB" have more samples than the other categories, and their accuracies are also much higher. Misclassified categories also partly reflect their true labels. For example, 38% of "RBA" samples have been misclassified as "RB" and 15% as "B". Also, 28% of "BA" samples have been classified as "B", 6% of "RB" as "R", and 9% of "RB" as "B". Hence, even misclassified mixed-use labels reflect part of the buildings' uses.

Contribution of different data sources
We compared the building use classification results on the test data of Table 2 using both modalities against those based upon one modality only. We used a pretrained Transformer to classify the POI data. For the classification of the RS images, we used a Transformer to classify image features extracted by the pretrained DenseNet. The samples used for the "RS&POI" experiments are identical to those based on RS alone, while the samples used for the POI-only experiments are fewer, because only 76.4% of the samples contain POI data, as indicated in Table 2. Except for the categories "RBA" and "RA", where using only POI data yields higher results than using both modalities, the remaining five categories demonstrate improved F1 scores when both modalities are integrated. Table 2 shows that all samples in the "RBA" and "RA" categories contain POI data, and that the F1 scores based solely on RS are notably low. This suggests a high reliance on POI data for determining these categories, while the inclusion of RS data introduces more noise than useful information. For the "I" category, RS imagery plays a crucial role, since using POI data alone cannot accurately identify this category due to its limited presence: Table 2 shows that only 11.5% (10 samples) contain POI data. By incorporating RS data, the F1 score for the "I" category increases by 18%, indicating the valuable information conveyed even by the absence of POI data. A similar pattern is observed for the "BA" category. This indicates that integrating RS and POI data effectively enhances building use classification.

Building use classification of the Zhengzhou, Xiamen, and Beijing study area
We analysed the generalisation of the proposed method by applying the model trained in Wuhan to the Zhengzhou, Xiamen, and Beijing study areas. We used the model trained in Wuhan, based upon two modalities, to classify buildings in these three cities. We evaluated the results by comparing the classification with the manually labelled reference data.
Table 6 presents the F1 scores for the classification results of the four study areas. The MIFS equals 60.6% for Zhengzhou, 67.2% for Xiamen, and 56.6% for Beijing. The corresponding MAFS values are 54.7% for Zhengzhou, 54.0% for Xiamen, and 52.5% for Beijing. Overall, the classification results for these three cities are lower than for Wuhan. In Wuhan, using both modalities, the categories "R", "B", "A", "BA", and "RB" achieve F1 scores above 50%, with 60% of them representing single-use buildings. Similarly, for Zhengzhou, the categories "R", "B", "I", "BA", and "RA" have F1 scores above 50%, with 60% of them being single-use buildings. For Xiamen, the categories "R", "B", "I", and "RB" have F1 scores above 50%, with 75% of them representing single-use buildings. In Beijing, the categories "R", "B", "A", "RBA", and "RB" achieve F1 scores above 50%, with 60% of them being single-use buildings. This indicates that classifying single-use categories is generally easier than classifying mixed-use categories. Fig. 8 displays the confusion matrices, while the classification maps can be found in Figs. 14-16 in the Appendix.
When transferring the trained model to a new domain, the accuracy usually decreases, since different areas have different characteristics. This can be observed in the MIFS, MAFS and F1 scores of categories such as "R", "B", "A", and "RB" in the three transfer cities, as shown in Table 6. In this research, significant improvements have nevertheless been achieved for the F1 scores of the "RBA" category for Zhengzhou, Xiamen, and Beijing, the "RA" category for Zhengzhou, and the "I" category for Xiamen.
Due to the fixed text length required by the model, the input text is truncated if it exceeds the specified length. The truncation ratios for the different categories in the different cities have been calculated and are presented in Table 7. Categories like "RBA" and "BA" have relatively high truncation ratios. Analysing the confusion matrices in Fig. 7 and Fig. 8 for the four cities, it can be observed that the function "A" has not been correctly identified, resulting in misclassifications into the categories "B" or "RB". This misidentification rate equals 54% in Wuhan, 45% in Zhengzhou, 48% in Xiamen, and 6% in Beijing. Generally, higher truncation ratios correspond to higher rates of misidentification, and lower truncation ratios to lower rates.
Another factor influencing the F1 score of the "RBA" category in Wuhan is that 23% of the samples with the function "R" have not been correctly identified. For the "BA" category, 28% of the Wuhan samples have been misclassified as "B" without correct identification of the function "A". In Zhengzhou, Xiamen, and Beijing, these proportions equal 12%, 22%, and 10%, respectively. The corresponding truncation ratios are 52.8%, 34.9%, 56.3%, and 43.8%, indicating that the truncation ratio can significantly impact the identification of building use.
The truncation ratio of "RA" is low, indicating less influence on the identification of "RA". In Wuhan, 19% of the samples have been classified as "A" without identifying the "R" function, whereas in Zhengzhou this proportion equals 0%. This suggests that identifying the "R" function in mixed-use "RA" buildings is more challenging in Wuhan than in Zhengzhou. Xiamen, being a tourist city situated on an island, has fewer industrial buildings than the other research areas. Our test data set contains only 7 samples of the "I" category from Xiamen. These samples exhibit more pronounced industrial characteristics than those in other cities, but the limited sample size might introduce sampling bias. This could explain why Xiamen has a higher F1 score for the "I" category than the other cities.
Fig. 9 shows the detailed classification of Zhengzhou, confirming that most buildings are correctly classified. For instance, Fig. 9(a) shows objects around a university. Buildings used for education have been correctly classified into the categories "A" and "BA". Four buildings in the living communities are of mixed use "RB" but have been classified as "B", which is only partly correct. In contrast, Fig. 9(b) shows objects in an industrial area. Here, approximately 50% of the buildings belonging to "I" have been misclassified as "R". Fig. 9(c) illustrates a typical living community in China, featuring four gates. The buildings located outside the community serve residential functions along with other uses. According to the classification results, the pretrained model failed to predict the residential function for 13.6% of the buildings, and 24.4% of the buildings were incorrectly labelled as "B". Additionally, buildings categorized as "RBA" were misclassified as "RB". Fig. 9(d) presents the classification results for a multiple-use area. The category "RB" is prone to being misclassified as "B". Furthermore, several buildings were mistakenly labelled as "A". The yellow rectangle in Fig. 9(d) marks a primary school, and in the vicinity of these buildings there are education-related businesses whose names contain education-related words. Consequently, these buildings were mistakenly labelled as "A".
Fig. 10 depicts a portion of the detailed building use classification results for the Xiamen and Beijing research areas. In Xiamen, the majority of the buildings have been accurately classified. Still, two samples of category "RBA" have been misclassified as "RB", and five buildings labelled "BA" have been misclassified as "B", without their "A" function being identified. Additionally, 15 buildings of category "B" have been classified as "RB", implying the presence of a nonexistent function "R". For Beijing, the majority of buildings have also been correctly classified, while eight buildings of category "A" have been misclassified as "R" and two buildings labelled "B" as "RB", thereby assigning a nonexistent function "R"; in addition, two buildings were classified as "BA" instead of "RBA" and one as "A" instead of "RA", thereby missing the "R" function.

Table 9
Statistics of the F1 scores before and after decision fusion.

Comparative experiments
We have compared the proposed method with a state-of-the-art building use classification method: a decision-based multimodal deep learning method (Häberle et al., 2022). This comparative method uses a bi-directional long short-term memory (LSTM) network (Graves et al., 2005) to classify the text information, generating a category probability vector for each text sample. VGG16 (Simonyan and Zisserman, 2014) was used to classify each image and generate its corresponding category probability vector. The decision fusion of the two outputs is shown in Eq. (10), where P_t denotes the category probability vector for the text data, P_i the category probability vector for the image data, and λ the weight given to the text information.
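Assuming Eq. (10) is the usual convex combination of the two probability vectors (a common form for this kind of late fusion; the exact equation is not reproduced in this excerpt), the decision rule can be sketched as follows. The function name and example probabilities are illustrative.

```python
import numpy as np

def decision_fuse(p_text, p_image, lam=0.85):
    """Weighted decision fusion of two per-category probability vectors.

    p_text, p_image: 1-D arrays of class probabilities from the text (POI)
    and image (RS) classifiers; lam is the weight given to the text branch.
    Returns the fused probability vector and the predicted class index.
    """
    p_text = np.asarray(p_text, dtype=float)
    p_image = np.asarray(p_image, dtype=float)
    fused = lam * p_text + (1.0 - lam) * p_image
    return fused, int(np.argmax(fused))

# Toy example: the text branch favours class 1, the image branch class 0.
fused, label = decision_fuse([0.2, 0.7, 0.1], [0.6, 0.3, 0.1], lam=0.85)
# With lam = 0.85 the text branch dominates, so the fused label is class 1.
```

Because the fused vector is a convex combination, it remains a valid probability distribution without renormalisation.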
Table 8 shows the samples that contain both modalities. We used the training data in Table 8 to train the LSTM model, and the training data in Table 2, which contain more image samples, to train the VGG16 network. For both networks, we used the same test data shown in Table 8 to evaluate their performance. The decision fusion results are shown in Table 9 and Fig. 11.
From these experiments we observe the following. 1) Decision fusion, represented by the comparative method, has not effectively improved the detailed building use classification results. Compared to LSTM- and VGG16-based decision fusion, our proposed feature fusion improved the MIFS by 6.2% and 3.0%, respectively. 2) For classification based on a single modality, the MIFS of the POI classification is substantially higher than that of RS. In the decision fusion experiments, the highest MAFS and MIFS occur when assigning the POI classification results a weight of 0.85 and the RS-based results a weight of 0.15. Hence, POI data contribute more to accurate building use classification. 3) With decision fusion, the contribution of RS images to building classification is limited and has not effectively improved the POI-based classification.

Comparison between feature fusion and decision fusion strategy
We used the decision fusion strategy shown in Eq. (10) to fuse the classification results of POI data and aerial images classified by the Transformer network. The same network was used in the decision fusion and feature fusion experiments. Results are shown in Table 10 and Fig. 12.
The MIFS and MAFS of the decision fusion are almost the same as those obtained using POI only, which means that in this decision fusion process, adding the RS classification results has not improved the overall results. The results of the proposed feature fusion method show that feature fusion leads to better classification accuracy, indicating that the relationships between features of different modalities can help improve the classification results.
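The difference between the two strategies can be sketched as follows. The embedding sizes, the number of categories, and the single linear head are illustrative assumptions for exposition, not the actual MMBT architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for one building
# (dimensions are illustrative, not those of the actual network).
poi_feat = rng.standard_normal(64)    # POI/text branch embedding
img_feat = rng.standard_normal(128)   # RS image branch embedding

# Feature fusion: concatenate the embeddings so that one shared classifier
# head can model interactions between the two modalities' features ...
fused = np.concatenate([poi_feat, img_feat])      # shape (192,)
W = 0.01 * rng.standard_normal((8, fused.size))   # head for 8 use categories
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax over categories

# ... whereas decision fusion trains two separate heads, one per modality,
# and only combines their output probability vectors afterwards (Eq. (10)),
# so no cross-modal feature interaction is ever learned.
```

This is why feature fusion can exploit relationships between modalities that a weighted average of final probabilities cannot recover.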
Compared to the decision fusion result in Table 9, the Transformer performs better than the LSTM in every category when dealing with POI data. When classifying RS images, the Transformer model performs better in the categories "R", "B", and "RBA", whereas the VGG16 model excels in "A", "BA", and "RB". As shown in Table 2, the categories "R", "B", and "RB" contain more samples with POI data than the other categories, and the F1 scores of "R" and "B" are higher than those of the other categories, irrespective of whether VGG16 or the Transformer is used. Hence, the MIFS of the Transformer model is 3.4% higher than that of VGG16, while the MAFS of VGG16 is 1.4% higher than that of the Transformer model. Considering the performance on both modalities, we conclude that the Transformer network outperforms the alternative deep learning architectures and is highly effective for building use classification.
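For reference, the MIFS (micro-averaged F1) and MAFS (macro-averaged F1) used throughout these comparisons can be computed from per-class counts as below. This is a generic sketch assuming single-label predictions over the compound use categories; the function name is ours.

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for single-label predictions.

    Micro-F1 pools true positives, false positives, and false negatives
    over all classes (for single-label tasks it equals overall accuracy);
    macro-F1 averages the per-class F1 scores, weighting rare categories
    such as "I" or "RA" the same as frequent ones like "R".
    """
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class_f1 = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class_f1.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class_f1) / len(classes)
    micro = sum(tp.values()) / len(y_true)
    return micro, macro

# Toy labels using the paper's category codes.
micro, macro = micro_macro_f1(["R", "B", "B", "RB"], ["R", "B", "RB", "RB"])
```

The gap between the two averages explains why a model can win on MIFS while losing on MAFS, as observed for the Transformer versus VGG16 on RS images.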

The comparison between revised MMBT and the original MMBT
We conducted a performance comparison between our revised network and the original MMBT network. Both models were trained using the dataset presented in Table 2, utilizing two modalities. Table 11 presents the comparison results. The original network performs better in the categories "B", "BA", and "RA", while performing worse in the remaining five categories, particularly "I" and "A". However, considering the overall results represented by MIFS and MAFS, our revised MMBT model generally outperforms the original one.

The matching problem between POI and building footprints
The performance of data-driven models is highly dependent on the quality and composition of the input data, and the POI data play a significant role in building use classification. To optimize the matching between POI and building footprints, we experimented with different search radii and selected the radius with the highest F1 score as the optimal search radius. The training progress for each radius is illustrated in Figs. 18 to 21 in the Appendix.
Table 12 presents the classification results obtained using different search radii. The category "RA" achieved its optimal F1 score with search radii of 0 m and 2.5 m, although these radii failed to recognize the "I" category. The "B" category attained its highest F1 score with search radii of 2.5 m, 5 m, and 7.5 m. For the categories "RBA" and "BA", the optimal F1 score was achieved with a search radius of 2.5 m. The categories "R", "A", "I", and "RB" obtained their highest F1 scores with a search radius of 5 m. Overall, 62.5% of the categories achieved their optimal F1 score with a 5 m search radius, and this radius also yielded the highest MIFS and MAFS. Consequently, we recommend a search radius of 5 m when matching POI with building footprints in China.
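The radius-based matching can be illustrated with a toy sketch, with coordinates in a projected, metre-based CRS. The footprint, coordinates, and helper functions here are hypothetical; a real pipeline would use a GIS library such as GeoPandas, and would additionally test point-in-polygon containment for POI that fall inside the footprint.

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to segment a-b (all (x, y) tuples)."""
    ax, ay, bx, by, px, py = *a, *b, *p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def match_poi(poi, footprint, radius=5.0):
    """True if the POI lies within `radius` metres of the footprint boundary.

    `footprint` is a list of (x, y) vertices in a metre-based CRS.
    """
    edges = zip(footprint, footprint[1:] + footprint[:1])
    return min(point_segment_dist(poi, a, b) for a, b in edges) <= radius

# Toy footprint (metres): a shop entrance 3 m outside the eastern wall
# matches with the recommended 5 m radius, but not with 2.5 m.
building = [(0, 0), (20, 0), (20, 10), (0, 10)]
```

The 5 m recommendation thus amounts to buffering each footprint by 5 m and assigning every POI that falls inside the buffer to that building.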

Conclusion
In this work, we proposed a new multimodal Transformer-based deep learning method based upon feature fusion for building use classification. We found that by integrating POI and RS data, we can extract detailed mixed-use information. Our proposed method outperforms state-of-the-art methods.
We draw five conclusions. 1) POI and RS images effectively reflect buildings' detailed use information, including mixed-use. 2) Compared to RS images, POI data reflect more functional information, while combining the two modalities provides more information than using a single modality. 3) The proposed feature fusion strategy performs better than the state-of-the-art decision fusion method and increases the classification accuracy. 4) Single-use categories usually achieve a higher accuracy than mixed-use categories. 5) Based upon the four case studies, we hypothesise that the proposed method generalises well to other major Chinese cities. The performance of our deep learning method relies heavily on the number of training samples. The accuracy varies among categories; for example, the classification results of "RBA", "RA", "I", and "BA", which have fewer samples, are inferior to those of categories with sufficient samples. Different application scenarios may have varying requirements for data accuracy, so it is crucial to consider the uncertainty associated with the generated classification results before using them. In future research, we will enlarge our dataset and add new types of data sources, such as street view images, to improve the building use classification results. We will also explore the use of our method in other parts of the world and investigate urban structures and unequal access to services.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Wen Zhou reports financial support was provided by China Scholarship Council.
used RS images to obtain building outlines and POI data to determine building use categories. Chen et al. (2018b) first classified POI and then integrated the POI-based land use results with information from other data sources. Chen et al. (2018b) and Liu et al. (2017) combined the classification of POI with other data features for land use classification. Bao et al. (2020), Feng et al. (2021) and Lu et al. (2022) transformed the classification results of POI data into images and combined these with other image features for land use classification.

Fig. 1 .
Fig. 1. Locations of the four research areas and their RS images. Beijing is a first-tier city, Zhengzhou and Wuhan are new first-tier cities, and Xiamen is a second-tier city.

Fig. 3 .
Fig. 3. Examples of the augmented image datasets. (a) Building outline and its centre point. (b) Captured satellite image. (c) Captured satellite image rotated 90° to the right.

Fig. 6 .
Fig. 6. Training progress when using two modalities for building use classification. (a) Loss value at different iterations. (b) F1 score of the validation data at different iterations.

Fig. 7 .
Fig. 7. Confusion matrix when using two modalities for classification based on the MMBT model. The number of samples in each cell is divided by the total number of samples in its row.

Fig. 8 .
Fig. 8. Confusion matrices of three cities based on two modalities. (a) Result of Zhengzhou. (b) Result of Xiamen. (c) Result of Beijing.

Fig. 9 .
Fig. 9. Detailed building use classification results of Zhengzhou: (a) around a university, (b) around an industrial area, (c) around a living community, (d) around a commercial area. On the left is the RS image, in the middle the classification result, and on the right the reference label.

Fig. 10 .
Fig. 10. Detailed building use classification results of Xiamen and Beijing. (a) Classification result of Xiamen. (b) Classification result of Beijing. On the left is the RS image, in the middle the classification result, and on the right the reference label.

Fig. 11 .
Fig. 11. Results of the state-of-the-art decision fusion method and the proposed feature fusion method. (a) MAFS of the decision fusion and feature fusion based methods. (b) MIFS of the decision fusion and feature fusion based methods.

Fig. 12 .
Fig. 12. Results of the decision fusion strategy and the feature fusion strategy. (a) MAFS of the decision fusion and feature fusion based methods. (b) MIFS of the decision fusion and feature fusion based methods.

Fig. 13 .
Fig. 13. Confusion matrices when using RS and POI data separately for classification based on the MMBT model. (a) Result using RS only. (b) Result using POI only.
W.Zhou et al.

Fig. 17 .
Fig. 17. Training progress of the original MMBT model for building use classification based on two modalities. (a) Loss value at different iterations. (b) F1 score of the validation data at different iterations.

Fig. 18 .
Fig. 18. Building use classification results using a 0 m search radius to match POI data with buildings. (a) Loss value at different iterations. (b) F1 score of the validation data at different iterations.

Fig. 19 .
Fig. 19. Building use classification results using a 2.5 m search radius to match POI data with buildings. (a) Loss value at different iterations. (b) F1 score of the validation data at different iterations.

Table 1
Considered use categories and the statistics of the generated dataset.
R: Residential use; B: Business-related and commercial service facilities use; W: Logistics and warehousing use; A: Public management and public service facilities use; S: Roads and transportation facilities use; I: Industrial use.

Table 2
The statistics of labelled building samples in different sub-sets of Wuhan.

Table 3
Number of labelled samples at the Zhengzhou, Xiamen, and Beijing study areas.

Table 4
Building use classification results of Wuhan based upon two modalities using the samples of Table 2.

Table 5
F1 scores of different categories and their overall results trained on different data sources using the samples of Table 2, based on the Transformer network.
values equal to 62.3%, 55.1%, and 26.7%, respectively. The confusion matrices of the results based on POI only and RS only are reported in the Appendix. The ablation experiments show that POI data contribute more to detailed land use information than RS images.

Table 6
F1 scores of four cities' building use classification results.

Table 7
Truncation ratio of different categories in four cities.

Table 8
Number of buildings that contain both POI data and a satellite image.

Table 10
Statistics of the F1 scores before and after decision fusion.