Using Social Media Images for Building Function Classification

Urban land use on a building instance level is crucial geo-information for many applications, yet difficult to obtain. An intuitive approach to close this gap is predicting building functions from ground level imagery. Social media image platforms contain billions of images, with a large variety of motifs including but not limited to street perspectives. To cope with this variety, this study proposes a filtering pipeline to yield high quality, ground level imagery from large social media image datasets. The pipeline ensures that all resulting images have full and valid geotags with a compass direction to relate image content and spatial objects from maps. We analyze our method on a culturally diverse social media dataset from Flickr with more than 28 million images from 42 cities around the world. The obtained dataset is then evaluated in the context of a three-class building function classification task. The three building classes considered in this study are commercial, residential, and other. Fine-tuned state-of-the-art architectures yield F1-scores of up to 0.51 on the filtered images. Our analysis shows that the performance is highly limited by the quality of the labels obtained from OpenStreetMap, as the metrics increase by 0.2 if only human validated labels are considered. Therefore, we consider these labels to be weak and publish the resulting images from our pipeline together with the buildings they are showing as a weakly labeled dataset.


Introduction
Building function classification is the task of automatically identifying the settlement type of a given building, e.g., is it a residential or commercial building? On the one hand, the task operates on a fine-grained level, i.e., single building level. On the other hand, it is an essential task for urban planning and decision making, especially in big and dynamic cities. Historically, this task was managed manually by cadastral offices. However, this process is resource-intensive, and it could not keep up with the size and the speed of development of modern cities. To cope with this issue, automatic methods are applied, which mainly consume air-view images, such as aerial or satellite images [16,43]. Although this kind of data is of high quality, it has inherent ambiguities from a nadir view looking at rooftops.

Motivation
Social media data, on the other hand, is ubiquitous, cheap, and easy to collect. It has become an essential and valuable source of information for many applications and scenarios [21]. In our task, social media data shows promising features to replace classic air-view sources of data. First, it offers a ground-level view, which means a finer-grained source of data with a different perspective. Second, it is a more up-to-date, or even real-time, source of information. Third, it is a gigantic source of cheap data. The only restriction in our scenario is that we need geo-tagged social media data. Fortunately, this is the case for a considerable share of data coming from social media channels such as Twitter or Flickr. For example, around 1% of all tweets are geo-tagged [37], i.e., given that around 500M tweets are published per day, 5M of them are geo-tagged. Flickr does not disclose its photo statistics in detail, but announces having billions of photos already online. By aligning geo-tagged social media content with open Volunteered Geographical Information (VGI) systems, such as OpenStreetMap (OSM), we could map social media posts (e.g., tweets, images, etc.) to certain places on earth, and, hopefully, to certain buildings. However, one should not take social media as a no-cost source of information. One should be careful when relying on social media as a main source of information, as it is a noisy and uncontrolled source of data. In addition, it is a sparse source: not all spots on earth are equally covered by social media. For example, Flickr photos mainly come from city centers and hotspots.

Related Work
Urban land use classification is a challenging task: no matter on which spatial level it is done, there are inherent ambiguities. On the most fine-grained level, building instances, it is often hard to decide which land use a building represents. Especially in dense urban centers, mixed uses are the most dominant class.
Several studies investigated the feasibility of building facade images to address this problem. There are two main sources for such ground level image data: first, commercial street view data like Google Street View or Mapillary, and second, social media platforms like Facebook, Instagram, or Flickr.
Google Street View in particular is a preferred source for this task, as its data is accessible using an API enabling the user to define position, heading, pitch, and field-of-view. Additionally, Google has its own, standardized hardware to capture street view images and a tailored image processing pipeline to generate high-quality imagery at scale.
In combination with Google Places data, Google Street View data allows fine-grained store classification [29]. This work builds upon the Google Map Maker ontology and a GoogLeNet architecture trained on a global sample of Google Street View images.
Access to Google Places is limited for research outside of Google. Moreover, Google Places focuses on points of interest (POIs) and does not include data about residential buildings. As an alternative, building footprints from OpenStreetMap (OSM) can have semantic data as well, including details about building functions. This information can be used to label buildings shown in Google Street View images and hence provides an additional way to predict land use on a building instance level [20].
The comprehensive coverage of buildings by Google Street View makes it possible to have multiple images from different perspectives for a single building. This data richness can be used in a multi-modal architecture to include information from different sides while obtaining the labels from OSM [38].
Beyond land use classification, information encoded in Google Street View images can be used to infer socioeconomic characteristics [8] or to map urban green in terms of tree detection and positioning [24].
However, the terms of service of Google Street View prohibit scraping, downloading, or storing images obtained using the API. This legal constraint limits the applicability of Google Street View data in research projects and makes it necessary to analyze other sources of data, e.g., social media image platforms. While Facebook and Instagram do not open their data for such purposes, Flickr turned out to be a valuable image source, as it provides an easily accessible API and encourages its users to share photos under a Creative Commons license.
While early works on land use classification with Flickr images used bag-of-visual-words features for classification [26], more recent studies benefited from advancements in computer vision with CNNs and proposed land use classification using a scene stream and an object detection stream in parallel [44].
On a larger spatial scale, Flickr has been used for mapping and understanding landscape aesthetics, either manually [23] or based on CNNs [33,13]. Another field of application is flood level estimation. By formulating this problem as an object detection task with Mask R-CNN, it has been shown that these images help to predict discrete levels of flooding [3].
If social media images are used for a specific application, it is crucial to deal with the huge variation in motifs and scenes. Other data sources with a dedicated purpose but limited spatial extent can be a better option in some cases. For example, images from the Geograph project are captured in a systematic way to cover Great Britain and Ireland. Its aim is to have at least one representative image for every square kilometer on both islands. These images can be used for predicting urban land use in London with object bank features [27,6].
Apart from Flickr, Twitter is another social media data source, providing geo-located information with textual features. Although Twitter restricted its geo-tagging feature in June 2019, it is still a valuable source of geospatial data [21]. To predict building functions, it can be sufficient to have a set of geo-tagged tweets and build a classifier using their metadata [19]. As tweets contain mainly text, the inherent linguistic features have also shown potential to help in urban land use classification on a building instance level [11] as well as on a venue level [40].
Furthermore, geo-located Twitter data reveals patterns in language use and provides insights into socioeconomic factors when related to demographics [1]. When it is used in combination with Flickr data, correlations between socioeconomic factors and park visits show up [12].

Contribution
In this paper, we tackle the problem of building function classification using Flickr images. To make Flickr images useful for our task, they first pass through a strict filtering pipeline that eliminates noisy, irrelevant, and non-geotagged photos. After that, a Convolutional Neural Network (CNN) is fine-tuned for a multi-class classification downstream task. In this study, we mainly consider three classes of buildings from OSM, namely residential, commercial, and other. The main contributions of this paper can be summarized in the following points:
• Building function classification using weakly labeled Flickr images.
• A content-based automatic filtering pipeline to eliminate irrelevant and noisy Flickr photos from large-scale and real-world datasets.
• A human-validated subset of Flickr photos for testing.

Methodology
Our method uses social media images to classify the settlement types of buildings. We follow a content-based approach, which is able to identify the main visual patterns for each class.

Homogenizing OpenStreetMap Building Labels
OpenStreetMap (OSM) is a Volunteered Geographic Information (VGI) platform, meaning that users contribute mapping data in a Wikipedia-like style. OSM provides guidelines on how this data should be structured and semantically enriched, but there is no strict enforcement. Therefore, tags for buildings are optional; only the building footprint coordinates are mandatory when a building is added to OSM. OSM's guidelines specify three different tags that can be added for indicating a building function: building, amenity, and shop.
To summarize the information from all three tags, we use a mapping scheme that assigns each possible value of each tag, as listed in OSM's guidelines, to one of commercial, residential, and other. If more than one of the tags building, amenity, and shop is present, we make sure that they do not disagree. In case of disagreement, the building is not mapped to any of the classes. If there is only one tag or all available ones agree on the same mapped class, then the building gets this class.
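The agreement rule can be illustrated with a minimal Python sketch. The mapping tables below are hypothetical and heavily abbreviated; they are assumptions for illustration only and do not reproduce the mapping scheme used in this study.

```python
from typing import Optional

# Hypothetical, abbreviated mapping tables (assumption). The actual scheme
# covers every tag value listed in OSM's guidelines and may differ.
TAG_TO_CLASS = {
    "building": {"apartments": "residential", "house": "residential",
                 "retail": "commercial", "church": "other"},
    "amenity": {"restaurant": "commercial", "school": "other"},
    "shop": {"bakery": "commercial", "supermarket": "commercial"},
}


def homogenize(tags: dict) -> Optional[str]:
    """Return commercial, residential, or other; None if unmapped or contradictory."""
    classes = set()
    for key, mapping in TAG_TO_CLASS.items():
        value = tags.get(key)
        if value in mapping:
            classes.add(mapping[value])
    # Exactly one distinct class means a single mapped tag or full agreement.
    if len(classes) == 1:
        return classes.pop()
    # No mapped tag at all, or the tags disagree: leave the building unlabeled.
    return None
```

For example, a building tagged as `building=apartments` with `shop=bakery` maps to two different classes under these assumed tables and is therefore left unlabeled.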

Social Media Image Filtering Pipeline
Social media images cover different content and motifs, including but not limited to photography, digital art, and cartoons. However, given a task like building function classification, most of these images are not helpful. For our task, an image must have three features:
1. Shows a building
2. Has a valid geotag
3. Has a known compass orientation
A filtering pipeline needs to identify all images fulfilling these three criteria in a social media image dataset. Additionally, it must scale to datasets with millions of images. Figure 1 shows the pipeline used in this study. It consists of five steps, starting with Google Street View similarity filtering and object detection filtering. These two steps together ensure that the first criterion is met. We validate geotags in the next two steps: first, with a heuristic that discards images whose location is not unique. If there is another image at exactly the same position, it is very likely that the geotag was manually edited. Second, we download the metadata for each remaining image and check if it contains a compass orientation. This serves as a stricter check for the second criterion and ensures the last criterion as well. Finally, we use the geotag including the compass direction for spatial referencing with OSM buildings.

Google Street View Similarity Filtering
This first step is a coarse filtering step aiming at finding images that are potentially helpful for building function classification. Previous studies showed the relevance of façade images to predict building functions [38,20]. Therefore, this step is formulated as an image retrieval problem with a sample of Google Street View images as seed dataset S and a social media dataset D.
Features from deep neural networks are well suited to find structurally similar images. As they aggregate information with every layer, the final layers of a network are an abstract representation of the whole image. For example, the deep features of VGG16 [36] have been successfully applied in different domains for image retrieval [28,42,10,7].
In this study, features are taken from the last hidden layer of a VGG16 network trained on ImageNet [32]. This process yields feature vectors $v \in \mathbb{R}^{4096}$. To assess similarity between pairs of images $i_1, i_2$, the cosine similarity $s_{\cos}$ is calculated based on the feature vectors $v_1, v_2$:

$$s_{\cos}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

For efficient calculation, the features for all images of the seed dataset are calculated beforehand. Then, the features for all social media images are computed batch-wise and we calculate the pair-wise cosine similarity between the batch and the seed dataset. For each social media image $d \in D$ in the batch we save the maximum similarity against all seed images, called the similarity parameter $p_{sim}$:

$$p_{sim}(d) = \max_{s \in S} s_{\cos}(v_d, v_s)$$

A threshold $t_{sim}$ is set as a minimum similarity value and all social media images with $p_{sim} < t_{sim}$ are discarded.
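As a minimal sketch of this step, the following pure-Python code computes the similarity parameter for a batch of feature vectors and applies the threshold. The function names and the dictionary layout are assumptions for illustration; a practical implementation would use batched matrix operations on the 4096-dimensional VGG16 features rather than Python loops.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity of two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen for this sketch (assumption)
    return dot / (norm1 * norm2)


def similarity_parameter(feature, seed_features):
    """Maximum cosine similarity of one image against all seed images."""
    return max(cosine_similarity(feature, s) for s in seed_features)


def filter_by_similarity(batch, seed_features, t_sim):
    """Keep the ids of all images whose similarity parameter reaches t_sim.

    batch: dict mapping image id -> feature vector (hypothetical layout).
    """
    return [image_id for image_id, feature in batch.items()
            if similarity_parameter(feature, seed_features) >= t_sim]
```

Because the seed features are fixed, they only need to be extracted once and can be reused for every batch of social media images.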

Object Detection Filtering
The previous step is a fast check for structural similarity to a given seed dataset, but does not ensure that the social media images actually contain a building façade. Therefore, this step uses an object detection algorithm to find all objects in the images that passed the previous filter.
Applying the object detection algorithm yields a list of objects for each image. If this list contains either a house or a building, the image is a candidate for passing this filter. Each detected object comes with a size relative to the image and a confidence score. Based on these variables, there are two thresholds for adjusting whether a candidate image passes the filter: $t_{size}$ and $t_{score}$. Only if there is a building or a house with a size parameter $p_{size}$ larger than $t_{size}$ and a confidence parameter $p_{score}$ higher than $t_{score}$ is the image passed to the next step.
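The decision rule of this filter can be sketched as follows. The `Detection` record and the label strings are hypothetical; the sketch assumes that the detector output has already been converted into a label, a relative size, and a confidence score per object.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # class name reported by the detector
    size: float   # object size relative to the image, in [0, 1] (assumption)
    score: float  # detection confidence in [0, 1]


# Labels that count as a building facade in this sketch (assumption).
BUILDING_LABELS = {"building", "house"}


def passes_object_filter(detections, t_size, t_score):
    """True if at least one building or house exceeds both thresholds."""
    return any(
        d.label in BUILDING_LABELS and d.size > t_size and d.score > t_score
        for d in detections
    )
```

Note that both conditions must hold for the same detected object; a large but uncertain building plus a small but confident one does not pass.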

Unique Location Filtering
The previous steps confirmed that the image content is relevant for the given task. Now, this step focuses on the geotag of the image. Geotags can be created in two different ways: either automatically by a GPS sensor of the camera or manually by the user.
This filter is a heuristic to identify images that have been manually tagged. If users have to pick locations of images by hand, they tend to do it batch-wise, tagging multiple images at the same place. In contrast, images tagged using a GPS sensor will have small differences in their positions even if the photographer has not moved, because GPS sensors constantly update their location estimate based on how many GPS satellite signals are available. Therefore, having two images with exactly the same position is a strong indicator that their geotag has not been measured by a GPS sensor, but was added manually. In such cases, there is no compass orientation in the EXIF data and hence the image can be omitted for the subsequent step.
More formally, an image $i$ from a set of images $I$ with location $l(i)$ passes this filter if

$$\forall j \in I \setminus \{i\}: \; l(j) \neq l(i)$$

A note on implementation: to make this step computationally efficient, a sequential scan for each image is not feasible. If done naïvely, the geotag of each image needs to be compared with all geotags in the database. A geospatial index decreases the number of necessary checks by excluding geotags that are far away. Using an R-tree [9] allows finding the images in the very close neighborhood, and a subsequent check on true equality is performed only on the geotags of these images.
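The uniqueness criterion itself can be sketched in a few lines of Python. Note that this sketch swaps the R-tree for a plain hash-based count of exact coordinate pairs, which is sufficient for an in-memory illustration of exact equality but is not the spatial index described above; the function name and data layout are assumptions.

```python
from collections import Counter


def unique_location_filter(locations):
    """Keep the ids of images whose exact geotag occurs only once.

    locations: dict mapping image id -> (latitude, longitude) tuple
    (hypothetical layout).
    """
    counts = Counter(locations.values())
    return [image_id for image_id, loc in locations.items()
            if counts[loc] == 1]
```

Counting avoids the quadratic pairwise comparison: every geotag is hashed once and every image is checked once.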

Image Direction Filtering
This step is based on the metadata of images, the so-called EXIF data. EXIF is a standard established by the Camera and Imaging Products Association (CIPA) and the Japan Electronics and Information Technology Industries Association (JEITA) [2]. It defines fields for saving details about images including date and time of capturing, camera model, and camera settings. Moreover, it specifies how data from GPS sensors can be incorporated. This data can be a position given as longitude and latitude as well as a compass direction.
For our pipeline, we assume that the social media database does not contain the original images including the EXIF metadata, but only a downsampled variant without the original EXIF data. Therefore, we download the EXIF data for all images passing the previous filters as an intermediate step. Once all EXIF data are available, this step checks if the tag GPSImgDirection is present and rejects all images that do not have this tag.
Knowing the position where an image was taken is a necessary pre-condition, but a geospatial reference only becomes feasible in combination with the compass direction. Together, both pieces of information allow computing a line-of-sight, which is crucial for the next step.
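A minimal sketch of the tag check is given below. It assumes that the downloaded EXIF metadata has already been parsed into a dictionary keyed by tag name, and that the direction is stored either as a number or as a rational string such as `4500/100`; both the layout and the function name are assumptions, not the format of any particular API.

```python
def compass_direction(exif):
    """Return the image direction in degrees, or None if it is unusable.

    exif: dict mapping EXIF tag name -> raw value (hypothetical layout).
    """
    raw = exif.get("GPSImgDirection")
    if raw is None:
        return None  # tag missing: the image is rejected by this filter
    try:
        if isinstance(raw, str) and "/" in raw:
            # Rational notation "numerator/denominator" (assumption).
            numerator, denominator = raw.split("/", 1)
            value = float(numerator) / float(denominator)
        else:
            value = float(raw)
    except (ValueError, ZeroDivisionError):
        return None
    # Normalize into the interval [0, 360).
    return value % 360.0


def has_direction(exif):
    """Filter predicate: keep only images with a usable compass direction."""
    return compass_direction(exif) is not None
```

Treating unparseable values like missing tags is a design choice of this sketch; it keeps the subsequent line-of-sight step free of special cases.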

OSM Reference Building Filtering
This final step establishes a connection between buildings shown in an image and their representations in OSM. We use the position and the compass orientation to create a line-of-sight. All building polygons intersecting the line-of-sight are possible candidates for the building shown in an image. We select the building closest to the position as the reference building in the picture and store its distance as parameter $p_{dist}$. Based on this parameter we add a fourth threshold $t_{dist}$ to analyze the effect of the distance.
For evaluation, we add another filtering step that discards all images that are assigned to a building without a semantic label.
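The line-of-sight matching described in this section can be sketched as a ray casting problem. The sketch below assumes planar coordinates in meters (e.g., after projecting the geotag and the footprints), measures the distance along the line-of-sight, and scans all buildings instead of using a spatial index; these are simplifying assumptions, and all names are hypothetical.

```python
import math


def ray_hit_distance(origin, heading_deg, polygon):
    """Distance along the ray to the nearest polygon edge, or None.

    origin: (x, y) in meters; heading_deg: compass direction, 0 = north,
    clockwise; polygon: list of (x, y) vertices.
    """
    ox, oy = origin
    theta = math.radians(heading_deg)
    dx, dy = math.sin(theta), math.cos(theta)  # unit direction vector
    best = None
    n = len(polygon)
    for k in range(n):
        ax, ay = polygon[k]
        bx, by = polygon[(k + 1) % n]
        ex, ey = bx - ax, by - ay
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:
            continue  # edge is parallel to the line-of-sight
        aox, aoy = ax - ox, ay - oy
        t = (aox * ey - aoy * ex) / denom  # distance along the ray
        u = (aox * dy - aoy * dx) / denom  # position along the edge
        if t >= 0.0 and 0.0 <= u <= 1.0 and (best is None or t < best):
            best = t
    return best


def reference_building(origin, heading_deg, buildings, t_dist):
    """Return (building id, distance) of the nearest hit within t_dist."""
    hits = []
    for building_id, polygon in buildings.items():
        distance = ray_hit_distance(origin, heading_deg, polygon)
        if distance is not None and distance <= t_dist:
            hits.append((distance, building_id))
    if not hits:
        return None
    distance, building_id = min(hits)
    return building_id, distance
```

If a footprint between the camera and the photographed building is missing from the map, this procedure returns the building behind it, which is one way in which incomplete map data can produce wrong labels.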

Filtering Pipeline Summary
Having the pipeline in this order enables a content-first strategy while keeping the computational effort low. Additionally, the number of hyperparameters is small with four thresholds:
1. minimum seed similarity $t_{sim}$
2. minimum object size $t_{size}$
3. minimum object score $t_{score}$
4. maximum building distance $t_{dist}$

Fine-tuning CNN Architectures for Building Function Classification
To classify buildings shown in the social media images we fine-tune six state-of-the-art CNN architectures (DenseNet [17], InceptionV3 [39], MobileNetV2 [34], ResNetV2 [14], VGG16 [36], Xception [5]). Starting with weights from ImageNet [32], we applied a two-step approach to adapt the models for building function classification [15]. We start with ImageNet models without the classification head and add a dense layer with three outputs to predict the classes of the aforementioned homogenized OSM mapping scheme: commercial, other, and residential. Please note that we fine-tune the models on the Google Street View seed dataset and use social media images only for inference to predict building functions.
As a first step, all layers are frozen and only the new, randomly initialized layer is trained with a learning rate of $lr = 10^{-4}$ for at most 16 epochs. Hence, the new layer is adapted to the existing weights, which avoids the risk of collapsing weights when the full network is trained. To prevent overfitting, a checkpoint mechanism makes sure that after training the model with the lowest validation loss is restored and used for the next step.
After convergence of the newly added layer, the whole model is set as trainable and fine-tuned in a second step with a learning rate of $lr = 10^{-5}$. Again, we apply the checkpoint mechanism and create the final model based on the one with the lowest validation loss during 16 epochs of training.

Human Label Validation
To validate the labels obtained from OpenStreetMap buildings, we asked a group of humans to verify labels given to an image. Given a question if an image contains a commercial/other/residential building, they had to choose between three options: yes, unsure, or no. If no was selected, users were asked for the correct label in their opinion.
As our classification scheme does not include mixed use labels, we asked our voters to opt for unsure if no clear label could be assigned. To make the votes more reliable, our system requested three votes from different humans for each image. Once an image received three votes, it was not shown to any other user again. The users were not restricted in the number of images to vote on.
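One way to aggregate the three votes per image is sketched below. The status names and the tuple layout are assumptions for illustration; the sketch distinguishes full agreement on the existing label, full agreement on a new label, three unsure votes, and everything else.

```python
def resolve_votes(osm_label, votes):
    """Aggregate three votes into a (status, label) pair.

    votes: list of (answer, suggested_label) tuples, where answer is
    "yes", "unsure", or "no" and suggested_label is only set for "no"
    (hypothetical layout).
    """
    answers = [answer for answer, _ in votes]
    if all(a == "yes" for a in answers):
        return ("confirmed", osm_label)
    if all(a == "unsure" for a in answers):
        return ("unsure", None)
    if all(a == "no" for a in answers):
        suggested = {label for _, label in votes}
        if len(suggested) == 1:
            # All voters reject the OSM label and agree on the same new one.
            return ("relabeled", suggested.pop())
    return ("inconsistent", None)
```

Only images with status `confirmed` or `relabeled` carry a label that all three voters support.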

Experiments
We first introduce the two datasets used in this study and continue with describing the results of the different filtering steps. Moreover, we show the results of a Google Street View trained model on filtered Flickr images and dive deeper into the prediction performance by including results from the human validation setup.

Datasets
We evaluate our method using two datasets: First, a sample of Google Street View (GSV) images featuring buildings with known functions, and second, a Flickr image dataset captured in 42 cities distributed globally.

Google Street View Dataset
The Google Street View dataset consists of 43,392 building facade images, distributed among 14,512 commercial, 14,184 other, and 14,696 residential buildings. We apply a Faster R-CNN [31] trained on OID v4 [22] to detect objects on all images and discard all images that do not show a building or house. This combination of architecture and dataset has the best trade-off between accuracy and speed [18]. This yields a refined dataset of 7,698 images (2,743 commercial, 2,333 other, and 2,622 residential).
This Google Street View dataset is used in two ways: first, as a seed dataset for finding structurally similar images in the social media dataset, and second, for fine-tuning state-of-the-art CNN architectures on the given task.

Flickr Social Media Dataset
We collected Flickr image data in 42 cities across the globe to cover different cultures and continents. The images were obtained by querying the Flickr API with small random bounding boxes inside these regions of interest. With this approach we harvested 28,818,438 images.
Table 1 shows the number of images per city. The number of images per city correlates with the user distribution of Flickr, so we see the highest number of images in London (∼4.0M images). Second, there is New York City with ∼2.3M images and third, Los Angeles with ∼1.9M images. Except for Dongying, we found more than 5,000 images in every city. There is evidence that Dongying is a ghost city, meaning that the housing capacity outnumbers the number of inhabitants by far [25].

Filtering Pipeline Results
We evaluate our pipeline end-to-end by analyzing the effects of the four hyperparameters on the F1-score. For an architecture-independent evaluation we calculated the mean probability vectors of all six models for each image. Using mean probability vectors of six models eliminates artefacts from single models and allows more general conclusions. Figure 2 shows the F1-scores and the remaining dataset size as functions of a threshold. Computing the F1-score requires working on the final output of the pipeline with each image being assigned to an individual building. Hence, the complete dataset of 100% is based on 26,381 images, 8,070 labeled as commercial, 9,171 labeled as other, and 9,140 labeled as residential.
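The averaging of model outputs and the per-class F1-score can be sketched as follows. The class order, the function names, and the use of a simple argmax over the mean vector are assumptions for illustration and are not necessarily identical to the evaluation code used in this study.

```python
CLASSES = ("commercial", "other", "residential")  # assumed class order


def ensemble_predict(model_probs):
    """Average the probability vectors of several models for one image.

    model_probs: list of per-model probability vectors, each ordered like
    CLASSES. Returns the predicted class and the mean vector.
    """
    n_models = len(model_probs)
    mean = [sum(p[c] for p in model_probs) / n_models
            for c in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=lambda c: mean[c])
    return CLASSES[best], mean


def f1_score(y_true, y_pred, cls):
    """F1-score of a single class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Sweeping one threshold then amounts to recomputing the F1-scores on the subset of images that still pass the filter at that threshold.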
Our analysis is ordered by the appearance of the hyperparameters in the pipeline. First, there is the similarity threshold $t_{sim}$, which sets how similar a social media image must be to the seed dataset (Figure 2a). Between 0.70 and 0.80 there is little difference in the resulting F1-score: it is almost constant between 0.50 and 0.52. At the same time the dataset size decreases from 100% to 2%. The F1-score shows a first peak at $t_{sim} = 0.83$ with a value of 0.70 and a corresponding dataset size of 0.08% (23 images in total). For thresholds higher than 0.85, F1-scores become unreliable as the number of images is seven or less. Figures 3, 4, and 5 show examples of Flickr images having $p_{sim} = 0.50$, $p_{sim} = 0.75$, and $p_{sim} = 0.90$.
Figure 2b shows how the F1-score is affected by the object detection score $p_{score}$. The figure starts with $t_{score} = 0.30$ because objects with lower scores are not reported by the implementation we used. The F1-score increases slightly, from 0.50 just above $t_{score} = 0.30$ up to 0.63 at $t_{score} = 0.93$. At the same time, the dataset decreases from 100% to 3.6% (this is equal to 930 images). Setting $t_{score} > 0.965$ yields an increase in F1-score to 0.70, but with only 0.2% or 56 images being considered.
The second parameter from the object detection filtering is $t_{size}$, the minimum size of the building or house to be found in an image (Figure 2c). Using a threshold $t_{size} = 0.2$ yields an F1-score of 0.50. Increasing the threshold up to $t_{size} = 0.56$ results in a higher F1-score of 0.52, which is the highest possible F1-score for this parameter.
As a last parameter in the pipeline, there is the distance between a photographer's position and the next building in the compass direction, $p_{dist}$. Figure 2d depicts the F1-score as a function of the distance. Please note that the threshold is an upper limit. Setting $t_{dist} = 0.0$ yields an F1-score of 1.0 based on a single image. Increasing to $t_{dist} = 2.2$ provides a first realistic value of 0.48 calculated on 4.8% of the dataset. Raising the threshold further to $t_{dist} = 40.32$ results in the highest possible F1-score of 0.52. At this point, 13,999 images, 69% of the dataset are included. Higher thresholds lead to a slight decrease of the F1-score down to 0.50 at $t_{dist} = 222$.
Overall, the hyperparameters have little influence on the prediction quality. We see that the F1-score is mostly stable around 0.5. Only for strict thresholds are there some exceptions: e.g., setting $t_{score} = 0.965$ yields an F1-score of 0.70. Adjusting thresholds has more effect on the dataset size: on the one hand, overly strict thresholds yield a low number of images. On the other hand, the thresholds have a significant effect on the runtime of the whole pipeline. The more images pass a filter step at the very beginning, the higher the overall computational time. This tradeoff needs to be taken into account when applying the filtering steps. Table 2 illustrates the number of images remaining after each filtering step when setting $t_{sim} = 0.70$, $t_{score} = 0.3$, $t_{size} = 0.2$, and $t_{dist} = 250$. Additionally, it shows how long it takes to process a single image in a filter step in our setup.
While the exact times will change with different setups, the relative comparison allows an assessment of the effectiveness.
Similarity filtering reduces the number of remaining images to less than 6% of the original dataset at high speed. Discarding all images that do not show a house or a building yields 891,861 images (3.09% of all images in the original dataset). However, this second step on content filtering takes more than 25 times longer than the similarity check.
Making sure that there is not another image from the very same location filters out 743,731 images, which indicates that almost half of all images were manually tagged. Utilizing a spatial index makes this step the fastest of all filtering steps with 0.2 milliseconds; a hundred times faster than the similarity check. Out of the remaining 457,670 images, 88,593 come with a compass orientation. This is 0.31% of the original dataset. As this step requires downloading additional data using the Flickr API, it takes 1.33 seconds. Please note that most of the time is spent on waiting for the next API request to prevent being blocked by the platform (1 second).
Checking if an OSM building footprint is within the line-of-sight kept 73,207 images, and limiting this to labeled OSM buildings gave a final number of 26,381 images. This is 0.09% of the whole dataset. This step again makes use of the spatial index, which results in the second fastest check of all steps.
There can be more than one image per building; especially touristic landmark buildings can be covered by several images. The 6,955 images from our filtering pipeline were mapped to 18,759 buildings. Of these, 5,962 are commercial, 5,138 are other, and 7,659 are residential.

Prediction Results
Table 3 summarizes the performance of all fine-tuned models on an image level. Class-wise they behave similarly, with higher recall values on commercial and other labeled images and a higher precision value for the residential class. One exception to this pattern is VGG16, which has the highest precision score for other. The mean F1-score for commercial is 0.51, which is slightly higher than the F1-scores for other, 0.47, and residential, 0.37.
Residential buildings can appear as single detached houses, townhouses, apartment blocks, or skyscrapers. While the first two forms of residential buildings are easy to predict, the latter ones can be easily confused with the other two classes. This is one possible explanation why we see a high precision for residential buildings, but a lower recall.
All architectures show a similar performance on the social media dataset. With respect to the weighted average, the Densenet121 and VGG models show the best F1-score of 0.47, while the worst model (Resnet50) has an F1-score of 0.45. Hence, the prediction errors are not model specific, but rather a data issue. Two possible explanations for this behaviour are a domain shift when moving from Google Street View images to social media images or data quality issues in the labels. To investigate the effect of OSM labels on the classification performance, we asked humans to confirm or disprove these labels.

Label Verification Results
For this experiment we selected a random subset of 1,500 social media images with OSM labels, 500 from each class, to be validated by humans. As we required three votes for each image, we got a sense of how difficult the task is for a person seeing only the image and the label. Out of 1,500 images, 756 images have full agreement on their label and 744 received inconsistent votes. Full agreement includes three unsure votes as well, so in Figure 6 we focused on the images that received a clear vote, either correct or wrong. Overall, the accuracy of OSM is 69 %, but there are subtle differences between the three classes. Commercial has 63.5 % correct labels, which is the lowest value of all classes. On the other hand, residential images show 72.5 % correctness with other being similar (71.3 %).
To assess the true performance of our models we evaluate our models on the subset of images, which received either full agreement on the existing label or a full agreement on a new label.The right part of Figure 3 shows the F1-score, precision, and recall for this subset.The patterns with respect to precision and recall described above are the same, but all values improved by 0.2.
Hence, our models yield reasonable results if applied to clean data. In this case, the Densenet121 model yields the best F1-score of 0.72, with MobileNetV2 being second. The VGG16 model is among the worst with an F1-score of 0.67, although it ranked first on the filtered dataset. It seems that the Densenet121 model has generalized best on important features for building function classification.
A big source of error is unclear, mislabeled, or mixed-use buildings. Considering that almost half of the images in the human validated subset did not receive a consistent vote from humans, the performance of these models is sensible.

Ambiguity of the Task
Our simple classification schema of commercial, other, and residential works best if buildings have such a clear function. In historically grown city centers, mixed-use buildings are more common, with a retail store on the ground floor and apartments on the upper floors. In these cases our three classes reach their limits and become a source of error. Especially if there is a large sign above the ground floor advertising a retail store, this will likely cause a misclassification. Additionally, the other class is not well defined. As it serves as an alternative if neither commercial nor residential truly fits, there are a lot of different patterns pointing to the same final decision. This makes it hard for a CNN to predict this class.

Missed Images in Filtering Pipeline
Although we sampled our Google Street View image dataset on a global scale, there might still be types of buildings that are not covered. This will result in an image being discarded although it shows a building and contains valuable geographic information. However, in this case the prediction would likely be wrong in the end, as the seed dataset for filtering and the training dataset for fine-tuning are identical. Hence, there would be no benefit in including this image. A possible mitigation could be a more sophisticated sampling algorithm that includes rare building types.
The same applies to the object detection algorithm. If there are building types that were not in the training dataset of OID, images can be filtered out despite showing a building that would be predicted correctly.

Correctness of OpenStreetMap Labels
OSM's primary goal is to provide an open geoinformation service for users to orient themselves, navigate, and find places-of-interest (POIs) for their needs. Therefore, commercial and other buildings, which provide some kind of service for society, are more likely to be mapped than residential buildings, which serve no general purpose. Residential buildings are often bulk mapped, so that certain neighborhoods show a high level of completeness, while others do not have any building footprint at all.
Additionally, building functions may change: what used to be a church becomes an apartment building or is abandoned. The validity of labels depends on the activity of OSM's contributors. Hence, in regions with many active contributors, labels will be more up-to-date than in regions with very few contributors.
Last but not least, in areas with few active local contributors, such as Africa, OSM buildings are mostly mapped by remote users looking at aerial imagery and drawing building polygons accordingly. In such cases, there will most likely be no semantic labels at all.

Completeness of OpenStreetMap Building Footprints
As OpenStreetMap is a VGI platform, the completeness of its building polygons varies a lot. If a building footprint is missing in OSM, our algorithm may assign an image to a building that actually lies behind the one shown in the image.

Reference building calculation
Several images show street view perspectives including more than one building. In such cases, our line-of-sight algorithm determines which building is the reference building of the image. Buildings on the left and on the right are ignored.
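The idea of a line-of-sight assignment can be sketched as a ray cast from the camera position along the compass direction, keeping the first footprint that the ray hits. The sketch below is an illustration under stated assumptions, not the implementation used in the study: coordinates are assumed to be planar (e.g. metres in a local projection), the bearing is measured clockwise from north, all function names are hypothetical, and the default maximum distance of 250 is borrowed from the t dist value in Table 2 purely as an illustrative default (whether t dist plays this role, and in which unit, is an assumption).

```python
import math


def _cross(ax, ay, bx, by):
    return ax * by - ay * bx


def ray_segment_distance(origin, direction, a, b, eps=1e-12):
    """Distance along the ray to segment a-b, or None if there is no hit."""
    dx, dy = direction
    ex, ey = b[0] - a[0], b[1] - a[1]
    denom = _cross(dx, dy, ex, ey)
    if abs(denom) < eps:  # ray is parallel to the segment
        return None
    aox, aoy = a[0] - origin[0], a[1] - origin[1]
    t = _cross(aox, aoy, ex, ey) / denom  # position along the ray
    u = _cross(aox, aoy, dx, dy) / denom  # position along the segment
    return t if t >= 0 and 0 <= u <= 1 else None


def reference_building(camera, bearing_deg, footprints, max_dist=250.0):
    """Return (building_id, distance) of the first footprint hit, or None.

    footprints: dict mapping a building id to a list of (x, y) vertices.
    """
    rad = math.radians(bearing_deg)
    direction = (math.sin(rad), math.cos(rad))  # bearing 0 points north (+y)
    best = None
    for building_id, ring in footprints.items():
        for a, b in zip(ring, ring[1:] + ring[:1]):
            d = ray_segment_distance(camera, direction, a, b)
            if d is not None and d <= max_dist and (best is None or d < best[1]):
                best = (building_id, d)
    return best


# Toy scene: A faces the camera, B stands behind A, C stands off to the side.
footprints = {
    "A": [(-5, 10), (5, 10), (5, 20), (-5, 20)],
    "B": [(-5, 30), (5, 30), (5, 40), (-5, 40)],
    "C": [(20, 10), (30, 10), (30, 20), (20, 20)],
}
hit = reference_building((0, 0), 0.0, footprints)
hit_without_a = reference_building(
    (0, 0), 0.0, {k: v for k, v in footprints.items() if k != "A"}
)
```

In the toy scene the ray selects building A and ignores C, which lies off to the side. Removing A from the footprints makes the ray fall through to B, which reproduces the failure mode caused by missing OSM footprints.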

Conclusion and Outlook
In this study we propose a content-first filtering pipeline for social media image datasets to extract geo-spatial information on building functions. By applying five filtering steps, we are able to find relevant images with valid metadata for the given task and relate them to buildings within the line-of-sight. The order of the filter steps ensures scalability on large image databases. Moreover, our pipeline has only four hyperparameters for balancing runtime and the number of images yielded, without strong influence on the final prediction results. Based on human validation of our image labels from OSM, we show that the limiting factor for performance is the data quality of the OSM labels rather than the models used for prediction. The resulting image dataset with corresponding OSM building IDs and labels is published as a benchmark dataset for urban land use using social media images and weak labels from OSM. Additionally, we provide the human-validated subset with high-quality labels based on three independent votes.
Our pipeline still has many opportunities for refinement. While the cosine similarity measure against a seed dataset ensures fast processing, this image retrieval task can be enhanced with more sophisticated algorithms that take different aspects of an image into account [4]. One of the most rigid filtering steps is discarding all images without a compass orientation. Recent approaches that estimate the compass orientation based on aerial imagery could help to close this gap [41,30,35]. As a last step we relate the image to a building using a line-of-sight. Fortunately, the EXIF metadata contains the focal length, which opens the possibility of calculating all buildings that are within the field of view. Based on that, an image could be separated into patches with the different buildings found during the object detection step. This could yield predictions for many buildings from one image. Moreover, our classification scheme with commercial, residential, and other focuses on the most crucial classes for population estimation and disaster management. A more fine-grained, multi-level scheme could provide more insights into urban development, e.g., education, transportation, health care. Another possible direction could be introducing multi-labels to take mixed-use buildings into account.
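The field-of-view idea mentioned above can be illustrated with the pinhole relation FOV = 2 * atan(w / (2 * f)), where f is the focal length and w the sensor width. The sketch below is a hedged illustration of this outlook, not part of the presented pipeline: the sensor width is not stated in the text, so the default of 36 mm assumes a 35 mm-equivalent focal length, coordinates are assumed to be planar, and all function names are hypothetical.

```python
import math


def horizontal_fov_deg(focal_length_mm, sensor_width_mm=36.0):
    """Horizontal field of view of a pinhole camera in degrees."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))


def bearing_deg(camera, target):
    """Bearing from camera to target, clockwise from north, in [0, 360)."""
    dx, dy = target[0] - camera[0], target[1] - camera[1]
    return math.degrees(math.atan2(dx, dy)) % 360


def in_field_of_view(camera, compass_deg, target, fov_deg):
    """True if the target lies within half the field of view of the compass direction."""
    diff = (bearing_deg(camera, target) - compass_deg + 180) % 360 - 180
    return abs(diff) <= fov_deg / 2


# Assumed 18 mm (35 mm-equivalent) lens looking north from the origin.
fov = horizontal_fov_deg(18.0)
visible = in_field_of_view((0, 0), 0.0, (10, 20), fov)
not_visible = in_field_of_view((0, 0), 0.0, (30, 10), fov)
```

Testing the centroid of every nearby footprint with `in_field_of_view` would give a candidate list of buildings per image, which could then be matched to the detected building patches.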

Figure 1: Filter pipeline for extracting Street View-like images from Flickr image database

Figure 2: Effect of filtering pipeline parameters on prediction results and remaining dataset size

Figure 6: Results from human validation of OSM labels as confusion matrix with only full agreement of human voters

Table 1: Number of Flickr images per city

Table 2: Number of images remaining after each filtering step when using t_sim = 0.70, t_size = 0.2, t_score = 0.3, and t_dist = 250. Execution time is given per image sample in seconds.

Table 3: Prediction results of fine-tuned Google Street View models on filtered Flickr images and on the human-validated subset. Class labels are abbreviated as highlighted in bold: Commercial, Other, Residential. Avg stands for the weighted average based on the number of samples.