Tales of a City: Sentiment Analysis of Urban Green Space in Dublin

Social media services such as TripAdvisor and Foursquare give users opportunities to exchange opinions about urban green space (UGS). Visitors share their experiences of parks, woods, and wetlands with online communities through these networks. In this work, we implement a unified topic modeling approach to reveal UGS characteristics. Applying Artificial Intelligence techniques to mine opinions from reviews on these platforms is a novel approach to UGS quality assessment. We show how specific characteristics of different green spaces can be explored by using a tailor-optimized sentiment analysis model. Such an application can support local authorities and stakeholders in understanding, and justifying, future urban green space investments.


Introduction
The value of urban green space (UGS) is underestimated because quality assessments are lacking: there is little information about what quality green space entails and how existing spaces within a city score on important social quality parameters [1]. Measuring the quality of urban green space is a tedious process. Observational approaches are often criticized since they require extensive repeat measurements at the same location, incurring considerable time and cost expenditures. Moreover, collected data quickly become outdated, which puts progress out of reach [2]. Since green space can be a powerful ecological instrument and a possible way to improve public health, city planners, urban designers, policymakers, and estate managers must maximize opportunities to apply greening strategies. Where increasing the amount of urban greenery is unattainable, the focus should be on improving the quality of urban nature. But how can a city ensure that it provides safe, inclusive, and high-quality green space for all? Several attempts have been made to streamline and standardize quality assessments of urban green space [3,4]. Still, most proposed methods fall short, as they require expensive, tedious, and time-consuming field surveys and in-situ observations. AI has made its way into different areas, e.g., engineering, urban planning, management, and manufacturing, to solve various problems [5,6,7]. Sentiment analysis [13] can offer a way to make evidence-based improvements for urban green space. Utilizing social media platforms such as TripAdvisor to mine people's green space opinions can open up a treasure trove of global, comparative data. To date, no other crowdsourced data has been widespread enough to allow for such comparison between green spaces, both within a city and beyond. Thus far, sentiment analysis using online platforms as a data source has only been applied in the hospitality and tourism sectors. Here, shallow NLP techniques are applied to extract sentiment automatically.
This paper uses two social media platforms to address this need and translate their reviews about urban green space into relevant, easily understandable insights for cities. The goal is to "mine" the reviews for opinions and extract the sentiment behind them by leveraging Artificial Intelligence (AI) [8,9,10] and Natural Language Processing (NLP) [11,12]. Such quantification can provide city planners, policymakers, and landscape architects with an invaluable tool for understanding and justifying future investment in urban green space [14,15,16].
In this work, UGS in Dublin, Ireland, is analyzed by capturing and exploring public opinion. A tailor-optimized topic modelling method is implemented to classify UGS reviews and discover latent themes automatically. We discuss how social media platforms (e.g., Foursquare and TripAdvisor) and AI-based techniques can be utilized to assess the quality of urban green space cost-effectively. This paper presents the results of an experiment that uses NLP to extract citizen opinion on UGS quality, a novel application of automatic text mining and sentiment analysis. We show how text mining and sentiment analysis can help authorities understand how people value and use urban green spaces. Such intelligent approaches can help policymakers understand the needs and problems associated with UGS and allow them to make better decisions.

Urban Green Space in Dublin
UGS such as parks, woods, and wetlands represent a fundamental component of Dublin's urban ecosystem. They have become an extension of, or in many cases a replacement for, the traditional backyard, meaning more people are sharing less green space. Dublin City is divided into different postal districts, illustrated in Fig. 1. Some of them (the odd numbers) are located on the North-side of the River Liffey, and the rest (the even numbers) are situated on the South-side. There are various green areas, i.e., parks, scattered throughout all these districts.
Social network platforms enable us to utilize the abundance of open-source reviews. However, Dublin follows a similar pattern as other cities, where the most popular parks have a significant number of reviews, and the less popular parks have significantly fewer. Therefore, we consider the most popular parks as a proxy for UGS. As discussed, this paper aims to utilize social network platforms and AI techniques to quantify UGS quality across Dublin based on a hybrid model. First, a location intelligence-driven technology (i.e., Foursquare API) is employed to find parks across Dublin, and the popularity of each park is assessed. This popularity measure is aggregated over the postal districts illustrated in Fig. 1. Then, associated reviews of all popular parks are extracted from the TripAdvisor platform. Finally, different opinion clusters are extracted to identify related characteristics of parks. Relevant details are explained next.
arXiv:2107.06041v1 [cs.SI] 13 Jul 2021

Extracting Dublin's parks data
The Foursquare API provides access to various capabilities, such as location search and venue details. The API has two different types of endpoints, i.e., regular and premium, each providing access to different information. It uses two forms of authentication, i.e., Userless Auth and User Auth. The former is used for server-side applications, while the latter uses OAuth 2.0 to provide authorized access to the API. In this work, we have used the Userless Auth authentication form. This authentication type requires the consumer key's Client ID and Secret in the request URL instead of an access token. Various information, such as location data, i.e., street address, postal code, longitude, latitude, and distance, can be fetched. We found around 60 parks across Dublin (Fig. 2) and aggregated them at the postal district level. Some examples of such distribution are presented in Table 1. The API also provides access to information about each venue's level of popularity, i.e., the count of users who have liked a venue. We used this capability to assess the popularity of all parks detected. Each observation (i.e., park) includes a unique string identifier, the postal district it belongs to, and the number of likes. The pseudo-code for the procedure is presented in Algorithm 1. Given the procedures described in the algorithm, the number of likes for each park is revealed (
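The request construction and district-level aggregation described in this section can be sketched as follows. This is a minimal illustration rather than the procedure of Algorithm 1: the endpoint and query parameter names are assumed from the legacy Foursquare Places API (v2) and should be checked against the current documentation, the credentials are placeholders, and the park records are invented for the example. No network call is made.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Assumed endpoint of the legacy Foursquare Places API (v2); verify before use.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, near="Dublin, Ireland",
                     query="park", version="20210101", limit=50):
    """Build a userless-auth request URL: the Client ID and Secret travel
    as query parameters instead of an OAuth access token."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,   # API version date in YYYYMMDD form
        "near": near,
        "query": query,
        "limit": limit,
    }
    return SEARCH_URL + "?" + urlencode(params)


def likes_per_district(parks):
    """Sum the like counts of all parks within each postal district."""
    totals = defaultdict(int)
    for park in parks:
        totals[park["district"]] += park["likes"]
    return dict(totals)


# Hypothetical observations: identifier, postal district, number of likes.
sample_parks = [
    {"id": "park-a", "district": "Dublin 2", "likes": 120},
    {"id": "park-b", "district": "Dublin 2", "likes": 30},
    {"id": "park-c", "district": "Dublin 8", "likes": 45},
]

url = build_search_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
totals = likes_per_district(sample_parks)
```

Keeping the URL builder separate from the aggregation step means the popularity ranking can be tested on stored records without contacting the API.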

Extracting Reviews
TripAdvisor reviews for the 8 popular parks detected earlier were scraped using Selenium and Python. The pseudocode for retrieving data has been presented in our previous work [13]. The reviews were subsequently filtered to retain English texts only. The collected reviews cover the period from May 2006 to December 2020. The following review fields were extracted: review-title (the written title of the review); review-body (the written review about the destination); rate-value (1 is the lowest evaluation, 5 is the best); review-location (where a reviewer is from); and review-date (the date the review was written). Some observations are presented in Table 3.
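One possible in-memory representation of a scraped review, together with a filter for the collection period, is sketched below. The class name, the helper `in_study_window`, and the two sample records are hypothetical; only the five field names and the May 2006 to December 2020 window come from the text. The scraping step itself and the English-language filter are not shown.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Review:
    review_title: str     # written title of the review
    review_body: str      # written review about the destination
    rate_value: int       # 1 (lowest evaluation) to 5 (best)
    review_location: str  # where the reviewer is from
    review_date: date     # date the review was written


def in_study_window(review, start=date(2006, 5, 1), end=date(2020, 12, 31)):
    """Keep reviews inside the collection period that carry a valid rating."""
    return start <= review.review_date <= end and 1 <= review.rate_value <= 5


# Invented records for illustration only.
reviews = [
    Review("Lovely walk", "Quiet paths and a duck pond.", 5,
           "Cork, Ireland", date(2019, 7, 14)),
    Review("Too early", "Written before the collection period.", 4,
           "Leeds, UK", date(2005, 3, 2)),
]
kept = [r for r in reviews if in_study_window(r)]
```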

Topic Modeling
When analyzing text data, we need to perform data cleansing operations to ensure the dataset's quality. Text normalization was used to convert text into more convenient, standard forms. Tokenisation was used to separate words from running text. Tokenised words were then converted into a numeric representation, a process known as vectorization. All punctuation and stop words were omitted. To that end, libraries from the Natural Language Toolkit were used. Each review text was converted into a list of words. A stemming algorithm was also used to reduce inflected words to their word stem by suffix stripping. All URLs, email addresses, extra spaces, and emojis that existed in reviews were excluded. Bigrams and trigrams were created, and lemmatization was performed.
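The cleansing steps above can be illustrated with a small standard-library sketch. It is a stand-in, not the pipeline used in this work: the stop-word list is a tiny illustrative subset, `strip_suffix` is a deliberately crude replacement for a real stemmer such as those in the Natural Language Toolkit, lemmatization is omitted, and all function names are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "were",
              "to", "of", "in", "for", "at", "it", "or", "me"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def strip_suffix(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Normalize, tokenise, remove noise and stop words, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Keeping only alphabetic runs drops punctuation, digits, and emojis.
    tokens = TOKEN_RE.findall(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]


def add_bigrams(tokens):
    """Append every adjacent word pair as a single joined token."""
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]


sample = "The ducks were LOVELY!! 🦆 See https://example.com or mail me at a@b.ie"
tokens = preprocess(sample)
```

In practice, bigrams are usually kept only when a pair co-occurs often enough; the unconditional version here is just the simplest form of the idea.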
The aim of this phase is to implement an optimized method that automatically extracts the topics people discuss about the most popular parks detected in the prior phase.
The implemented approach is a variant of a Latent Dirichlet Allocation method tailored for this study to detect underlying topics [17]. The contribution of each topic was also evaluated. This unsupervised learning approach enables us to discover hidden semantic structures in review comments. The method takes a collection of reviews and a few parameters as the input. It generates a probabilistic model so that the extent to which each review belongs to a given topic is revealed. The topics produced by the model are clusters of similar words in reviews. The model assumes that reviews can be represented as a probabilistic distribution over latent topics. Moreover, the association among all topics is analyzed. To infer the latent structure, we should represent reviews such that they can be manipulated mathematically.
To do so, we represent each comment as a vector of features. A dictionary-based text categorization method was performed, a unique id for each word was created, and the output was converted to a bag-of-words representation. In this way, the frequency counts of each word were determined. The vectorized corpus was then transformed using a TF-IDF model, which maps vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the rarity of each word across the corpus.
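The dictionary, bag-of-words, and TF-IDF steps can be made concrete with the following sketch. It uses a textbook weighting, count times log(N / df), which is an assumption: library implementations typically differ in the logarithm base and in normalization, so the numbers are illustrative only. The three toy reviews and all function names are invented for the example.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Assign a unique integer id to every distinct word."""
    vocab = sorted({word for doc in docs for word in doc})
    return {word: idx for idx, word in enumerate(vocab)}


def to_bow(doc, dictionary):
    """Bag-of-words: sorted (word_id, frequency) pairs for one review."""
    counts = Counter(dictionary[word] for word in doc)
    return sorted(counts.items())


def tfidf(bows, num_terms):
    """Weight each count by log(N / df), so rarer words score higher."""
    n_docs = len(bows)
    df = [0] * num_terms
    for bow in bows:
        for word_id, _ in bow:
            df[word_id] += 1
    return [[(w, c * math.log(n_docs / df[w])) for w, c in bow]
            for bow in bows]


# Toy tokenised reviews.
docs = [["park", "duck", "duck"], ["park", "picnic"], ["park", "flower"]]
dictionary = build_dictionary(docs)
bows = [to_bow(doc, dictionary) for doc in docs]
weights = tfidf(bows, len(dictionary))
```

In this toy corpus "park" occurs in every review and therefore receives weight zero, while "duck" keeps a positive weight, which is the down-weighting of common words described above.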
The extracted review comments can be defined as a sequence of texts, i.e., $R = \{r_1, r_2, \ldots, r_n\}$, where $r_i$ refers to the $i$th review. Let $T = \{t_1, t_2, \ldots, t_K\}$ be the set of topic models. We can define a log-likelihood objective function as

$$L(R, T) = \sum_{i=1}^{n} \log p(r_i \mid t_{c(r_i)}),$$

where $c(r_i) = \arg\max_{j} \log p(r_i \mid t_j)$ is the topic identity of review $r_i$. The algorithm should maximize the defined objective function. A topic (cluster) was considered as a probability mass function over all the words: each word has a probability of occurring between 0 and 1, and the sum of these probabilities is 1. Moreover, each word was assigned an individual probability of occurring given a particular topic. Let $\Theta_{t,w}$ be the probability of a topic $t$ generating a word $w$. Let $\Phi_{r,t}$ be the probability that review $r$ will generate a word from a specific cluster $t$. The values were generated by random variables, e.g., $\Theta_{t,w}$, with a Dirichlet distribution. Two different Dirichlet distribution functions were implemented, i.e., representing 1) how many words belong to clusters and 2) how associated clusters are with review comments. A Dirichlet distribution is a continuous multivariate density function with 2 parameters, i.e., the number of topics ($K \geq 2$) and concentration parameters ($\alpha = \alpha_1, \alpha_2, \ldots, \alpha_K$) [18]. It outputs probabilities which sum up to 1, $\sum_{i=1}^{K} x_i = 1$. The Dirichlet probability density function is

$$f(x_1, \ldots, x_K; \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},$$

where $\alpha_0 = \sum_{i=1}^{K} \alpha_i$. A symmetric Dirichlet distribution was also tested, meaning each parameter has the same value. There is one more parameter, $\beta$, which has to be set properly.
Such parameter assignment will be explained in the result section. Let $z_i$ be the topics of the $i$th review, which is a vector $N_i$ words long. The probability of a cluster generating a word $w_{i,j}$ at a position $j$ is

$$p(w_{i,j} \mid z_{i,j}, \Theta) = \Theta_{z_{i,j},\, w_{i,j}}.$$

The goal is to maximize the probability $p(\Theta, \Phi \mid W, Z, \alpha, \beta)$, where $W$ denotes the observed words and $Z$ their topic assignments.
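To make the role of the concentration parameters concrete, the sketch below draws samples from a symmetric Dirichlet distribution and evaluates the standard Dirichlet log-density. It relies on the well-known construction of a Dirichlet sample from normalized Gamma draws; the function names, the seed, and the concentration values 0.1 and 10 are illustrative assumptions, not settings used in this work.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_logpdf(x, alphas):
    """Log of the Dirichlet density at a point x on the probability simplex."""
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    return log_norm + sum((a - 1) * math.log(xi) for a, xi in zip(alphas, x))


rng = random.Random(0)
K = 5  # number of topics

# Small concentration tends to put most of the mass on a few topics;
# large concentration tends to spread it almost evenly.
sparse = sample_dirichlet([0.1] * K, rng)
flat = sample_dirichlet([10.0] * K, rng)

# With all parameters equal to 1 the density is constant on the simplex.
uniform_logp = dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```

Every sample sums to 1, so it can be read directly as a topic distribution for a review or as a word distribution for a topic.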

Result
The implemented unsupervised learning method can help us discover hidden semantic structures. However, the number of topics must be determined in advance, and finding an optimal number of topics is challenging. To address this concern, perplexity and topic coherence were analyzed (Table 4) [19]. We considered our resultant topics to be sparse. Topic coherence was measured based on pointwise mutual information (PMI) by calculating word association between all pairs of words in all clusters:

$$C_{\mathrm{PMI}} = \frac{1}{N} \sum_{i<j} \mathrm{PMI}(w_i, w_j), \qquad \mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)},$$

where $N$ is the number of PMI scores over the set of distinct word pairs. The conditional log-likelihood of co-occurrence of top topic word pairs was also taken into account:

$$C_{\mathrm{LL}} = \sum_{i=2}^{M} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_j)},$$

where $M$ is the number of top words in a cluster and $D(\cdot)$ counts the reviews that contain the given word or word pair. Hyper-parameters (i.e., $\alpha$ and $\beta$) affect the sparsity of topics and should be tuned properly. If a high value is assigned to the $\alpha$ parameter, various topics are generated. Hence, this value was set to a fraction of the number of topics. $\beta$ plays a role in the sparsity of words in the topics, and a high value means a lower impact of word sparsity. A low value means the topics should be more specific [20,21]. Given the discussion, the $\beta$ parameter was set to 0.01. We selected the values such that the maximum coherence score and minimum perplexity were obtained for 5 clusters (Table 4).
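The two coherence measures discussed above, the average PMI over distinct word pairs and the conditional log-likelihood of co-occurrence of top words, can be sketched on a toy corpus as follows. Probabilities are estimated from review-level co-occurrence, pairs that never co-occur are skipped in the PMI average, and the log-likelihood variant adds 1 to the co-occurrence count; these are assumptions chosen to keep the example well defined and may differ from the exact variants used in this work. The toy reviews and function names are invented.

```python
import math
from itertools import combinations


def doc_freq(words, docs):
    """Number of reviews that contain every word in `words`."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def pmi_coherence(top_words, docs):
    """Mean PMI over distinct pairs of top words that co-occur at least once."""
    n_docs = len(docs)
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i = doc_freq([wi], docs) / n_docs
        p_j = doc_freq([wj], docs) / n_docs
        p_ij = doc_freq([wi, wj], docs) / n_docs
        if p_ij > 0:
            scores.append(math.log(p_ij / (p_i * p_j)))
    return sum(scores) / len(scores) if scores else 0.0


def loglik_coherence(top_words, docs):
    """Sum of log((D(wi, wj) + 1) / D(wj)) over ordered top-word pairs.
    Every top word is assumed to occur in at least one review."""
    total = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co = doc_freq([top_words[i], top_words[j]], docs)
            total += math.log((co + 1) / doc_freq([top_words[j]], docs))
    return total


# Toy reviews represented as sets of tokens.
docs = [{"duck", "swan", "pond"}, {"duck", "swan"},
        {"picnic", "lunch"}, {"duck", "picnic"}]
pmi_score = pmi_coherence(["duck", "swan"], docs)
ll_score = loglik_coherence(["duck", "swan"], docs)
```

Scores of this kind can be computed for each candidate number of topics and compared alongside perplexity when choosing the final model.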
It should be mentioned that a separate dataset was collected for each of the most popular parks and the model was tuned separately on each. Extracted topics are a combination of keywords, each of which contributes a certain weight to a given topic. Table 5 reveals the importance (weights) of different keywords in the topics extracted from St. Stephen's Green reviews. Given the extracted keywords, Topic 1 includes words like "shopping", "centre", "area", "middle", and "hustle", indicating a location-related topic. Topic 2 includes words like "maintained" and "kept", which suggests a management-related topic. Topic 3 seems to be related to the historical background of the place. It includes words like "history" and "trinity" (Trinity College). Topic 4 consists of words like "swans", "ducks", and "flowers", a wildlife and plant life-related topic. Topic 5 involves words such as "relax", "picnic", and "lunch", indicating a recreation-related topic. Fig. 4 reveals the keywords associated with the other most popular green spaces in Dublin. It should be mentioned that the most popular parks have a significant number of reviews, and the less popular parks have fewer reviews. Therefore, the proposed method was used on the most popular parks as a proxy for UGS.
The results indicate that user-generated content has the potential to drive insights about green space use and people's preferences towards green spaces. Such insights can be valuable for improving existing solutions and planning sustainable and socially equal cities.

Conclusions and Future Work
Research to inform both policy and design of UGS is critical to protect these vulnerable areas while simultaneously ensuring access to the potential health and well-being benefits these spaces provide. Green spaces play a pivotal role across all aspects of city life, and as cities densify, the importance of accurately and effectively measuring the quality of UGS has never been greater.
Questionnaire surveys have been employed as a social science research tool to study people's preferences and their relationship with UGS. However, since data collection is an expensive and burdensome task, such studies are limited. To alleviate this concern, new digital data sources can be used. This work presents a novel approach to UGS management using two online platforms, i.e., Foursquare and TripAdvisor. The former was used to detect the most popular parks, and the latter was utilized to extract reviews associated with those parks identified. The relationship among the topics was also taken into account. Different validity tests were conducted to select optimal parameters. The proposed model can assess the quality of different urban environments cost-effectively,