Social media data for conservation science: A methodological overview

Improved understanding of human-nature interactions is crucial to conservation science and practice, but collecting relevant data remains challenging. Recently, social media have become an increasingly important source of information on human-nature interactions. However, the use of advanced methods for analysing social media is still limited, and social media data are not used to their full potential. In this article, we present available sources of social media data and approaches to mining and analysing these data for conservation science. Specifically, we (i) describe what kind of relevant information can be retrieved from social media platforms, (ii) provide a detailed overview of advanced methods for spatio-temporal, content and network analyses, (iii) exemplify the potential of these approaches for real-world conservation challenges, and (iv) discuss the limitations of social media data analysis in conservation science. Combined with other data sources and carefully considering the biases and ethical issues, social media data can provide a complementary and cost-efficient information source for addressing the grand challenges of biodiversity conservation in the Anthropocene epoch.


Introduction
Human activities are the main drivers of the ongoing rapid worldwide loss of biodiversity (Maxwell et al., 2016). Understanding human-nature interactions is crucial for finding successful conservation solutions that help address the biodiversity crisis and support the wellbeing of people (Bennett et al., 2017; Venter et al., 2016). Collecting data on human-nature interactions such as protected area visitation or resource extraction is, however, time-consuming and requires more resources than are usually available (Waldron et al., 2013). There is a need for new, efficient ways of collecting relevant information on people, nature and their interactions.
The ongoing information age is characterized by an increasing volume of data generated by user activities in virtual networks (Castells, 2010). Big Data, i.e., the massive quantities of digital information available, provide new research avenues in various fields of science (boyd and Crawford, 2012; Kitchin, 2014; Ruths and Pfeffer, 2014). Digital conservation (Arts et al., 2015; Ladle et al., 2016) is a sub-field of conservation science that uses novel data sources such as social media data and other large data sets to understand and potentially mitigate the biodiversity crisis. User-generated big data may offer cost-efficient ways for biodiversity monitoring (Hampton et al., 2013), but more importantly, they allow studying human-nature interactions on an unprecedented scale (Ruths and Pfeffer, 2014).
Among other sources of big data, social media provide a rich source for studying people's activities in nature and understanding conservation debates or discussions online. Social media refers to "web-based services that allow individuals, communities and organizations to collaborate, connect, interact, and build a community by enabling them to create, co-create, modify, share, and engage with user-generated content that is easily accessible" (McCay-Peet and Quan-Haase, 2017). In this article, we focus particularly on data from social networking sites, microblogs, and media sharing services, which support textual and visual content and geotagging. These sites include social media platforms such as Facebook, Twitter, Instagram, Flickr, and Weibo, where users share content either in private networks or publicly online.
Information gathered from social media provides new approaches to studying visitation patterns in conservation areas (Levin et al., 2015; Tenkanen et al., 2017; Wood et al., 2013), preferences and activities of protected area visitors, and mapping cultural ecosystem services (Gliozzo et al., 2016; Richards and Friess, 2015; van Zanten et al.,

Previous research using social media data
We carried out a systematic literature review to understand how social media data have been used in conservation science. This review updates the previous review by Di  and provides an overview of current research and existing gaps in using social media data in conservation science. For details on how the literature review was carried out, see Supplement S1. Results of the literature review are summarized in Table 1. The search and selection of the literature resulted in 35 published journal articles, which we classified into three thematic categories, namely, "people in nature", "biodiversity monitoring" and "online discussions".
"Biodiversity monitoring" articles (n = 7) focussed on obtaining information about real-world species observations. These studies use different combinations of social media geotags, as well as text, image, and video content, to retrieve information about species observations, often focussing on a single species and using manual data access methods (Campbell and Engelbrecht, 2018; Di Camillo et al., 2018; Dylewski et al., 2017; Havlin et al., 2018; Rocha et al., 2017; Schuette et al., 2018).
"Online discussions" articles (n = 9) focussed on understanding the discussions around and diffusion of nature-based content on the internet. Based on text analysis related to the topic or species of interest, the studies analysed media coverage of conservation-related events and news (Hawkins and Silver, 2017; Macdonald et al., 2016; Papworth et al., 2015), online reactions and discussions in relation to events and management actions (Ebeling-Schuld and Darimont, 2017; Greer et al., 2017; Lunstrum, 2017; Wu et al., 2018), and species-specific information in online media (Jarić et al., 2016; Willemen et al., 2015).

Characteristics of different social media platforms
Social media allow people to share and exchange content in online networks (Kaplan and Haenlein, 2010). Various social media platforms enable users to share posts containing text, images and video, and users can like, share and comment on each other's posts, forming a network of users and content. Personal profiles and posts can be either private or public, depending on the platform and the user's settings (Lange, 2007). Many social media platforms allow users to geotag their posts, which makes social media data analogous to other types of geographic information (Sui and Goodchild, 2011). While 'social media' is a broad concept (Kaplan and Haenlein, 2010), here we focus specifically on social networking sites and content communities such as Facebook, Twitter, Instagram, Flickr, and Weibo, which are likely to contain relevant information for studying human-nature interactions. Facebook is by far the most popular social networking site in the world, measured by the number of monthly active users (Table 2). Only in Russian-speaking countries (VKontakte and Odnoklassniki) and in China (QZone) are other platforms more popular than Facebook (http://vincos.it/world-map-of-social-networks/). Twitter and Instagram are widely popular in different parts of the world, especially in Europe, the Americas, and India. While the photo-sharing site Flickr is not as popular measured in number of users, it is often used for sharing nature-related content, and it has been widely used in research for studying nature recreation and cultural ecosystem services (see references in Table 2).

T. Toivonen, et al. Biological Conservation 233 (2019) 298-315

Table 2 (partial)

(platform name not recovered)
  Studies: (... Ghermandi, 2016; Gliozzo et al., 2016; Hausmann et al., 2018; Martinez-Harms et al., 2018; Richards and Friess, 2015; Sonter et al., 2016; Spalding et al., 2017; Tenkanen et al., 2017; Walden-Schreiner et al., 2018; Willemen et al., 2015)

Facebook
  Type: Social networking
  Description: General purpose social networking site. Content (text, image, video) is shared on personal profiles, groups and organizational pages.
  Monthly active users (millions): 2 230
  API: Facebook Graph API: https://developers.facebook.com/docs/graph-api/
  Notes: Specific permissions are required for reading data using the Graph API. Access to Facebook Pages API, which provided access to public posts within a limited time, was closed in April 2018 (Freelon, 2018).
  Studies: N = 7 (Campbell and Engelbrecht, 2018; Di Camillo et al., 2018; Jarić et al., 2016; Lunstrum, 2017; Macdonald et al., 2016; Papworth et al., 2015; Rocha et al., 2017)

Twitter
  (row not recovered)

YouTube
  Type: Media sharing
  Description: Video-sharing platform.
  Monthly active users (millions): 1 900
  API: YouTube Data API: https://developers.google.com/youtube/v3/
  Notes: There is a daily quota limit for using the YouTube Data API.
  Studies: N = 4 (Di Camillo et al., 2018; Dylewski et al., 2017; Giovos et al., 2018; Macdonald et al., 2016)

TripAdvisor
  N/A
Information about the platform user base is a key element when accounting for platform-specific biases and population biases in social media analysis (Ruths and Pfeffer, 2014). Social media usage varies among different population groups, but detailed statistics are difficult to obtain globally. In the U.S., as an example, social media are popular in all age groups, but young adults (18- to 29-year-olds) are more likely than other age groups to use social media (Kemp, 2018). Surveys conducted in Finnish and South African national parks support the same finding. Specifically, younger visitors to national parks are more likely than older visitors to share their nature experiences on social media, while women are more likely than men to post their experiences online. The lack of detailed statistics makes it challenging to estimate the representativeness of the data in different spatial contexts, for example, within a specific national park.
Social media platforms differ in the type of content people share on them (Kaplan and Haenlein, 2010; Thelwall, 2009). Media-sharing platforms such as Flickr and Instagram are rich in visual content and related text descriptions and comments, while microblogs such as Twitter consist primarily of short text content with embedded images and links to other online content. Photo-sharing platforms seem to contain more content about people's activities and on-site observations, whereas microblogs might be more focussed on discussions around specific topics. General purpose social networking sites such as Facebook may contain a mixture of different content types. Despite these platform differences, people might share the same content across multiple platforms. For example, many geotagged tweets originate from Instagram.

Data acquisition
There are different approaches to acquiring social media data, ranging from manual searches to programmatic access (Batrinca and Treleaven, 2015; Lomborg and Bechmann, 2014). These approaches vary in the skill level required of the analyst and in the volume of data acquired. For instance, manually browsing through social media groups is time-consuming but requires little computational skill, while obtaining data using automated tools typically requires technical knowledge about the interface, programming skills, and an appropriate computing infrastructure for continuous data storage. Below, we examine the main data acquisition approaches in more detail.
Application Programming Interfaces (APIs) provide a defined set of methods for interacting with social media platforms in a programmatic way. Social media platforms often provide an API through which third parties such as application developers and researchers can interact with the platform automatically, which makes APIs an efficient tool for researchers (Lomborg and Bechmann, 2014). There are two prevailing architectures for APIs: streaming APIs (Joseph et al., 2014) and REST (representational state transfer) interfaces (Masse, 2011). Streaming APIs continuously deliver newly posted messages, e.g., on a specific topic in a read-only format. REST interfaces are used by requesting specific data from the API, allowing more flexible queries, also back in time. Using an API often requires a so-called access token or application key, which the platforms use to track and limit API usage. The process of acquiring access to an API varies among platforms (see notes in Table 2). Paying for improved API access is an option in some platforms, for example, in Twitter. Companies may alter which data are accessible through their API (Lomborg and Bechmann, 2014), and restrict or eliminate API access at any given time (Freelon, 2018). See an example of changes in Instagram data quality in the Supplementary material (Supplement, S3).
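As a minimal sketch of the REST-style access described above, the following Python example pages through query results using an access token and pauses between requests to respect rate limits. The endpoint `https://api.example.org/v1/posts`, the parameter names (`q`, `access_token`, `cursor`) and the response fields (`data`, `next_cursor`) are hypothetical placeholders; each real platform defines its own, and its Terms of Service apply.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Optional
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the platform's documented URL.
BASE_URL = "https://api.example.org/v1/posts"


def build_url(query: str, token: str, cursor: Optional[str] = None) -> str:
    """Compose the request URL for one page of results."""
    params = {"q": query, "access_token": token}
    if cursor is not None:
        params["cursor"] = cursor
    return f"{BASE_URL}?{urlencode(params)}"


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode its JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def collect_posts(
    query: str,
    token: str,
    fetch: Callable[[str], Dict] = http_get_json,
    max_pages: int = 5,
    pause_s: float = 1.0,
) -> List[Dict]:
    """Request successive pages until no cursor is returned or max_pages is hit."""
    posts: List[Dict] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        payload = fetch(build_url(query, token, cursor))
        posts.extend(payload.get("data", []))   # assumed field name
        cursor = payload.get("next_cursor")     # assumed field name
        if not cursor:
            break
        time.sleep(pause_s)  # crude pause to stay within rate limits
    return posts
```

Passing the `fetch` function as a parameter keeps the paging logic testable without network access and makes it easy to swap in a client that handles authentication differently.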
Several data analysis software packages contain functionalities for viewing, retrieving and analysing data from social media APIs. There are both proprietary tools and free and open-source solutions, which serve the purpose of collecting content from different social media platforms. For example, the ecosystem valuation software InVEST calculates the average annual number of photo-user-days based on geolocated Flickr photographs as a proxy for nature recreation (Wood et al., 2013). TAGS (Hawksey, 2010) is an open-source extension to Google Spreadsheet to continuously collect Twitter data based on selected keywords, while COSMOS is a standalone software that can be used to collect and analyse Twitter data with an easy-to-use user interface.
There are also other options for social media data acquisition. The purchase of data from an authorized data vendor has several advantages, such as little to no manual work and programming effort, and the ability to access time series of historical data. Costs may limit the practical usefulness of this approach in conservation science. Web scraping, or web crawling, is an approach for downloading and extracting data from web pages using an automated script. In comparison to APIs, web crawlers can only access the public web, while APIs may provide access to content that requires authentication (Lomborg and Bechmann, 2014).
There are various limitations when using computational tools for querying data from social media platforms (Brooker et al., 2016). While social media as such are subject to limitations and bias (boyd and Crawford, 2012), additional gaps and quality issues may be introduced when acquiring the data computationally (Brooker et al., 2016). For example, an API might return only a subsample of the requested data due to rate limits and access levels, and the quality of this sample may be difficult to evaluate (boyd and Crawford, 2012;Brooker et al., 2016).
Regardless of the data acquisition method, the researcher is responsible for storing and analysing the data in an ethical and responsible way (Lomborg and Bechmann, 2014;Zook et al., 2017). Each social media platform has its own Terms of Service that define how data acquired from these platforms can be stored and used by third parties. Clearly, these conditions should be acknowledged when retrieving data for research purposes (Batrinca and Treleaven, 2015). For example, web scraping often violates the terms and conditions of service providers, and researchers should understand the potential consequences of this violation (Freelon, 2018).

Elements of social media data
The information content of a social media post can be broken down into several elements: user information (full name, username, number of followers, user-defined home location), content (text, image, sound, video), timestamp (time when content was shared), geotag (automatic or user-defined location for the post), and comments and likes by other users (Fig. 1).
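The elements listed above can be thought of as fields of a single record. The following Python sketch shows one possible representation; the class name `SocialMediaPost` and all field names are our own illustrative choices, not a schema used by any platform, and the user is stored as a pseudonymous identifier rather than a full name.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class SocialMediaPost:
    """One post, broken down into the elements described in the text."""

    user_id: str                                  # pseudonymized user identifier
    text: str                                     # caption or message text
    timestamp: datetime                           # when the content was shared
    geotag: Optional[Tuple[float, float]] = None  # (latitude, longitude), if any
    media_urls: List[str] = field(default_factory=list)  # images, video, sound
    like_count: int = 0
    comments: List[str] = field(default_factory=list)

    @property
    def is_geotagged(self) -> bool:
        """True when the post carries coordinates in its metadata."""
        return self.geotag is not None


# Illustrative record with made-up values.
example = SocialMediaPost(
    user_id="user-001",
    text="Morning drive #wildlife",
    timestamp=datetime(2018, 6, 2, 6, 30, tzinfo=timezone.utc),
    geotag=(-24.99, 31.59),
)
```

Keeping the geotag optional reflects the fact that only a small portion of posts carry coordinates.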
Text content is a core element of a social media post. Textual content often consists of short messages and captions that may include hashtags, emojis and external links. Additional text content is present in the comments related to the post. It is important to acknowledge that users post text in different languages and that a single post can contain text written in multiple languages. The language used in social media often resembles spoken language and contains abbreviations, emojis, hashtags and sarcasm. Twitter posts (tweets) are primarily textual, but tweets may also include interactive links to other external media (image, GIF, video and web page). The length of a tweet is currently limited to 280 characters (140 characters prior to November 2017). In contrast, Instagram and Flickr are primarily platforms for visual content: each post is an image or video that may be supplemented by a textual caption.
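As a small illustration of working with such text, the sketch below extracts hashtags from captions and counts them. The pattern `#(\w+)` is a simplifying assumption: it treats any run of letters, digits or underscores after `#` as a hashtag, which only approximates the rules that individual platforms apply.

```python
import re
from collections import Counter
from typing import Iterable, List

# Simplified hashtag pattern (an assumption, not a platform specification).
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> List[str]:
    """Return the hashtags in a caption, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def hashtag_counts(captions: Iterable[str]) -> Counter:
    """Count how often each hashtag occurs across a collection of captions."""
    counts: Counter = Counter()
    for caption in captions:
        counts.update(extract_hashtags(caption))
    return counts
```

Lower-casing merges variants such as `#Kruger` and `#kruger`, at the cost of losing capitalization that might occasionally carry meaning.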
Image content often contains photographs taken by the user but may also contain memes, infographics and other types of visualizations. The user may have applied a filter to the content, which changes the colours and brightness of the image. Based on manual content analysis of social media images shared in national parks, users often share content related to biodiversity, landscape, and various human activities (Heikinheimo et al., 2017). In a Finnish national park, users shared more content related to people and human activities, while in Kruger National Park, South Africa, most of the analysed content was related to charismatic biodiversity. The type of shared photographs also varies between platforms. Flickr was used more extensively for sharing biodiversity-related content, while Instagram was used relatively more for sharing content about social aspects of the visit.
Geographic information is another relevant element of a social media post. Location information might reveal the exact location where the post was shared. Users might also geotag or mention places they want to discuss, without actually having been there (e.g., "I wish I could travel to #Yellowstonenationalpark!"). A geotagged social media post contains latitude and longitude coordinates in its metadata. However, the spatial accuracy of these coordinates depends on the user and the platform (Hochmair et al., 2018). For example, in Flickr, a post can have the exact coordinates automatically derived from the camera device, or the user can add a location to a post afterwards using a web map application. In Instagram, users geotag their content by selecting a place name from a pre-defined list of locations (such as 'Kruger National Park' or 'The Dolomites'), and hence the geotag location only approximates the actual location where the photo was taken (Fig. 2). Consequently, micro-scale (e.g., trail-level) spatial analysis is not always meaningful even with geotagged data (Wu et al., 2017). Only a small portion of all social media posts are geotagged, but this is compensated for by the vast volume of social media data (Poorthuis and Zook, 2017). In addition to coordinates, geographic information is also shared using place names and hashtags within the textual content of a post.
Timestamps in social media data reflect the date and time when the user sent a post online, or when a photo was taken. Timestamps provide useful information on temporal patterns of observations and activities. For example, temporal changes in national park visitor patterns can be analysed based on social media timestamps (see also Box 3). Combined with content analysis techniques, timestamps may reveal public reactions to events and the time-lags for these reactions, for example, regarding the conservation of a specific species (see Box 4). The accuracy of the timestamp and the appropriate temporal scale of the analysis depend highly on the user who generated the data. Users might not post content immediately when an event happens, but only much later, e.g., when they have a good internet connection. It is also common to post old memories on social media, to reshare old content with the hashtag "#throwbacktime" or "#tb", or to talk about future events, making the timestamp of posting less relevant. Additionally, the sporadic nature of social media posting may hinder very detailed temporal analyses. If the temporal patterns are regular (for example, repeating weekly patterns in national park visitation), the data can be aggregated over longer periods of time to extract meaningful temporal patterns (Fig. 3).
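The aggregation of sporadic posts into a regular weekly pattern can be sketched as follows. The function name `posts_per_weekday` is our own, and the example assumes that timestamps have already been converted to the local time zone of the study area, which in practice is a step of its own.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def posts_per_weekday(timestamps: Iterable[datetime]) -> Dict[str, int]:
    """Pool posts from many weeks into counts per day of the week."""
    counts = Counter(ts.weekday() for ts in timestamps)  # Monday == 0
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


# Made-up timestamps spread over two weeks.
sample = [
    datetime(2018, 6, 2, 10, 0),   # Saturday
    datetime(2018, 6, 3, 9, 0),    # Sunday
    datetime(2018, 6, 6, 12, 0),   # Wednesday
    datetime(2018, 6, 9, 15, 30),  # Saturday
]
weekly_pattern = posts_per_weekday(sample)
```

The same idea extends to hours of the day or months of the year by replacing `ts.weekday()` with `ts.hour` or `ts.month`.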
The user profile of a social media account holder contains varying degrees of information about the user. Username, profile picture and a short description ("bio") available in the user profile can reveal different background characteristics about the user (Table 3). Some platforms allow users to add information about their home location/place of residence with a varying accuracy (Facebook, Twitter, and Flickr), while other platforms lack this information (e.g., Instagram). User profile information combined with content and spatial locations of posts has been used to infer further information such as age, gender, nationality, household type, and origin of the user (Longley et al., 2015). While these pieces of information are seldom accurate, they may provide relevant information on who posts what and when in social media and how the preferences, activities, opinions or spatial patterns differ among different groups of people.
Likes and comments, as well as information about friends and followers of a user, can reveal interaction and communication on social media. Likes may reveal, for example, appreciation and preferences towards specific species, and comments may tell about the reactions and sentiment towards different topics. Likes, comments and friendships may reveal social networks around specific posts, topics and users. Information about social networks can be used, for example, for studying public reactions to specific events such as "rhino poaching" or "#cecilthelion", or to detect influential individuals and actors on social media.
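One simple way to look for influential accounts is to treat each like or comment as a directed link from the reacting user to the author of the post and to count incoming links. The sketch below uses this in-degree count as a stand-in for influence, which is a simplifying assumption; the function names are our own, and more refined network measures exist.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Interaction = Tuple[str, str]  # (user who reacted, author of the post)


def in_degree(interactions: Iterable[Interaction]) -> Counter:
    """Count the interactions each author receives, ignoring self-interactions."""
    degree: Counter = Counter()
    for source, target in interactions:
        if source != target:
            degree[target] += 1
    return degree


def most_influential(interactions: Iterable[Interaction], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n authors ranked by received interactions."""
    return in_degree(interactions).most_common(top_n)
```

Repeated interactions between the same pair of users are counted each time here; counting unique reacting users instead would only require collecting the sources in a set per author.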

Overview of social media analysis workflow
Extracting meaningful information from social media data is an iterative process (Fig. 4) that often follows a typical data mining workflow (Han et al., 2011), in which a researcher needs to consider questions ranging from the acquisition and storage of data in an efficient and safe manner, to cleaning, filtering, transforming and enriching the data into a format that can be used to conduct the actual analyses.
Social media data, like big data sources in general, are characterized by large volumes of data, internal variability, varying veracity and high velocity of accumulation (Kitchin, 2014). Applied research often benefits from making big data small (Poorthuis and Zook, 2017), i.e., extracting only relevant data for further analysis. Filtering is indeed needed, as social media may contain data generated by bots (automated data generation algorithms) and advertisers, and the data may be inaccurately georeferenced, purposefully sarcastic, or contain merely a circulating meme or a funny cat video. Hence, cleaning and filtering the data are two of the most crucial steps in any social media data analysis workflow (Varol et al., 2017). Transforming the structure and format of the data (for example, spatial and temporal aggregation) and enriching the data with relevant metadata (for example, language identification; Hiippala et al.; Sloan et al.) might also be necessary before the actual analysis. At every step, but particularly when drawing conclusions, a critical assessment of the results is needed. The results should be compared with other available information, if possible. Those interpreting the results should also be reminded of the biases and should not assume that social media reflect all behaviours of people. In the following sections, we focus on one step of this iterative data mining flow: the analysis. We describe contemporary methods for spatio-temporal analysis, content analysis and network analysis in the context of social media data mining.
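The cleaning and filtering step can be sketched as a sequence of simple rules. In the example below, posts are plain dictionaries with the hypothetical keys `id`, `user`, `lat` and `lon`; the per-user posting threshold is a deliberately crude stand-in for bot detection and is our assumption, not a method from the cited literature.

```python
from collections import Counter
from typing import Dict, List, Tuple

BoundingBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def clean_posts(posts: List[Dict], bbox: BoundingBox, max_posts_per_user: int = 100) -> List[Dict]:
    """Drop duplicates, suspiciously active accounts, and posts outside the study area."""
    min_lat, min_lon, max_lat, max_lon = bbox

    # 1. Remove verbatim duplicates by post identifier.
    seen_ids = set()
    unique = []
    for post in posts:
        if post["id"] not in seen_ids:
            seen_ids.add(post["id"])
            unique.append(post)

    # 2. Count posts per user to flag implausibly active accounts.
    per_user = Counter(post["user"] for post in unique)

    kept = []
    for post in unique:
        if per_user[post["user"]] > max_posts_per_user:
            continue  # crude bot heuristic (assumption)
        lat, lon = post.get("lat"), post.get("lon")
        if lat is None or lon is None:
            continue  # not geotagged
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue  # outside the study area
        kept.append(post)
    return kept
```

Each rule discards data, so it is worth recording how many posts are removed at every step in order to judge how the filtering affects the representativeness of the remaining sample.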

Spatio-temporal analysis
Spatio-temporal analyses of social media data take advantage of the geographic information and timestamp elements of social media posts (see Section 3.3). Gathering spatially and temporally explicit data on human activities has become widely feasible through the use of smartphones. Smartphones record their locations both for mobile network operator services and through mobile applications such as social media (Kitchin, 2014; Sui and Goodchild, 2011). The data collected by mobile devices are used as proxies to understand the changing distributions of people, i.e., the dynamic population, but also to study the movement patterns of people (Frank et al., 2014; González et al., 2008; Järv et al., 2014). Broadly, there are two main groups of spatio-temporal analysis approaches that are useful for social media data analytics: location-based and person-based analysis (see Box 1).

Location-based approach
A location-based (or place-based) approach is a relatively straightforward way of analysing the spatial patterns of social media posts and users using standard spatial analysis approaches, from plotting the points on a map to calculating density surfaces or aggregating data over spatial units. Location-based approaches have been used, for example, to assess the presence of people in protected areas or to identify hotspots of human-nature interactions (Levin et al., 2015; Tenkanen et al., 2017). Such information may be complementary to official visitor statistics (see Box 1). Statistical models with ancillary data allow us to explain the variation in social media post or user densities with various environmental variables, to gain understanding of, for example, landscape values or visitation preferences of people (van Zanten et al., 2016; Yan et al., 2018). Longitudinal location-based analyses are useful when studying the temporal variation in protected area visitation or 'dynamic populations'. Such variation is difficult to track with conventional data collection methods (Wood et al., 2013). Combinations of spatio-temporal and content analysis approaches for social media data (see Section 3.3) have been used to study the spatial distribution of species observations (Willemen et al., 2015), landscape values (van Zanten et al., 2016), human activities in nature, and sentiments and semantics related to a place (Hu, 2018; Liu et al., 2015).
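Aggregating posts over spatial units can be sketched with a regular latitude-longitude grid. The example below counts unique users rather than posts per cell, so that a single very active user does not dominate a hotspot; the function name, the tuple layout and the 0.5-degree cell size are illustrative assumptions, and a projected coordinate system would be preferable for analyses that depend on equal-area cells.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Post = Tuple[str, float, float]  # (user identifier, latitude, longitude)
Cell = Tuple[int, int]           # (row index, column index) of a grid cell


def users_per_cell(posts: Iterable[Post], cell_size_deg: float = 0.5) -> Dict[Cell, int]:
    """Count unique users whose posts fall into each grid cell."""
    users_in_cell = defaultdict(set)
    for user, lat, lon in posts:
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        users_in_cell[cell].add(user)
    return {cell: len(users) for cell, users in users_in_cell.items()}


# Made-up points: two users share one cell, a third posts elsewhere.
sample_posts = [
    ("u1", -24.99, 31.59),
    ("u1", -24.98, 31.58),
    ("u2", -24.97, 31.55),
    ("u3", -23.85, 31.25),
]
density = users_per_cell(sample_posts)
```

The resulting dictionary can be joined to environmental variables per cell for the kind of statistical modelling mentioned above.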

Person-based approach
A person-based approach, which focuses on human mobility at the individual level, requires more advanced methods than location-based analyses (Järv et al., 2014; Rashidi et al., 2017; Wang et al., 2018). Person-based spatio-temporal analysis can provide more in-depth knowledge about people's actual movements and the places they have visited, for example, trips between different national parks (Fig. 5) (Hawelka et al., 2014; Heikinheimo et al., 2017; Huang and Wong, 2016). Additionally, comparing the location information of the post and the location provided by the user's profile data may reveal global scale mobility patterns (Hawelka et al., 2014; Levin et al., 2015). Advanced analyses of visited locations and movement trajectories over a prolonged time period may benefit from additional human mobility analysis methods. These methods include anchor point modelling (Ahas et al., 2010) to reveal origins of social media users, space-time prisms (Neutens et al., 2008) to evaluate potentially reachable locations within a national park, and sequence alignment (Shoval and Isaacson, 2007) to reveal visitation frequencies between locations and national parks. Further, linking an individual's mobility pattern with the content of social media posts over a prolonged time period can provide additional information on visitors' background (see Table 3), their potential visitation preferences, and actual experiences. It could be possible, for example, to classify visitors as different tourist types such as eco-, adventure, culture and leisure tourists based on geotagged posts in previously visited locations.
Hence, person-based analysis is useful for profiling protected areas based on visitors and visitation features, and for monitoring needs for nature conservation based on mobility trajectories within and between protected areas. Not least, this knowledge is essential for developing the management and marketing of protected areas (Sevin, 2013), especially where visitor monitoring systems do not exist.
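A first step in person-based analysis is to order each user's posts in time and read off the sequence of places visited. The following sketch counts trips between parks in this way. It assumes that each post has already been assigned to a park (for example, by a point-in-polygon test), and the function name and tuple layout are our own; it is far simpler than the anchor point or sequence alignment methods cited above.

```python
from collections import Counter, defaultdict
from datetime import datetime
from itertools import groupby
from typing import Iterable, Tuple

Visit = Tuple[str, datetime, str]  # (user identifier, timestamp, park name)


def park_transitions(posts: Iterable[Visit]) -> Counter:
    """Count moves between consecutive, distinct parks in each user's timeline."""
    by_user = defaultdict(list)
    for user, timestamp, park in posts:
        by_user[user].append((timestamp, park))

    transitions: Counter = Counter()
    for visits in by_user.values():
        visits.sort(key=lambda visit: visit[0])  # chronological order
        # Collapse runs of posts from the same park into a single visit.
        sequence = [park for park, _ in groupby(p for _, p in visits)]
        transitions.update(zip(sequence, sequence[1:]))
    return transitions
```

For example, a user who posts twice from park A and then once from park B contributes a single transition `("A", "B")`. Because posting times do not necessarily match visiting times, such sequences are best read as approximate.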

Content analysis
Social media content analysis refers to a collection of qualitative and quantitative methods for systematically describing the content that users post on social media platforms. Quantitative analyses of social media content remain scarce in conservation science, and previous studies have relied mainly on time- and resource-consuming manual content analysis (Eid and Handal, 2018; Hausmann et al., 2018; Hinsley et al., 2016). However, research on artificial intelligence is making rapid progress in analysing and understanding visual and textual content, which has enabled computers to classify species in photographs, to identify patches of forests in satellite images and to evaluate the emotions expressed by social media users. Many of these advances emerge from an approach to machine learning known as deep learning, which uses artificial neural networks that can learn to perform tasks without being explicitly programmed for them, provided that the networks are given numerous examples of paired input data (e.g., an image) and the desired output (e.g., a label for an object in the image) (Goodfellow et al., 2017; LeCun et al., 2015). The following sections describe the advances brought about by deep learning in the fields of computer vision and natural language processing, which provide methods for analysing visual and textual content on social media.

Computer vision methods for visual content analysis
Computer vision is a broad, interdisciplinary field that studies the automatic processing and understanding of images. Visual content available on social media platforms such as photographs posted on Flickr, Instagram and Twitter can be analysed using computer vision methods. Visual content can be analysed, for example, to monitor species and ecosystems, as well as the threats they face. Computer vision methods can be used to identify species automatically for monitoring purposes (Norouzzadeh et al., 2018), to detect which species or wildlife products are being illegally traded on social media, or to identify cultural ecosystem services (Lee et al., 2019). Deep learning has achieved state-of-the-art results for challenging computer vision tasks such as classifying the contents of photographic images (Rawat and Wang, 2017), finding objects and identifying their outlines (He et al., 2017) and generating descriptions of entire images or their parts (Karpathy and Fei-Fei, 2015; Zellers et al., 2018) (see Boxes 2 and 3).

T. Toivonen, et al. Biological Conservation 233 (2019) 298-315

Box 1
Assessing National Park Visitor Movements - Case Study from Kruger National Park, South Africa.
Background: Nature-based tourism generates important revenues to support conservation and management in protected areas. Understanding the mobility patterns of protected area visitors is crucial for developing conservation strategies for both local management and global marketing.
Case example: Kruger National Park (KNP), South Africa, is a popular wildlife-watching destination, attracting both national and international tourists. Tourists actively share their experiences on social media during their visit. Spatial and temporal patterns of geolocated social media posts may help reveal visitors' country or region of origin. Inside the park, geolocated posts may reveal hotspots for human-nature interactions, as well as areas that may be exposed to potential pressure on biodiversity such as crowdedness. In Fig. B1

Natural language processing methods for textual content analysis
Natural language processing (NLP) is a field that studies the computational processing of natural languages such as English or Swahili. NLP methods can be used to analyse textual content on social media platforms, such as Instagram captions or Twitter posts. Applications of NLP methods in conservation science have been rare but hold much potential for extracting useful information from textual content (Becken et al., 2017). NLP methods allow, for instance, automatic language identification, sentiment analysis (e.g., detecting negative posts related to crowdedness in a national park or the killing of 'Cecil' the lion) and named entity recognition for extracting the names of locations, organizations, individuals and species mentioned in social media posts. These NLP tasks have benefitted from recent advances in representing natural language using a technique known as word embeddings (Bojanowski et al., 2017; Mikolov et al., 2013; Peters et al., 2018). Word embeddings use neural networks to map words to vectors of real numbers, which can then be used as input data for downstream NLP tasks. However, applying NLP methods to social media content remains challenging due to the use of non-standard spellings, abbreviations, creative language and multiple languages within a single post (Carter et al., 2013; Hiippala et al., 2018). NLP models are typically language- and domain-specific, which means that a model trained using English Wikipedia articles is likely to produce poor results when applied to data collected from social media platforms.
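To illustrate how word vectors can serve as input for downstream NLP tasks, the following minimal sketch compares words by the cosine similarity of their vectors. The three-dimensional vectors and the names `TOY_EMBEDDINGS`, `cosine_similarity` and `most_similar` are invented for illustration; they are not the output of any trained model, and real embeddings typically have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings"; the values are invented for illustration
# and do not come from any trained model.
TOY_EMBEDDINGS = {
    "lion": [0.9, 0.8, 0.1],
    "leopard": [0.8, 0.9, 0.2],
    "crowded": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to that of `word`."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    return max(scored)[1]
```

With these toy values, the vector for "lion" lies much closer to "leopard" than to "crowded", which is the kind of regularity that downstream tasks such as named entity recognition or sentiment analysis can exploit.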

Combining visual and textual content analysis and current challenges
Social media users typically express themselves using combinations of visual and textual content such as a photograph and its caption. Accounting for this phenomenon, known as multimodality, is an emerging topic in automatic content analysis (Ramachandram and Taylor, 2018). These combinations can improve the performance of tasks such as sentiment analysis, as the combination of photograph and caption may prove more informative for determining the sentiment of a social media post than the photograph or caption alone (You et al., 2016) because the relevant information can be distributed across both visual and textual content. The same holds true for identifying species or human activities in social media posts. Captions may provide detailed information on the species or activity, whereas the computer vision model may recognize only the broader class to which they belong.
One drawback of using deep learning is the need for high volumes of training data for learning models that generalize well to unseen data. Such training datasets are missing from conservation science, and additional efforts will be needed to create them. This effort can itself be supported by deep learning, particularly for time-consuming tasks such as segmenting objects in images (Maninis et al., 2018), which reduces the resources needed for creating domain-specific, tailored datasets for conservation science. Shortcomings in data volume may be alleviated using a technique known as transfer learning, which involves learning representations on larger data sets and fine-tuning them for different downstream tasks such as classification (Weiss et al., 2016). Pre-trained models and training data exist for various computer vision tasks (Deng et al., 2009; Krishna et al., 2017; Lin et al., 2014), whereas word embeddings and other language resources are now available for a large number of languages (Grave et al., 2018).

Social network analysis based on likes, comments and followers
Network analysis is a common methodology in computer science, and it has also been applied in ecology and social sciences (Cumming et al., 2010). Social network analysis is an approach for mapping and measuring the relationships between people, groups, and discussion topics as well as the dissemination of ideas (Wellman, 2001). Technically, a social network consists of a set of nodes and links that connect these nodes to each other (Marin and Wellman, 2014). In social media data, nodes can be individuals, posts or topics that are linked through online interaction such as likes, followerships, friendships, comments or shares (Fig. 6).
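As a minimal sketch of how nodes and links can be derived from online interactions, the example below builds an undirected network in which the weight of a link is the number of interactions between two users. The interaction records, user names and the function `build_network` are hypothetical and serve only to illustrate the data structure.

```python
from collections import defaultdict

# Hypothetical interaction records: (source user, target user, interaction type).
INTERACTIONS = [
    ("anna", "ben", "like"),
    ("anna", "ben", "comment"),
    ("ben", "carla", "follow"),
    ("dina", "ben", "like"),
]


def build_network(interactions):
    """Return an undirected adjacency map: user -> {neighbour: link weight}.

    The link weight counts how many interactions connect two users,
    regardless of the interaction type or direction.
    """
    network = defaultdict(lambda: defaultdict(int))
    for source, target, _kind in interactions:
        network[source][target] += 1
        network[target][source] += 1
    return network


network = build_network(INTERACTIONS)
```

Treating all interaction types as equal and ignoring direction is a simplifying assumption; a study could equally keep directed links or weight comments more heavily than likes.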
Forming a social network makes it possible to understand how individuals interact with each other; how strong the social ties are between individuals; how ideas, values and opinions spread in the network; and how different communities form around specific topics or opinions (Chamberlain et al., 2018; Crandall et al., 2010; Croitoru et al., 2015; Grabowicz et al., 2012; Reihanian et al., 2016; Takhteyev et al., 2012; Zhang et al., 2010). Social network analysis could provide new information for applications such as understanding the relationships among users and sellers of wildlife products, identifying influential actors in a conservation debate (Malan, 2011), or evaluating the influence of social networks when estimating the economic value of tourism and biodiversity (Spalding et al., 2017).

Fig. 5. A chord diagram, an aspatial visualization of a spatial analysis, presenting person-based spatiotemporal visitation patterns between South African national parks. For users who visited more than one park, the visualization shows the number of users who posted content in different national parks in South Africa during 2014. The diagram also reveals which parks attract the same visitors. For example, Table Mountain and Kruger are often visited by the same people, while Kruger and Augrabies Falls are seldom combined.

Box 2
Automated Content Detection from Social Media Images.
Background: Observations by nature enthusiasts contribute to species monitoring worldwide. In addition to citizen science platforms designed specifically for this purpose, there is an increased interest in using social media for collecting information about nature observations and human-nature interaction. Extracting relevant information from high volumes of social media data often requires automated content analysis. Note that it is advisable to check the platform's terms of use before applying these methods.
Case example: We show how two commonly used content analysis methods, dense captioning and instance segmentation, can be used to identify species or human-nature interactions in social media images. We demonstrate the differences between the methods by applying them to the same images. Dense captioning, which combines computer vision and natural language processing, detects visual regions of interest in the images and generates a linguistic description for each region. Instance segmentation locates objects in images and estimates their outline. This information can be used to count objects and to filter visual content for further processing such as more detailed species classification. The two approaches produce different types of content recognition, as shown with the examples.
Method: Dense captioning uses a pre-trained neural network with a VGG16 backbone, trained and implemented by the authors. Instance segmentation uses a pre-trained deep neural network, namely, Mask R-CNN (He et al., 2017) with a ResNeXt X-101-32x8d-FPN backbone, as implemented in the Detectron library (Girshick et al., 2018) (Fig. B2).

Fig. B2. A comparison of results for instance segmentation (left) and dense captioning (right) for two pairs of images. Instance segmentation shows promising results for detecting objects, even if they are only partially visible or overlap each other, which enables, for example, counting the number of objects such as individuals of a certain species. Dense captioning provides richer descriptions, which may be used for searching image contents but lacks accuracy for object detection. In both cases, the results are limited to the classes and descriptions present in the training data: the model cannot detect objects that are not present in the training data. For more examples, see the Supplementary material, S4.

Box 3
Background: Recreational activities in protected areas provide people access to cultural ecosystem services such as physical recreation and sense of place. However, the activities of visitors may cause direct or indirect disturbance to the environment. Inspecting the content of social media data in space and time provides increased understanding of activities and their relationship, e.g., to the phenological stages of nature.
Case example: Pallas-Yllästunturi in Lapland is the most visited national park in Finland. Park visitation is extensively monitored through visitor counters and systematic visitor surveys, repeated every 5 years. The surveys reveal that hiking and skiing are the most popular activities in the park, depending on the season. Social media data provide complementary information about the activities in the park, as they allow the analysis of people's activities over time. This information is useful for understanding temporal trends in activities, the relationship of these trends to changes in nature and for detecting emerging activities. Here, we demonstrate how social media text and imagery capture the temporal fluctuation of activities in different seasons.
Methods: We used Instagram posts extracted within a 10-km buffer zone around the Pallas-Yllästunturi National Park between January 2014 and April 2016 (N = 19,939). We searched for all posts mentioning skiing, hiking and biking, and related environmental conditions (snow and autumn colours) in the most popular languages (English, Finnish) using a keyword search (Fig. B3).
Further readings: Heikinheimo et al., 2017 and Supplementary material.

Of the various analytical approaches developed for social network analysis, community detection and centrality analyses may have the highest potential for such conservation-related questions. Community detection approaches can be categorized broadly into topology-based and topic-based approaches (Ding, 2011). Topology-based community detection algorithms, such as the Link clustering method (Ahn et al., 2010), the Clique Percolation method (Palla et al., 2005) or BIGCLAM (Yang and Leskovec, 2013), use information about social connections such as friendship, followership or communication between users to cluster nodes into distinguishable groups. Topic-based community detection aims at grouping individuals into communities based on the topics they discuss, or on their sentiments or opinions about a specific topic. Topic-oriented community detection (Zhao et al., 2012) combines natural language processing techniques with clustering and community detection algorithms to reveal communities that are interested in specific topics such as a species or a protected area. Sentiment modelling can further reveal whether the communities address a specific topic positively or negatively (see Box 4). Various centrality measures such as betweenness centrality, degree centrality or eigenvector centrality (Landherr et al., 2010) make it possible to identify key nodes in the social network, such as opinion leaders or key informants of a social group, which can be useful for targeting conservation marketing.
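Two of the centrality measures mentioned above can be sketched for a small example. The code below computes normalised degree centrality and unnormalised betweenness centrality for an undirected, unweighted network, the latter using Brandes' algorithm. The network `GRAPH` and the user names are hypothetical, and in practice an established network analysis library would normally be used instead of a hand-written implementation.

```python
from collections import deque

# Hypothetical undirected network: user -> set of connected users.
GRAPH = {
    "anna": {"ben"},
    "ben": {"anna", "carla", "dina"},
    "carla": {"ben", "dina"},
    "dina": {"ben", "carla", "eero"},
    "eero": {"dina"},
}


def degree_centrality(graph):
    """Share of the other nodes to which each node is directly linked."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def betweenness_centrality(graph):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, starting from the most distant nodes.
        dependency = {node: 0.0 for node in graph}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}
```

In this toy network, "ben" and "dina" have the highest degree and lie on all shortest paths that connect the peripheral users, so both measures would flag them as candidate key nodes.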

Most possibilities are still unexplored
Digital innovation, particularly social media analysis, has brought new research avenues to conservation science (Arts et al., 2015). In this paper, we have presented concrete data acquisition and analysis approaches that might benefit conservation scientists interested in using social media data as an input for their research. We focussed on social media data retrieved from social networking sites and content communities (Kaplan and Haenlein, 2010), where users can share textual and visual content and geotag their posts. The methods presented for spatio-temporal analysis, content analysis and network analysis can also be applied to other types of data such as Wikipedia and other digital corpora (Ladle et al., 2016), camera trap imagery (Norouzzadeh et al., 2018) and various kinds of citizen-science data.
There is a rapidly growing body of literature presenting different use cases of social media data in conservation science (see Table 1). Most of the published research examples have used social media for studying human-nature interactions from a place-based perspective, for example, using social media as a proxy of visitation in natural areas (Sonter et al., 2016; Tenkanen et al., 2017). Indeed, most of the existing studies are based on mapping georeferenced social media data (Section 2, Table 1). However, there is potential to move "beyond the geotag" (Crampton et al., 2013) and to leverage more advanced social media analysis techniques also in conservation science. Applying more sophisticated analysis methods and examining several elements of social media data together, from spatially and temporally tagged content to information on users and their networks, can provide new information about human-nature interactions, particularly from the perspective of people. Ultimately, social media provide information on people's behaviour (Ruths and Pfeffer, 2014) and on the human dimensions of environmental issues, which are essential to understand for successful conservation actions (Bennett et al., 2017).

Big but small social media data
Social media data are generally referred to as 'big data' (Crampton et al., 2013; Kitchin, 2014), and the volume of user-generated data worldwide is overwhelming. However, the amount of data needed in the end analysis can be reduced significantly if one is interested only in a specific topic or a specific region (Poorthuis and Zook, 2017). The data are generally rich in terms of visual and textual content and variable across topics and languages. Despite the evident noisiness of these data, social media content has been found to contain relevant information for studying human-nature interactions, from relatively local to global scale (Heikinheimo et al., 2017; Levin et al., 2015). Overall, social media data seem most useful for studying relatively broad areas that are frequently visited by people or topics that are popular among social media users. On limited spatial and temporal scales, the content may become too sporadic for meaningful analysis (Fig. 4).
Data quality remains a challenge in any social media analysis workflow (boyd and Crawford, 2012; Longley et al., 2015). Data extracted from social media can be heavily biased towards specific user groups and geographic regions (Kemp, 2018). Social media users tend to share more positive and likeable content online, instead of negative experiences, causing a positivity bias in the content (Reinecke and Trepte, 2014). Photos and text are often meant to entertain rather than document, and hence their value for purposes such as biodiversity monitoring varies. Despite the biases, social media content has been found to reflect surveyed preferences and activities of national park visitors in South Africa and Finland (Heikinheimo et al., 2017), and social media usage rates correlate with official visitor statistics in popular nature destinations in different parts of the world (Wood et al., 2013).

Fig. 6. Social media data make it possible to form a social network of a user based on a combination of follower information, likers and commentators. It is also possible to form a social network around a post or topic by generating the network based on users who liked or commented on the post/topic.
In addition to inherent biases in social media data, computational data acquisition methods may introduce additional problems to the retrieved data (Brooker et al., 2016). Most platforms offer only limited datasets and metadata to researchers, and the sampling algorithms of platform APIs remain unknown (Joseph et al., 2014). Textual content in social media is rich in information but difficult to analyse due to its variability. For example, detecting sarcasm when analysing online sentiment is challenging, particularly as posts are written in tens of languages. Specific words in one language may have completely different meanings in other languages, or they may refer to different things even within one language. As an example, geotagged posts about tigers may be found from Africa simply because people post about drinking Tiger beer.

Box 4
Assessing Public Sentiment for Conservation - Case Pangolin.
Background: Big data mining approaches can be used to follow and analyse debates around conservation (Ladle et al., 2016). Social media content analysis may reveal not only the volume of online discussion but also the sentiment users express in response to species or events and how these change over time. Understanding the sentiments of people towards conservation of species or protection of areas is important for developing conservation strategies and communication.
Case example: Pangolins are the most trafficked mammal species in the world. Awareness campaigns on social media have raised attention about the critical conservation status of the species. Social media content and sentiment analysis shows how much and with what sentiments social media users have posted about the species. With improved knowledge about public opinion, awareness campaigns can be made more effective, and public support for the conservation of the species can be increased.
Method: We collected Twitter posts (tweets) mentioning pangolin in one of 20 relevant languages from February to May 2018, using the platform's API. We then used the sentiment classification framework of Webis (Hagen et al., 2015), which is optimized for short texts such as tweets, to identify the sentiment (positive, neutral, negative) of all posts. Finally, we calculated a mean sentiment of all tweets posted in one day by assigning "positive" messages a value of 1.0, "negative" messages -1.0, and "neutral" messages 0.0. This results in a value in the interval [-1.0, 1.0] that describes the mean sentiment for each day in the dataset (Fig. B4).
Fig. B4. The volume of Twitter messages relating to the keyword "pangolin" changes over time, with users reacting to awareness campaigns, conservation events or similar news items (a). Sentiment analysis of the content of the posts allows us to track positive, neutral or negative attitudes. The variation of the sentiment over time (b) can be used to identify prevailing topics and link them to specific events (c). World Pangolin Day (17 February 2018) accounts for a clear spike in both volume and sentiment.
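The daily aggregation described in the Box 4 method can be sketched as follows. The value mapping (1.0, 0.0, -1.0) follows the description in the box; the example records and the function name `daily_mean_sentiment` are hypothetical, and the sentiment labels are assumed to have been produced beforehand by a separate classifier.

```python
from collections import defaultdict

# Numeric values assigned to each sentiment class, as described in Box 4.
SENTIMENT_VALUES = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

# Hypothetical classified tweets: (ISO date, sentiment label).
CLASSIFIED_TWEETS = [
    ("2018-02-17", "positive"),
    ("2018-02-17", "positive"),
    ("2018-02-17", "neutral"),
    ("2018-02-18", "negative"),
    ("2018-02-18", "positive"),
]


def daily_mean_sentiment(tweets):
    """Return {date: mean sentiment}, each mean lying in [-1.0, 1.0]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, label in tweets:
        totals[day] += SENTIMENT_VALUES[label]
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in totals}
```

Because the mean hides how many posts it is based on, it is useful to report the daily post volume alongside it, as is done in panels (a) and (b) of Fig. B4.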
Social media can also serve as a platform in citizen science campaigns, especially regarding studies related to biodiversity monitoring (see Table 1). For example, researchers can encourage interested citizens to post animal sightings in a special interest group on social media (Campbell and Engelbrecht, 2018;Rocha et al., 2017), or using a specific hashtag. Using popular social media platforms in active data collection campaigns may engage wider audiences in the data collection efforts and could save time from developing separate applications.
If possible, it is important to compare the patterns observed from social media with other data sources gathered with more traditional means of data acquisition, such as surveys and statistics (see examples in Heikinheimo et al., 2017; Tenkanen et al., 2017), to assess the quality of the data and to understand who and what the data represent. Social media data are created spontaneously by the users, and while this spontaneity introduces bias into the data, it is also a source of information revealing what and where people consider worth posting about.

Access to data and privacy issues
In conservation science, researchers have traditionally relied on their own data collection or data collected by conservation authorities, NGOs or active citizen scientists. These data are increasingly available online with clear licensing and well-described APIs. Social media data sources differ from this, as the platforms are first and foremost profit-making companies with business interests. Companies may, without prior notice, change the usage rights to the data, the data structure, or access to APIs (Freelon, 2018; Lomborg and Bechmann, 2014), with consequences for data availability and quality (see Supplement S3). It may be challenging to keep up with ongoing changes in the APIs and use rights, as the changes are seldom openly documented. This limitation is problematic particularly for longitudinal research projects and monitoring, and it challenges the reproducibility of research (Lomborg and Bechmann, 2014). When planning long-term data collection from social media, it is good to acknowledge the risk of discontinuity of the platform or data access.
One way to overcome the access challenges is increased direct collaboration between researchers and the companies. At best, the broader research community of digital conservation could establish collaboration platforms with the social media companies. With established collaboration models and infrastructures in place, the administrative work related to contracts would be reduced and analyses could be run within the safe environments of social media platforms without breaching privacy fences. This kind of collaboration could be very helpful in, for example, investigating illegal wildlife trade on social media.
Moreover, new mechanisms could be established so that social media users are able to actively donate their data for research purposes via the platforms. Using automated content recognition techniques, Instagram has already prompted users not to share content that is abusive to animals. What if, the next time you post a picture of an animal or a plant on your favourite social media platform, you received a prompt: "Looks like you are sharing a nature observation. Would you like to share this post for research use?", allowing you to save a copy of your post to a research database? The newly implemented European Union General Data Protection Regulation (GDPR) supports this, as it obligates companies to store their data in a format that is portable, i.e., machine readable by other applications.

Future directions
In a workshop organized at the European Conference for Conservation Biology on 12 June 2018, we asked expert participants (n = 23) to identify new research avenues for applying social media data in conservation science (see Supplement S2). Brainstorming resulted in novel ideas for biodiversity monitoring, visitor mapping, and understanding values and sentiments; similar topics emerged from our literature review (Section 2). The majority of suggestions concentrated on gathering new knowledge on people's values and appreciation related to biodiversity, either in certain locations or related to certain species, or more broadly on understanding nature-centred societal discourses. Social media were also seen as a platform for active communication by individuals and organizations, and as a channel for nudging people towards more biodiversity-aware lifestyles.
We foresee significant advances in assessing these topics by using automated multimodal (text, image, video and audio together) content analysis. Rapid advancements in image recognition, natural language processing and other applications of machine learning provide possibilities to use social media data more comprehensively (see Box 4). Mining opinions and sentiments, and other techniques linked to conservation culturomics (Ladle et al., 2016), will become more accurate as algorithms are better trained to correctly interpret, e.g., sarcastic expressions or content in multiple languages. Advanced content analysis methods used together with spatial and temporal analytics may allow, for example, monitoring of public conservation discourses (Becken et al., 2017), investigating illegal wildlife trade, identifying emerging opportunities and threats for biodiversity conservation in protected areas, or analysing the return on investment of biodiversity campaigns.
When interpreting results based on social media data analysis, researchers must carefully consider ethical issues, minimize potential harm and aim at fairness. Privacy issues and data anonymization should be considered even when using publicly shared content. The potential biases should be borne in mind so as not to accidentally suggest that the data reflect the opinion or behaviour of the entire population. Overall, social media data are best used with other data sources to gain a full and dynamic picture of a conservation issue, for the benefit of people and biodiversity.