"photosearcher" package in R: An accessible and reproducible method for harvesting large datasets from Flickr

The social media website Flickr contains a wealth of spatial and temporal metadata, which can play an important role in environmental research, including cultural ecosystem service and ecological assessments. However, the uptake of Flickr is potentially limited by issues with accessibility to the Flickr Application Programming Interface (API).


Scientific motivation
Biodiversity and social science datasets are key to many areas of environmental research. Where areas are at risk from overexploitation by people, understanding visitation rates can be useful for proactive conservation [9]. Here, social and demographic data provided by social media sites can represent actual visitation rates [2], presenting opportunities to understand how humans interact with nature and how best to inform management choices relating to conservation and ecotourism.
Here, we develop an approach to accessing data from Flickr (flickr.com), an image and video hosting site with a large database of photographs accompanied by accessible metadata. Flickr has advantages as a source of data as it has an active user base with up to 25 million new uploads a day [10] and generally a wider demographic of users than other social media sites [11]. Furthermore, each photograph's metadata can be obtained by making calls to the Application Programming Interface (API), an interface for accessing the Flickr server. This metadata usually contains a georeferenced location as well as the time and date the image was taken, and has the potential to be used as a primary source of data for answering ecological questions. Flickr has already been successfully used in cultural services studies, such as wildlife watching [12], recreational activities [13], landscape aesthetic qualities [14], and visitation rates in both protected areas [8] and national parks [15]. Additionally, Flickr has vast potential as a source of biodiversity data [16]. For example, it has been demonstrated as a successful tool for cross-validating Global Biodiversity Information Facility records [17] and assessing ecological niches [18]. It has been suggested that Flickr could be utilized to explore not just cultural ecosystem services, but wider ecological questions at a large scale [19]. However, due to some limitations, the potential of Flickr as a source of data for a wider range of studies has yet to be fully explored.

Current limitations
Flickr has specific limitations that need to be addressed when using it as a data source. For example, searching for photographs at a given spatial location is restricted to searching via either a bounding box or a Flickr-specific location identifier. As a result, researchers have had to add data manipulation steps in order to download image metadata for specific search boundaries [20]. Furthermore, searches for photographs through the Flickr API will only return 4,000 unique results per search criterion, limiting the ability to access data easily for spatially or temporally large searches. For searches that have more than 4,000 results, the API will appear to return metadata for all of them. However, the Flickr API only returns data for the first 4,000 images; the pages that follow are duplicates of those first 4,000. This means users can obtain what appears to be more than 4,000 results but end up with only the metadata for the first 4,000 unique images, repeated multiple times. Some authors have limited their number of returns per query to fewer than 4,000 to get around this [21]. This workaround potentially omits part of the available data and introduces biases, such as excluding early or new users of Flickr, or missing temporal patterns. Furthermore, use of the API currently has limited accessibility and reproducibility. First, the API can only be accessed through programming languages such as Python, R and Java, so authors must be well versed in a programming language to access datasets. Within R [22] there is a set of generic packages that allow harvesting data through APIs. However, researchers who want to use these packages need an extensive understanding of the Flickr API as well as of the numerous R packages needed to call it. Second, authors rarely provide complete methodologies or their code, limiting the ability to replicate studies.
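The duplication problem described above can be checked for after a download has finished. The following is a minimal sketch on simulated data, not a live API call: the `id` column name, the page size of 250 and the way later pages wrap back onto earlier records are all assumptions made for illustration.

```r
# Simulate paging past the 4,000-result cap: 6,000 photos appear to match,
# but every page after the cap repeats records from the first 4,000.
cap <- 4000
per_page <- 250
n_pages <- 24  # 24 pages * 250 rows = 6,000 rows requested

pages <- lapply(seq_len(n_pages), function(p) {
  idx <- ((p - 1) * per_page + 1):(p * per_page)
  # Assumed behaviour: indices beyond the cap map back onto the first 4,000
  idx <- ((idx - 1) %% cap) + 1
  data.frame(id = sprintf("photo_%05d", idx), stringsAsFactors = FALSE)
})
results <- do.call(rbind, pages)

n_rows <- nrow(results)                 # rows that appear to have been returned
n_unique <- length(unique(results$id))  # rows that are actually distinct

# Keep only the first occurrence of each photograph
deduplicated <- results[!duplicated(results$id), , drop = FALSE]
```

Comparing `n_rows` with `n_unique` is a quick way to see whether a large search has silently hit the cap.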
To increase the uptake of Flickr as a source of data, there is a need for an application which makes API calls more reproducible and more accessible to all.

Related work
The use of an R package for making calls to the Flickr API improves the reproducibility of studies using this data as well as giving users control over what they search. The existing R package "FlickrAPI" (cran.r-project.org/web/packages/FlickrAPI/index.html) provides some limited functionality of the Flickr API within the R environment. Other tools, such as the Natural Capital Project's INVEST Recreational Tool (https://naturalcapitalproject.stanford.edu/), have also been developed to query the Flickr API. However, the FlickrAPI package only provides functions for obtaining information for a single known image, and the INVEST tool only returns all images for an area. These tools do not provide functionality for searching based on criteria such as keywords or location. This limits their usefulness for ecological studies, which often require spatially explicit searches based on keywords, such as a target species. Furthermore, neither the FlickrAPI package nor the INVEST tool provides users with the functionality to download the raw images or return demographic data about Flickr users.

Software architecture
To overcome the challenges of using the Flickr API, we have developed the photosearcher R package (github.com/ropensci/photosearcher), aimed at facilitating reproducible requests to the Flickr API. The functions in this package make calls to the Flickr API and return both the raw photographs and their additional metadata in accessible formats, whilst overcoming the current limitations of larger spatial and temporal requests to the API.

Software functionalities
The photosearcher package provides a reproducible way of accessing geotagged photographs through search queries, as well as several other functions that provide datasets useful for a range of ecological analyses. The photo_search function allows users to define a set of search criteria, which are then queried against the Flickr database. A data frame containing the metadata for the photographs matching the search criteria is then returned. To enable the use of Flickr across different disciplines, the photo_search argument text allows searches to be defined by keywords. Searches will then only return photos that contain the keywords in their title, description or tags. Users can also limit searches to find keywords in the photographs' tags only. As well as keywords, other search variables include the minimum and maximum date the photograph was taken and a search location, provided as a bounding box, a spatial layer or a Flickr-specific location (a "where on earth" identifier, or woeid; see flickr.com/places/info/24865675). The ability to refine search parameters allows for a more focused approach to using Flickr's geotagged photographs by only returning those relevant to the study. The package also provides additional functionality for downloading images, getting user information and assessing related tags.
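As an illustration of combining a keyword, a date range and a bounding box, the sketch below collects the search criteria in a list and passes them to photo_search. The arguments mindate_taken, maxdate_taken, text and has_geo are named in this paper; the argument name `bbox`, its comma-separated "min_lon,min_lat,max_lon,max_lat" format and the coordinates themselves are assumptions that should be checked against the package documentation.

```r
# Search criteria kept in a list so they can be shared alongside the results.
# `bbox` (name and string format) is an assumption, not confirmed by the paper.
search_args <- list(
  mindate_taken = "2019-01-01",
  maxdate_taken = "2019-12-31",
  text = "hiking",
  bbox = "-5.0,50.0,2.0,56.0",  # illustrative coordinates only
  has_geo = TRUE
)

# The call needs the photosearcher package and a Flickr API key, so it is only
# attempted in an interactive session where both can be provided.
if (interactive() && requireNamespace("photosearcher", quietly = TRUE)) {
  hiking <- do.call(photosearcher::photo_search, search_args)
  head(hiking)
}
```

Storing the arguments as a list is one way to report exactly what was searched; the list can be saved with the results or printed in a methods section.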

Spatial distribution and drivers of recreational cultural ecosystem services
The photo_search function returns a wealth of spatial, temporal and textual metadata. Here, we demonstrate the applications of this data by assessing recreational cultural ecosystem services, searching for photographs of hiking in the contiguous USA. The photo_search function returned 160,923 photographs for hiking in the USA between 2015 and 2020 in 61 minutes (Fig. 1). To return metadata for this large number of photographs, the bare minimum number of necessary calls to the Flickr API would be 644 (250 photographs per search). The photo_search function therefore makes a minimum of 10.55 calls a minute to the API, returning metadata for approximately 2,637 photographs per minute (NB in order to minimize errors, photo_search makes more than the minimum number of calls).
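The throughput figures above follow from simple arithmetic, reproduced here as a short worked example. The inputs are the values reported in the text; the variable names are ours.

```r
n_photos <- 160923  # photographs returned for hiking, 2015 to 2020
per_call <- 250     # photographs per API call, as stated in the text
minutes  <- 61      # reported run time

# Fewest calls that could cover every photograph
min_calls <- ceiling(n_photos / per_call)

# Lower bounds on throughput implied by the run time
calls_per_minute  <- min_calls / minutes
photos_per_minute <- n_photos / minutes

round(c(min_calls = min_calls,
        calls_per_minute = calls_per_minute,
        photos_per_minute = photos_per_minute), 2)
```

This gives 644 calls, a little over 10.5 calls per minute and roughly 2,600 photographs per minute, in line with the figures quoted.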
Like the photo_search function, user_info typically returns large social datasets in short periods of time. Here, the user_info function took just under 24 minutes to return information on 6,514 individuals, about 271 users per minute. Normally, getting a user's information requires a new call to the API for each individual; the user_info function, however, allows searches for multiple users at once, returning all available social data including hometown and occupation. The user_info function therefore provides an efficient method for obtaining large social datasets. Potential uses for the city datasets include network analysis to track travel routes as well as to understand the socio-economic drivers of supply and demand for cultural ecosystem services. By being able to assess rapidly where visitors travel from, protected area managers can inform visitor management plans. The social datasets could also be combined with ecological datasets for studies such as understanding human-wildlife interactions or ecotourism management. The hometown information can be plotted by geocoding each location with functions such as the geocode_OSM function in the tmap R package (cran.r-project.org/web/packages/tmap/) (Fig. 2).

Spatial and temporal distribution of species
To demonstrate the ease of using the photosearcher package for obtaining large ecological datasets, we use the photo_search function to find image metadata containing either the common or Latin names of a number of species (Table 1). The Flickr metadata can contain the complete date and time data, allowing for investigation of temporal distributions such as migratory patterns, diurnal cycles and floral phenology. Flickr may be best suited to large charismatic species that are easily identifiable by the public, such as some birds [16]. The following piece of code outlines the basic search used (for a reproducible document see SI. 1).
library(photosearcher)
species_name <- photo_search(
  mindate_taken = "2000-01-01",
  maxdate_taken = "2020-01-01",
  maxdate_uploaded = "2020-01-01",
  text = "red fox",  # replace with each species' common or Latin name
  has_geo = TRUE
)
The photo_search function was able to return large datasets in short periods of time, for example returning 25,225 unique geotagged data points globally for the red fox in just over 14 minutes. These results reiterate that this method generally does not result in exceptionally long search times. Furthermore, the results demonstrate that large spatial and temporal searches would require a large number of API calls: for example, a global study for barn owls would require 70 calls to the API and searches for brown bears would require 87. As has_geo = TRUE, the returned metadata contains latitude and longitude information; here we map the distributions of the photographs tagged with species names (Fig. 3).
Users should be aware that species distributions based on Flickr photographs may have erroneous points. First, Flickr users may misidentify species. To overcome the issue of mistagged images, users should properly define their search criteria (e.g. by using the Latin name, or a shapefile of the known distribution), or they can use classification techniques to confirm which photographs show positive sightings. Second, some distributions can be influenced by visitor attractions such as zoos and museums. These erroneous points can be removed using the CoordinateCleaner R package (cran.r-project.org/web/packages/CoordinateCleaner/index.html).
Furthermore, the temporal metadata can be used to assess change in species over time (Fig. 4). Here we demonstrate that sightings of brown bears vary monthly, with fewer sightings occurring during periods of known hibernation. This temporal metadata could be combined with the spatial data to assess migratory patterns, or with photograph contents (accessible via the download_images function) to assess animal behaviour or plant phenology.
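A monthly summary of the kind described above can be derived from the date-taken metadata in a few lines. The sketch below uses a small simulated data frame in place of real photo_search output; the column name `datetaken` and its "YYYY-MM-DD HH:MM:SS" format are assumptions, and the dates are invented.

```r
# Simulated stand-in for photo_search() output (not real sightings)
photos <- data.frame(
  id = as.character(1:8),
  datetaken = c("2018-01-14 10:02:11", "2018-05-03 08:15:00",
                "2018-06-21 17:40:09", "2018-06-30 12:00:00",
                "2018-07-04 09:30:45", "2018-07-19 19:05:12",
                "2018-07-28 06:55:30", "2018-09-02 14:20:00"),
  stringsAsFactors = FALSE
)

# Parse the timestamps and extract the month number (1 to 12)
taken <- as.POSIXct(photos$datetaken, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
month_num <- as.integer(format(taken, "%m"))

# Count photographs per calendar month, keeping months with zero photographs
monthly_counts <- table(factor(month_num, levels = 1:12, labels = month.abb))
monthly_counts
```

Keeping all twelve levels in the factor means months without photographs appear as zeros rather than being dropped, which matters when looking for seasonal gaps such as hibernation periods.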

Impact
photosearcher provides a more accessible and reproducible method of accessing the Flickr API, as well as overcoming limitations that prevent researchers from obtaining datasets. Because photosearcher was created within the R environment, it is freely available to all researchers. Furthermore, by consolidating the code into user-friendly functions, the photosearcher package expands the accessibility of the Flickr dataset to non-data scientists. The simple functions also allow researchers to share their methods in a transparent and reproducible manner. However, we note that as people can add new uploads, edit metadata or delete their images, a search for the same criteria on two different occasions may return a different number of results. By providing arguments for limiting searches by the date photographs were uploaded, the photo_search function helps to minimize any changes between repeated searches. This, combined with the ability to share the arguments used in the function calls or a full reproducible document (SI. 1), makes photosearcher well suited to producing replicable results when working with Flickr data.
The photosearcher package allows researchers to obtain the full range of data available. To overcome the API limit of 4,000 results per query, photo_search requires the user to provide a minimum and maximum search date for when the photographs were taken. If the number of photographs matching the user-defined criteria is less than 4,000, the metadata is returned. However, if the number of photographs is greater than 4,000, the metadata for the first 4,000 photographs is returned chronologically. The function then extracts the maximum date on which these images were taken and carries out a new search using this as the mindate_taken argument. The function does not assume that the new search contains fewer than 4,000 images and therefore checks whether the new search contains more than 4,000 results. In this way, the package will continue to dynamically split the initial search into new searches until it returns all available unique images from the initial search. The only time when all data may not be returned is if there were more than 4,000 images for a given second. As this process is automated, users do not have to make additional calls manually to test which range of dates will return fewer than 4,000 results. Through using an automated method of splitting the searches, the photo_search function provides users with a time- and cost-efficient method of data collection. Furthermore, unlike other software such as the INVEST tool, the photo_search function returns the full metadata available for each photograph. This metadata can be useful for novel research by helping filter results to overcome some of the limitations of social media data. For example, by returning a Flickr-derived measure of spatial accuracy, users of the photosearcher package can quickly filter the returned results based on the accuracy of the spatial reference.
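The date-splitting procedure described above can be sketched on simulated data. This is an illustration of the idea under stated assumptions, not the package's actual implementation: `archive` is a made-up stand-in for the Flickr database, and `capped_search` and `search_all` are hypothetical helper names.

```r
# Simulated photo archive standing in for Flickr (ids and timestamps are made up)
set.seed(42)
n_total <- 10500
archive <- data.frame(
  id = sprintf("p%05d", seq_len(n_total)),
  taken = as.POSIXct("2019-01-01 00:00:00", tz = "UTC") +
    sort(sample.int(365 * 24 * 3600 - 1, n_total, replace = TRUE)),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for one API query: returns at most `cap` records,
# earliest first, plus the total number of matches.
capped_search <- function(min_taken, max_taken, cap = 4000) {
  hits <- archive[archive$taken >= min_taken & archive$taken <= max_taken, ]
  hits <- hits[order(hits$taken), ]
  list(total = nrow(hits), photos = head(hits, cap))
}

# Keep re-querying from the latest date seen until a query fits under the cap
search_all <- function(min_taken, max_taken, cap = 4000) {
  collected <- list()
  repeat {
    res <- capped_search(min_taken, max_taken, cap)
    collected[[length(collected) + 1]] <- res$photos
    if (res$total <= cap) break
    latest <- max(res$photos$taken)
    if (latest == min_taken) {
      warning("More than `cap` photos share one timestamp; some are not returned")
      break
    }
    min_taken <- latest  # restart from the last date already retrieved
  }
  out <- do.call(rbind, collected)
  out[!duplicated(out$id), ]  # the boundary timestamp is fetched twice
}

all_photos <- search_all(
  as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
  as.POSIXct("2019-12-31 23:59:59", tz = "UTC")
)
nrow(all_photos)
```

Restarting from the latest retrieved timestamp, rather than one second after it, re-fetches the boundary records but ensures that photographs sharing that second are not skipped; duplicates are then dropped by id. The guard inside the loop corresponds to the one case noted in the text where data cannot be fully returned.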
Moreover, the anonymous user ID allows users to calculate visitation metrics such as photo-user-days [2], to overcome bias introduced by very active users. We have also provided an option to supply a shapefile to search for a specific area. The photo_search function automatically transforms the provided shapefile to a bounding box, which is then sent to the Flickr API to search for photographs. The function then extracts and returns only the photographs that fall within the original shapefile.
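The photo-user-days metric mentioned above counts each user at most once per day, however many photographs they took. A minimal sketch on simulated metadata follows; the column names `owner` and `datetaken` are assumptions about the returned data frame, and the values are invented.

```r
# Simulated metadata standing in for photo_search() output
photos <- data.frame(
  owner = c("userA", "userA", "userA", "userB", "userB", "userC"),
  datetaken = c("2019-06-01 09:00:00", "2019-06-01 15:30:00",
                "2019-06-02 10:00:00", "2019-06-01 11:45:00",
                "2019-06-01 11:50:00", "2019-06-02 08:20:00"),
  stringsAsFactors = FALSE
)

# Reduce each timestamp to its calendar day
photos$day <- as.Date(substr(photos$datetaken, 1, 10))

# One photo-user-day is one unique (user, day) pair
user_days <- unique(photos[, c("owner", "day")])

# Number of distinct users photographing on each day, and the overall total
pud_per_day <- table(as.character(user_days$day))
total_pud <- nrow(user_days)
```

Here six photographs collapse to four photo-user-days, so a user who uploads many images on one visit contributes no more than a user who uploads one.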
The other functions available in photosearcher are also designed to be useful in novel ecological assessments. For example, because the ID of the user uploading each image is returned, additional analyses can be carried out using their publicly available data, returned by the user_info function. Furthermore, the download_images function allows users to download the images themselves, which could be used for additional analysis or validation. The returned images could be classified by hand or through machine learning techniques to answer a range of ecological questions, including the distribution of ecosystem services [19] and identifying plant species [23]. The plant species dataset [23] was derived from the outputs of the photo_search function.

Conclusions
The R package photosearcher provides an easily accessible and reproducible method for accessing large datasets from Flickr. The simple skill set needed to use the photosearcher package will increase opportunities for use of Flickr data by non-data scientists. By addressing the challenges and limitations associated with API access, photosearcher provides the basis for a standardized method for API calls. The photosearcher package provides a quick and inexpensive method of gathering large quantities of data, with the methods presented here demonstrating how the package can help provide extensive biological and social data. We hope that the package allows future studies to build upon the current use of Flickr in cultural ecosystem service research, whilst helping users to answer a wider array of ecosystem service and ecological questions.