Data Curation for a Community Science Project: CHIME Pilot Study

This paper introduces a community science project, Citizen Data Harvest in Motion Everywhere (CHIME), and the findings from our pilot study, which investigated potential concerns regarding data curation. The CHIME project aims to build a cyclist community–driven data archive that citizens, community scientists, and governments can use and reuse. While citizens’ involvement in the project enables data collection on a massive, unprecedented scale, the citizen-generated data (cyclists’ video data recorded with wearable cameras in the CHIME context) also presents several concerns regarding curation due to the grassroots nature of the data. Learning from our examination of cyclists’ video data and interviews with them, we will discuss the curation concerns and challenges we identified in our pilot study and introduce our approach to addressing these issues. Our study will provide insights into data curation concerns, to which other citizen science projects can refer. As a next step, we are in the process of developing a data curation model that will consider other factors related to this community science project and can be implemented in future community science projects.


Introduction
Public participation and engagement in scientific research is not new, and interest in citizen science (in other words, research projects that engage the public as scientific contributors) has recently grown (Wiggins and Crowston, 2014). Many citizen science projects occur within the domains of astronomy, ecology, and environmental science (e.g., Zooniverse family of projects, Savage, 2012; eBird project, Sullivan et al., 2009), but other disciplines also adopt public engagement, such as community science, defined as participatory community-centered social science research (Wandersman, 2003).
Citizen Data Harvest in Motion Everywhere (CHIME) is one effort to bring participatory community contribution to scientific research. The CHIME project aims to build a cyclist community-driven data archive from existing data resources as well as new participatory data that citizens, community scientists, and governments can use and reuse. One core component of this project, and the focus of this paper, is citizens' (cyclists') involvement in data collection, which enables data collection on a massive, unprecedented scale. The volunteer cyclists will create video data using wearable cameras (e.g., GoPro or VIRB Garmin) mounted on either their helmet or handlebar and will submit the data through the project's website. Because the key to success in a citizen or community science project is volunteer participation, and because the data are collected in a grassroots manner, CHIME presents several concerns regarding the curation of citizen-generated data.
Many citizen science projects involve issues regarding data curation, most commonly concerning data quality. Lagoze (2014) noted that if a task is repeatable (e.g., classifying crowd-sourced information), quality control might be a minor issue, as experts can review participants' work. However, if a task cannot be repeated (e.g., observing and reporting on birds), it is difficult to validate participants' contributions, which is a concern for CHIME. The project team also sought to identify other potential concerns arising from the nature of video data (e.g., data from human subjects, sensitive information, data including bystanders), such as privacy and ethics in curating the data.
To understand any potential curation issues that the project team might encounter, we decided to conduct a pilot study. We interviewed community cyclists who use wearable cameras to learn about their data creation and manipulation procedures and any concerns regarding personally identifiable information in the data. We also examined video footage generated from the cameras to understand the technical specifications and metadata requirements for curation of video data. In the following sections, we present an overview of CHIME as well as a detailed description of the context of our pilot study. We also present findings from our pilot study and the challenges we identified regarding the curation of citizen-generated video data.

The CHIME Project

Project Context: Big Picture
The CHIME project is an inter-departmental and inter-institutional collaborative project at Indiana University-Purdue University Indianapolis (IUPUI). The research team includes faculty from the School of Informatics and Computing (SoIC), the Department of Geography, the Transportation Active Safety Institute, the Department of Computer Science, and the Department of Tourism, Conventions, and Event Management. The motivation of this project is to build a community-driven mechanism to organize and make accessible data that can help individuals understand and solve social problems. For instance, in 2016, 30 pedestrians and cyclists were killed in the city of Indianapolis, with non-fatal accidents numbering in the hundreds.¹ Understanding citizens' experiences with city infrastructure through citizen-generated data can help to improve the safety, health, and design of the city.

doi:10.2218/ijdc.v12i2.510
The long-term goal of the project is to develop a means for community members and scientists to use community-driven data archives to create, distribute, preserve, and analyze information and, ultimately, to improve society through a shared understanding. The community data archive will include existing data sources: city and greenway maps, census data, crime and accident data, weather data, and social media data documenting the urban cycling experience. As this is a community science project, citizen-generated data, which are current, digital, typically published to the web, and maintained on personal devices, will be an important aspect of the data archive. Cyclists will contribute to the data archive by uploading video data recorded via wearable cameras installed on their bicycles. This contextually rich data will capture community members' experiences with the city infrastructure as well as environmental changes. The project team will develop algorithms to analyze the large-scale collective video data in order to, among other things, estimate distances between humans and cars, measure light patterns, and determine risky riding conditions. These algorithms, which generate secondary data, will transform raw data to extract patterns to be mapped spatially with the data sources contained within CHIME, a problem-solving mechanism. Analytic and access tools will be developed for both primary and secondary data sources. These tools will be designed to provide democratic access to shared data about shared spaces in real time and over time.

CHIME Curation Pilot Project
Proper data curation is the key to project success and sustainability (i.e., ensuring that the data are reusable in the long term), and curation must begin in the project's design phase to safeguard the data lifecycle and database design. However, the project team had limited knowledge regarding the characteristics of the data, particularly the data that participating cyclists will generate, collect, and share. In order to understand the nature and content of these data and to investigate potential curation-related issues, we conducted a pilot study during CHIME's design phase, before data from cyclists are collected and integrated into the CHIME data archive. The pilot study was constructed in two parts. The first part included interviews with the cyclists who produced data, and the second part included an investigation of the video data that the cyclists generated.
We defined our theoretical population broadly as U.S. cyclists in any environment (e.g., road, trail, mountain, urban cycling) and with any purpose (e.g., leisure riders, commuters, competitive riders) who filmed rides with wearable cameras and shared the footage on a social media platform or website. The initial geographic scope was loosely limited to Indianapolis, IN, as this was where the project team was located. Indianapolis is a city in which a notable cycling culture has emerged alongside considerable public and private investments in bicycle paths, bikeways, and other infrastructure (Indianapolis Cultural Trail²; Indy Parks, 2004; Simmons, 2014); cycling facilities (IUPUI Office of Sustainability, 2016; YMCA of Greater Indianapolis, n.d.); and a successful bike-share program (Indianapolis Pacers Bikeshare³; Touhy, 2015).

Yoon, Spotts and Copeland
Participants were identified and recruited through non-probability sampling techniques, including purposive, convenience, and snowball or chain-referral sampling. In order to identify active cyclists who utilized wearable cameras with their bicycles, we employed various strategies, including searching online for cycling videos published on several social media platforms (e.g., YouTube, Twitter, Instagram) and outreach to cycling organizations and community leaders.
A total of 13 interviews (12 phone interviews, one in-person interview) were conducted from September to December 2016. The semi-structured interviews addressed participants' experience with wearable cameras; knowledge of data creation, production, and management practices; and views on sharing and reusing their data, including potential concerns. Participants were encouraged to speak freely when they had more to share about a particular topic. The average interview lasted 30 minutes, although one interview lasted one hour. All interviews were recorded with Sony Audio Recorder, a smartphone audio recorder application for Android, and fully transcribed by a transcription vendor.
With permission from the participants, we also collected and examined video data (a total of 54 clips) that participants chose to share with us. Most participants provided one to three clips, but one participant uploaded 13. We emailed each participant a link to a private folder assigned to them on Box, an unlimited cloud storage platform provided by Indiana University. Each participant accessed their designated folder and uploaded their videos.
The interview data were analyzed using NVivo 11 for Mac, a qualitative data analysis software package. The data were analyzed using pre-developed codes, including five high-level codes (cycling behaviors, data characteristics, data practice, use/reuse of data, and sharing concerns) and a number of sub-codes. Video data were analyzed using Microsoft Excel and coded for technical specifications (e.g., file format, resolution, size), content, and embedded metadata.

Findings

Demographics and Characteristics of Participating Cyclists
The participants were overwhelmingly male (11 males, two females). Participants ranged in age from 20 to 66, with an average age of 40. Although we did not meet all of the participants in person, the majority were assumed to be white based on the available social media profile pictures and video evidence. Although our sample is not representative, our participants' demographics align with cyclists' general demographic characteristics; many surveys have reported that the majority of cyclists are men between 25 and 64 years old, and most recreational cyclists are white (Pucher, Buehler, and Seinen, 2011; Pucher and Renne, 2003).
All of the participants were experienced and fully dedicated cyclists. Many said they began riding during childhood and continued riding more consistently or intensely as an adult. Participants had ridden for four to 37 years during adulthood, with an average of ten years. Six participants reported riding daily or "nearly daily" year-round for fitness or transit, and the remaining cyclists rode between one and seven days a week, primarily for health or dedicated athletic training. Six participants were highly invested and dedicated cyclists who were training for competitive cycling events, including criteriums, cyclo-cross events, road races, velodrome races, messenger competitions, Ironman competitions, and mountain bike races. While most cyclists reported fitness as a motivation for riding, four rode purely for pleasure, recreation and fitness, or transit.
The amount of time participants had used a wearable camera was considerably less than the amount of time they had been cycling. Most participants reported that they had used wearable cameras for less than three years, except for one who had used one for more than five years. Among those with one to three years of experience, many used only one or two cameras without any major problem, although one participant reported that he had to replace the camera five times during the five years he had used a camera due to physical damage.

Motivations for Using Wearable Cameras
Participants reported various reasons for installing wearable cameras on their bikes, but the most common was security. Participant CI03 mentioned the need for "a safety mechanism" in case of an accident: "most incidents that involve a cyclist, for us versus a car, even if the car is in the wrong, the car is going to win, typically." Many others echoed this sentiment. For instance, CI07 had an accident and consequently decided to buy a camera. Three participants reported that they had been hit by cars one or more times. The cyclists appeared to be aware of the potential danger and wanted to ensure their safety while riding.
Several participants wore cameras to monitor their route or races. Daily commuters used wearable cameras to log their riding, and racers used them to evaluate different parts of the race. For instance, CI04 used a camera "to see how I did, to look at my data compared to what was happening." Many others used the camera for their own entertainment, including "to capture things that I saw during my ride" (CI01), "to share my experience with friends" (CI09) and convince friends of the benefits of cycling, and "because it's fun to film when riding around […] and making something for fun [from the recorded video]" (CI02).
Many participants also mentioned that they learned a lot from the videos, including information they did not expect to capture. The actual content captured (intentionally or unexpectedly) is described in a later section of this paper, Content Characteristics of Video Data.

Technical Characteristics of Video Data
The most common type of wearable camera used by the participants was the GoPro (9), followed by the VIRB Garmin (2), JVC Action Camera (1), and Polaroid Waterproof Sports Action Video Camera (1). Additionally, participants reported different versions of each type of device (e.g., GoPro Hero 3, Hero 4, Session, HD, Silver 3+). However, the slightly different specifications of each version (e.g., 48 frames per second, video/camera/still image recording modes) did not significantly affect what the camera recorded or what participants did with the recorded video, nor did differences among devices. For instance, the only reason CI09 preferred the JVC Action Camera was its external screen, which allowed him to monitor what he recorded. All of the participants were satisfied with the quality of video regardless of the device they used, as all cameras' "ISO settings are so high nowadays" (CI10). Many recorded video in either 1080p HD or 720p HD (to preserve space on memory cards) at 48 or 30 frames per second.
All videos were initially created as mp4 files and were maintained and used in this format by all participants. The size and length of each video were at the participants' discretion, but a video could not exceed four hours or, for some participants, two hours due to the camera's battery life, which depends on weather conditions and the age of the battery. The videos collected from the participants ranged in size from a few hundred megabytes to two gigabytes depending on the resolution, definition, and length. The participants said they had hundreds of these video files, which added up to between a few hundred megabytes and several terabytes in total. The participants shared only a very small portion of the vast amount of video stored on their computers or hard drives. The remaining video could be a valuable asset for the CHIME data archive.
Little information (metadata) was embedded in each video file. Only basic technical metadata was found, such as file type, size, date created, dimensions, and color profiles. Participants believed they could change the camera settings to improve the resolution of the video or to add more metadata, but they did not see the need to do so.
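To illustrate the kind of basic technical metadata described above, the following is a minimal sketch (not the project's tooling) that reads file-level properties using only the Python standard library. The function name `basic_file_metadata` and the dictionary keys are hypothetical. Dimensions and color profiles are stored inside the video container and would require a media-inspection tool, so they are deliberately left out here.

```python
import mimetypes
import os
from datetime import datetime, timezone


def basic_file_metadata(path):
    """Return file-level technical metadata without decoding the video itself."""
    info = os.stat(path)
    # The type is guessed from the file extension only, not from the contents.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "file_type": mime_type,
        "size_bytes": info.st_size,
        # Modification time is used because creation time is not portable across systems.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
```

A record of this kind covers only what the file system reports; anything describing the ride itself would still have to come from the cyclist.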

Content Characteristics of Video Data
Before the participants decided what or where to film, they determined where to mount the camera. Typically, cyclists mount cameras on either the head tube or handlebar, but some place them on their shoulders or on the back of their bike. CI09 argued that it is important to properly mount cameras, as the mount influences what cyclists try to capture and how they do so (e.g., a helmet mount captures the viewpoint of the cyclist, while a chest mount captures that of the bike). This may be an important consideration in future data collection for CHIME, depending on the reason why information is captured through video.
The examination of video and interview data revealed several categories of information:
• Safety-related information: Many participants reported that they captured video for safety reasons. The videos could record not only cyclists' own accidents but also random car accidents happening on roads or during the races, "even in different angles" (CI04). Road conditions were also captured, such as risky potholes, icy roads, and invalid road signs.
• Behavioral information: Participants also realized their video captured behavioral information about the people they run into on the roads. Drivers' carelessness towards cyclists was a common behavior; as CI03 noted, "drivers often turn into a bike lane, park in it, or honk for no particular reason."
• Racing/cycling techniques: The videos also captured information that can be used to train for cycling and racing, such as proper posture, the proper way to ride (e.g., "how riders are riding in a line close together in front to back" (CI06)), and different ways to ride.
• Landscape and/or infrastructure changes: Because landscapes are inevitably included in any video that cyclists generate, changes in landscapes were automatically captured as longitudinal data. CI09, a mountain biker, said that when he biked "the landscape and the colours of the rocks were awesome" and his videos were a good source of knowledge about "what is going on in remote areas." For city bikers (urban cyclists), this applied to any changes in city infrastructure and landscape, such as road construction, landmark signs, and city events. CI01 argued that this type of information could be used to design public spaces.

Participants' Data Editing and Manipulation Practices
We also asked about participants' general data practices, including their data editing and manipulation behaviors. Participants did little or no data editing for their own use and storage, usually only importing data from memory cards and transferring it to hard drives using the editing program designed for each type of camera (e.g., GoPro Suite). However, when sharing the video with others through social media, almost all participants did some level of manipulation before sharing, except for one participant, who just "upload[ed] directly to YouTube" (CI01). Participants reported the following reasons for editing videos:
• Trimming video that was not of particular interest: Participants said that the raw videos were usually too long to be shared with the public and that people will not watch cycling videos longer than five to ten minutes.
• Dropping the audio: Some participants did not see the value of audio because "the native audio on the video's usually pretty crappy […] and doesn't pick up much of anything" (CI06). Some replaced it with a song to make the video more artistic and enjoyable.
• Adjusting resolution: A few participants adjusted the original resolution (1080p or 1020p) to a lower one (720p) for sharing to decrease the upload time.
• GPS mapping: While not all devices support GPS, VIRB supports GPS overlay. Thus, one participant who used VIRB performed GPS mapping, combining three video files into a chronological hour-long race recording.
• Cutting out any identifying information: More than half the participants mentioned that they cut out "real obvious identifiers" (CI07). Many preferred not to record any identifying information, such as video "around my house [… because] if you really want, you can find out where I live" (CI02), but if they did record this information, they edited it out before sharing. CI01 was particularly concerned about children (even when captured in public spaces); if they were in any portion of the video, he edited them out. CI07 either blurred or edited out information about drivers, such as license plates or faces (if they were recognizable). However, not all participants were equally concerned about privacy. For instance, CI09 did not blur anybody's face in his videos unless asked to, as they were randomly captured.

Curation Challenges and Project Approach
In this section, we discuss several major challenges that we identified when curating cyclists' video data for the project. Our approaches to addressing those concerns are also presented.

Ensuring data quality for reuse and curation
As in many citizen science projects, the pilot study revealed that managing data quality might be an issue. While there are different ways to define and understand data quality, it is critical to address data quality from the perspective of both data reusers and data managers. 'Fitness for use,' one of the most common definitions of quality (e.g., Madnick et al., 2009; Wang and Strong, 1996), is an important consideration from both perspectives in the CHIME project. From our assessment of the video content, we learned that the videos contained much information that is valuable for the project even though cyclists did not intend to capture it, as described above. However, we also learned that there was a great degree of variation in terms of the quality of content (i.e., its value to the project), which depended on the cyclists' purpose for recording the video. For instance, redundancies in geographic representation, or near-identical routes or riding conditions, are likely to accumulate, as cyclists may commute or train along a limited set of routes. In addition, how the cyclists utilize and mount the cameras, and thus how information is captured, can influence the quality of content and sometimes lead to a failure to capture what cyclists intended.
The technical quality of video (e.g., low resolution, device errors, file format) is a less important issue for long-term preservation, according to the participating cyclists. However, little or no contextual information about the data (metadata) was embedded or documented by the cyclists. Documenting contextual information is critical for ensuring successful reuse and curation.

Collecting contextual information
While collecting contextual information is important, the creation of metadata for video data requires participants to volunteer additional information. Some technical metadata was embedded in video files and can be automatically extracted, but participating cyclists would have to fill out the remaining metadata fields. Many cyclists participating in this study were not in the habit of creating any descriptive metadata, recording context, or organizing files for efficient future identification and access, other than organizing video files by date or occasionally race name. When cyclists submit recent data, this may not be a problem, but if they would like to upload old data, concerns regarding accuracy arise.
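The split between automatically extracted fields and fields that only the cyclist can supply can be sketched as a simple record. This is an illustration under stated assumptions rather than the project's schema: the class `ClipRecord`, the helper `missing_fields`, and the three cyclist-supplied fields (`ride_date`, `location`, `camera_mount`) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipRecord:
    # Fields that could be filled automatically from the uploaded file.
    file_name: str
    size_bytes: int
    # Hypothetical contextual fields that only the contributing cyclist can supply.
    ride_date: Optional[str] = None
    location: Optional[str] = None
    camera_mount: Optional[str] = None


# Names of the fields a cyclist would be asked to complete at submission time.
CYCLIST_FIELDS = ("ride_date", "location", "camera_mount")


def missing_fields(record):
    """List the cyclist-supplied fields that are still empty, in a fixed order."""
    return [name for name in CYCLIST_FIELDS if not getattr(record, name)]
```

For a clip submitted with only a ride date, `missing_fields` would return `["location", "camera_mount"]`, which a submission form could use to prompt the cyclist while the ride is still fresh in memory.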

Dealing with potential privacy issue
Perhaps the most challenging aspect is dealing with potential issues regarding the privacy of data. The pilot study presented two key privacy considerations, namely the collection of sensitive location-based data (Shilton, 2009) and the accidental collection of data from secondary participants (passengers or cars), such as those depicted in videos (Henne, Szongott, and Smith, 2013). Participating cyclists have different levels of awareness of those privacy considerations. Many actively edited their data to protect their identity or that of passengers, while some did not publish any identifying data. Capturing public spaces as well as individuals in these spaces does not violate any current policy, and according to the Institutional Review Board, researchers are not responsible for ensuring privacy in videos. However, this may cause some concern or discomfort among the public, as a few participants noted, which does not align with the intention of the project. In addition, cyclists' routes can include both public and private spaces, which the cyclists may or may not realize. If they are unaware, it can be difficult to determine the boundaries between public and private spaces and edit the route accordingly.

Data storage and access guarantee beyond the project lifetime
Long-term data provision and preservation beyond the project period (two years) is an important consideration for sustainability. In the pilot study, one cyclist's data collected over a couple of years reached a terabyte in size. This is partially because some participants kept everything they had ever filmed over years of cycling, with no regard to significance of content, size, versioning, quality, or likelihood of future retrieval. Still, it is not difficult to imagine how large the CHIME data archive (which integrates other types of data) can become in the long term. The value of the data in a larger project risks being diluted by the contribution of large quantities of unassessed files from individuals, and the investment in additional server space and infrastructure could be significant. The project server will be hosted at SoIC during the project period, but the cost of data storage and maintenance beyond the project's lifetime will be a real challenge.
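A back-of-envelope calculation makes the scale concrete. The sketch below is illustrative only: the function `projected_archive_tb` is hypothetical, the rate of 0.5 terabytes per cyclist per year is an assumption derived from the single heavy recorder mentioned above (one terabyte over roughly two years), and the figure of 100 contributing cyclists is invented for the example rather than a project target.

```python
def projected_archive_tb(cyclists, tb_per_cyclist_per_year, years, retained_fraction=1.0):
    """Rough archive size in terabytes; every input is an assumption, not a measurement."""
    return cyclists * tb_per_cyclist_per_year * years * retained_fraction


# 100 hypothetical cyclists recording at the assumed upper-end rate for two years.
everything = projected_archive_tb(100, 0.5, 2)
# The same contributions if only a quarter of the footage were retained.
selected = projected_archive_tb(100, 0.5, 2, retained_fraction=0.25)
print(everything, selected)  # 100.0 25.0
```

Even under these rough assumptions, retaining everything yields an archive on the order of 100 terabytes, which suggests why the share of footage retained matters as much as the number of contributors.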

CHIME Approach
We developed several strategies to address the identified concerns and challenges. While these are not definite solutions to all concerns, we believe they will serve as a starting point as the project moves forward:
• Developing standards for equipment, instruments, data format, and metadata: Providing a standardized platform for data collection will contribute to data quality for reuse and curation.
• Minimal work requirement for participating cyclists: While data needs to meet the project's requirements, 'low barriers to entry' is a key principle in citizen science projects. For instance, we will make the data submission (or deposit) and metadata requirements minimal, with metadata automatically extracted from the file where possible. These minimal requirements may lead to questions about what is 'minimal enough' for preservation and reuse and may require geospatial metadata standards and preservation metadata standards to be integrated. Identifying metadata requirements for the project is a critical component and will be further investigated.
• Developing a training program, tutorial, or participation guide: Training is a good mechanism for controlling quality and ensuring participating cyclists' compliance with our privacy requirements. In particular, training before data collection is useful in this project context, as we can educate cyclists regarding the value of video data and its usage within the project context and beyond, proper use of equipment, how to contribute to the project, and rules to follow. Instructions on how to avoid capturing private spaces (e.g., by turning off the camera) will also be provided.
• Community watch: As in many other citizen science projects that include volunteer reviewers (Cooper, 2016), 'citizen archivists' could be apprised, as part of training and ongoing communication, of the types of data that are lacking in the collection: by geographic area, by infrastructure type, or by riding condition or type of interaction with other road users. This would help engender the creation of a more representative and comprehensive collection.
• Internal review of data: While the project can utilize volunteer reviewers or citizen archivists to validate some aspects of the data, our project is better served by internal reviews for privacy compliance as another safeguard.
• Community-oriented approach to privacy concerns: Our initial approach to privacy concerns was both policy oriented (e.g., training participants regarding privacy concerns) and technically oriented (e.g., internal review of data). We also learned that an additional community-oriented approach is necessary to address any concerns from community members and participating cyclists.
• Creating selection parameters: To improve the scope of content, and possibly the technical quality, one consideration would be to create selection parameters for contributions that would be retained for long-term storage. Selection parameters, or collection development guidelines, could be pre-production parameters, post-production parameters, or a combination of both.
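As one way to picture post-production selection parameters, the sketch below filters submitted clips by a minimum resolution and caps the number of clips retained per route, which targets the redundancy that arises when cyclists ride the same routes repeatedly. It is a hypothetical illustration, not a guideline the project has adopted: the `Clip` fields, the `route_id` label, and the default thresholds (720 pixels, two clips per route) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    route_id: str   # hypothetical label grouping clips recorded along the same route
    height_px: int  # vertical resolution, e.g. 720 or 1080


def select_for_retention(clips, min_height_px=720, max_per_route=2):
    """Keep clips that meet the resolution floor, up to a cap per route, in input order."""
    kept = []
    kept_per_route = {}
    for clip in clips:
        if clip.height_px < min_height_px:
            continue  # below the assumed technical-quality floor
        count = kept_per_route.get(clip.route_id, 0)
        if count >= max_per_route:
            continue  # route already represented often enough
        kept_per_route[clip.route_id] = count + 1
        kept.append(clip)
    return kept


submitted = [
    Clip("a", "commute", 1080),
    Clip("b", "commute", 480),
    Clip("c", "commute", 720),
    Clip("d", "commute", 1080),
    Clip("e", "trail", 720),
]
print([clip.clip_id for clip in select_for_retention(submitted)])  # ['a', 'c', 'e']
```

Keeping the first clips encountered is the simplest policy; a real guideline might instead prefer the most recent footage or footage covering different riding conditions, which would change the ordering logic but not the overall shape of the filter.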

Conclusion and Future Plan
To integrate these findings and address concerns regarding the design of our data curation method, we are in the process of developing a curation model that can be used in our project and beyond.This curation model will be a useful resource for other community science projects that need to implement appropriate curation procedures throughout the lifecycle of their data.