Sharing selves: Developing an ethical framework for curating social media data

Open sharing of social media data raises new ethical questions that researchers, repositories, and data curators must confront, with little existing guidance available. In this paper, the authors draw upon their experiences in their multiple roles as data curators, academic librarians, and researchers to propose the STEP framework for curating and sharing social media data. The framework is intended to be used by data curators facilitating open publication of social media data. Two case studies from the Dryad Digital Repository serve to demonstrate implementation of the STEP framework. The STEP framework can serve as one important "step" along the path to achieving safe, ethical, and reproducible social media research practice.


Introduction and Background
In a networked society, and especially in the online communities facilitated by social media, human thoughts and activities take the form of data that can be scraped, downloaded, aggregated, and otherwise collected on a massive scale. Academic researchers have identified this data as a potential source of insight into human behavior, and social media data is increasingly being used for scholarly inquiry (Kietzmann, Silvestre, McCarthy, & Pitt, 2012; Zimmer & Proferes, 2014; Ngai, Tao, & Moon, 2015). At the same time, funding agencies and academic journals are implementing data sharing policies (NSF, 2011; PLOS, 2014; Bill & Melinda Gates Foundation, 2015), and the scientific community is embracing data sharing as a strategy to promote research reproducibility (Collins & Tabak, 2014; Ioannidis, 2005).
Research with social media data doesn't neatly fit into the traditional definition of human subject data outlined decades ago by the Belmont Report (1979) and the Common Rule (1991) (Metcalf & Crawford, 2016; Shilton & Sayles, 2016). When users post to social media, they create data that can be mined by researchers using computational methods rather than more conventional social science research methods like interviews, surveys, ethnographic observation, or close reading of texts (Bruns, 2013). While social media data is often publicly available, social media users may not understand that their posts are being collected and used for research purposes. Moreover, social media users may not intend for their posts to reach beyond their online community.
Some researchers have experienced negative reactions when publishing social media data without proper protections for subjects. The "Tastes, Ties, and Time" dataset (Lewis et al., 2008), composed of Facebook user data and published on Harvard's Dataverse, was ultimately taken down due to privacy concerns (Zimmer, 2010). In 2016, when an Aarhus University graduate student scraped the online dating website OkCupid and released the data using the Open Science Framework, the public response was swift and critical (Markham, 2016); the dataset was subsequently taken down. To avoid such backlash and to protect human subjects, the data curation community needs better documentation and guidelines surrounding what Anatoliy Gruzd calls "social media data stewardship" (2016).
The Society of American Archivists (2016) and the Council on Library and Information Resources (Besek, 2003) have both released resources to guide ethical practice for digital archives in general, and the Social Media Archiving Toolkit from North Carolina State University provides ethical and legal guidelines for social media archives in particular (2014). Mannheimer, Young, and Rossmann (2016) propose an ethical framework for researchers using social media data, structured around three points: (1) context, including social media platform and disciplinary norms in the researchers' fields; (2) expectation of social media users; and (3) a value analysis that weighs the benefits of the research against the potential privacy risks to users. Weller and Kinder-Kurlanda's (2016) framework for sharing social media data is an excellent resource aimed at social media researchers. However, the literature does not yet include ethical guidelines tailored specifically to data curators.

IDCC17 | Practice Paper
The open data movement operates under the belief that open data is a common good, and data sharing is becoming more widespread, encouraged in large part by funding agency and journal policies. However, there remains a lack of clarity about human subject privacy for data that lies outside the traditional realm of Institutional Review Boards. In particular, sharing social media data presents unique challenges regarding sensitive topics, transparency of documentation, user privacy expectations, and social media platform policies. This paper introduces the STEP (Sensitivity, Transparency, Expectation of privacy, Platform) Framework, designed to help data curators in open access repositories operate within these gray areas, balancing the benefits of open data with the potential risks to social media users. Two case studies from the Dryad Digital Repository serve to demonstrate implementation of the STEP framework.

Social Media Data in the Dryad Digital Repository
The Dryad Digital Repository is a useful point of reference for exploring the ethics of data sharing. Dryad is a general purpose repository that provides unrestricted access to data. Dryad content includes openly published datasets associated with social media research, including data collected from Twitter, Facebook, Instagram, YouTube and Flickr. Dryad submitters are responsible for aligning the content of their data publications with Dryad's policies (Dryad TOS, 2016), which state that "human subject data must be properly anonymized and prepared under applicable legal and ethical guidelines" (Dryad FAQ, 2016). In addition, Dryad's curation team reviews datasets prior to publication and assists researchers in achieving a level of subject anonymity that can be considered "safe." An increasing number and diversity of submissions of this type have highlighted the need for a framework to help structure curator inquiry around ethical publishing of social media data.

Guiding Principles and STEP Framework
The STEP Framework helps guide curators through ethical inquiry when assessing social media data for the purpose of open archiving. While some repositories (e.g. ICPSR, Qualitative Data Repository, and UK Data Service) can provide restricted access for sensitive data, this framework focuses on curating fully open access data. The STEP Framework aims to help curators think through ethical challenges regarding social media data, with the ultimate goal of encouraging open data sharing for social media researchers.

Guiding Principles
The framework operates under three high-level principles:

• Value analysis. When sharing social media data, researchers and data curators must measure the benefits of sharing data against the potential risks to human subjects.
• Responsibility. Data curators can help educate researchers about ethical data sharing, but researchers themselves are ultimately responsible for the data they share.
• Continual inquiry. Ethical practice requires ongoing dialogue and examination.

Principle 1: Value analysis
Open data and user privacy are both ethical imperatives. But data sharing and research reproducibility may stand at odds with ethical and legal concerns regarding social media data. In many cases, as more privacy measures are implemented, social media data becomes less fit for confirming reproducibility (Weller & Kinder-Kurlanda, 2016). As the UKAN anonymisation decision making framework suggests, "zero risk is not a realistic possibility if you are to produce useful data" (Elliot, Mackey, O'Hara, & Tudor, 2016, p. 5). When sharing social media data, researchers and curators must therefore strike a balance between data openness and user privacy.

Principle 2: Responsibility
Data curation and data review are key quality-control elements in the data publication process. Curators should err on the side of caution when curating social media data and other ethically complex data. Even if the research was conducted ethically, curators can't assume that researchers or IRBs have considered the ethical implications specific to sharing data. When necessary, curators should contact researchers to request better data documentation or de-identification of variables. However, data curators cannot be expected to be ethics experts. Mistakes will happen, and even good faith efforts can fall short (Zimmer, 2010). While data curators and data repositories have a role to play in educating researchers and promoting ethical data publishing, it is ultimately the responsibility of researchers to ensure that their data is shared ethically.

Principle 3: Continual inquiry
The Society of American Archivists' Code of Ethics encourages archivists to "consult with colleagues, relevant professionals, and communities of interest to ensure that diverse perspectives inform their actions and decisions" (2012). Data curators, as archivists of research data, will benefit from similar practice. Data curators should consult within their curation team and discuss details with researchers who submit data. Data curators may also benefit from consulting with data curators at other repositories and reaching out to professionals in data- and ethics-related fields. Policies and practices surrounding social media data are very much in flux, and will likely remain so. Continually discussing and reevaluating ethical standards will help curators stay up-to-date with ethical norms.

The STEP Framework for Data Curators
The STEP framework is structured around four key areas of inquiry for data curators: Sensitivity, Transparency, Expectation of privacy, and Platform (STEP) (see Figure 1). This framework is not meant to provide hard and fast rules, but rather aims to improve practice and manage risk for data repositories, researchers, and social media users.

Sensitivity
Social media data relating to sensitive topics or collected from vulnerable populations requires that data curators examine the data with a particular focus on potential risks to users.

Sensitive topics
Sensitive topics require increased vigilance regarding privacy and anonymity. Lee and Renzetti suggest four areas in which research is likely to be threatening to subjects: (1) when research intrudes into the private sphere or delves into some deeply personal experience; (2) when the study is concerned with deviance or social control; (3) when the study impinges on the vested interests of powerful persons or the exercise of coercion or domination; (4) when the research deals with things that are sacred to those being studied that they do not wish profaned (1993).

Vulnerable populations
Research data collected from vulnerable populations who are susceptible to exploitation should also be considered sensitive (Belmont Report, 1979; World Medical Association, 2008). Mechanic and Tanner suggest that subject vulnerability can result from "developmental problems, personal incapacities, disadvantaged social status, inadequacy of interpersonal networks and supports, degraded neighborhoods and environments, and the complex interactions of these factors over the life course" (2007). Vulnerable populations have less power in the research process and less power over what happens to their data. Researchers and data curators therefore take on more responsibility regarding data privacy (Elliot, Mackey, O'Hara, & Tudor, 2016). When dealing with social media data, this aspect arises most often with regard to minors, who tend to be active users of social networking sites and have different privacy expectations than adults (boyd, 2014).

Transparency
Transparent data documentation facilitates ethical data sharing and ethical data reuse. For researchers, transparency includes clearly documenting the data collection methodology, anonymization processes, and ethical considerations, as well as providing ReadMe files or codebooks that help others understand the data being shared. Curators should encourage researchers to include documentation as part of their data publication. When researchers are transparent about their process, they support a culture of openness, facilitate data reuse, and help educate other researchers about methods for ethical data sharing. Further, Rivers and Lewis suggest that transparency regarding social media research can help foster 'privacy literacy' so that the users can make informed decisions about participating (2014). Ideally, curators should also clearly document their own decisions and activities over the course of reviewing and publishing the data.

Expectation of privacy
In the context of online social networks, the public and the private become intertwined (Zimmer, 2010; Rivers & Lewis, 2014). While social media posts are available in public forums, social media users may not expect that their posts are being seen beyond their perceived online community (boyd, 2014). As Zimmer writes, "just because personal information is made available in some fashion on a social network, does not mean it is fair game for capture and release to all" (2010, p. 323).
Each social media platform functions differently with regard to privacy. Some social media platforms, such as Facebook, are "closed networks," with customizable privacy settings. Other platforms, such as Twitter, are publicly visible by default. (Twitter users may opt to protect their accounts, limiting access to a select group of followers, but few do so.) Many social media sites support hashtags, which reach a broader audience, and @-mentions, which address specific users. Some social media sites allow pseudonyms, while others require real names. Some social media platforms provide easy access to user data, which encourages data collection and research. All of these platform-specific usage norms can affect a user's expectation of privacy.
Regardless of platform, user expectations are key to determining the sensitivity of the data: lower user expectation of privacy makes the data less sensitive. Politicians, celebrities, or organizations likely expect that their social media posts will be read by a wide audience. Private citizens, on the other hand, may not expect that their posts will be viewed by audiences beyond their immediate social network. For example, when Freelon, McIlwain, and Clark collected Twitter data documenting the Black Lives Matter movement, they attempted to honor users' expectations of privacy by publishing only Tweets that had been widely shared, from Twitter users with a large number of followers (2016). While strategies like these are helpful, there will always be ambiguity in determining user intention, and user expectations may change over time. The most unambiguous method for aligning research with user expectations is to obtain informed consent.
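The strategy attributed above to Freelon, McIlwain, and Clark (publishing only widely shared Tweets from accounts with many followers) can be illustrated with a short Python sketch. This is not the authors' actual procedure: the `Tweet` record, the field names, and both threshold values are hypothetical, chosen only to show the shape of such a filter.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the source does not state the values used by
# Freelon, McIlwain, and Clark.
MIN_RETWEETS = 100
MIN_FOLLOWERS = 10_000


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record layout for illustration only.
    tweet_id: str
    retweet_count: int
    author_followers: int


def is_widely_shared(tweet, min_retweets=MIN_RETWEETS, min_followers=MIN_FOLLOWERS):
    """Return True when a tweet clears both visibility thresholds."""
    return (
        tweet.retweet_count >= min_retweets
        and tweet.author_followers >= min_followers
    )


def select_publishable(tweets, **thresholds):
    """Keep only tweets whose authors plausibly expected a wide audience."""
    return [t for t in tweets if is_widely_shared(t, **thresholds)]


sample = [
    Tweet("1", retweet_count=2500, author_followers=50_000),
    Tweet("2", retweet_count=3, author_followers=120),
    Tweet("3", retweet_count=400, author_followers=900),
]
publishable = select_publishable(sample)
print([t.tweet_id for t in publishable])  # prints ['1']
```

Where the thresholds should sit is itself a value judgment of the kind described in Principle 1, and a filter like this reduces, but does not remove, ambiguity about user intention.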

Informed consent
Curators should consider whether and how consent was obtained for the research before archiving social media data. The literature is split regarding the level of consent necessary for social media research. Rivers and Lewis (2014) assert that informed consent must be granted by each social media user whose posts are used for research purposes, suggesting that researchers "avoid qualitatively analyzing [social media] communications as if they are offered for research consumption without consent, because it does not align with the context in which the tweets were created." Elliot, Mackey, O'Hara, and Tudor (2016) are more lenient, writing that, "given the current state of the information society, [obtaining informed consent] is both impractical and undesirable" (p. 63); they suggest that a lack of informed consent for data-driven research does not necessarily preclude sharing, but merely makes data more sensitive. Hutton and Henderson (2015) suggest a model that applies Nissenbaum's theory of contextual integrity, which states that people have "a right to live in a world in which [their] expectations about the flow of personal information are, for the most part, met" (2009). In their study of Facebook users, Hutton and Henderson used pop-up messages to evaluate participants' willingness to share certain types of data, thus tailoring informed consent to each user's expectations of privacy on Facebook (2015). The conversation surrounding informed consent will likely continue to evolve; data curators should stay abreast of the latest developments to inform dataset review.

Anonymization
Most open data repositories require that data be de-identified prior to submission (Dryad FAQ, 2016 ;ICPSR, 2012

from UKAN (Elliot, Mackey, O'Hara, & Tudor, 2016) provides detailed guidance that, although targeted at researchers, can also be helpful to data curators as they review data for publication. Social media data can be very difficult to anonymize (Zimmer, 2010). However, anonymization may not be strictly necessary with social media data, depending on social media users' expectation of privacy. As noted in Principle 1: Value Analysis, curators should consult with data submitters to weigh the benefits of publishing the data against the risk that the data could be re-identified. And as noted in Principle 2: Responsibility, while curators review social media data to the best of their knowledge for de-identification issues, the ultimate responsibility falls on the data submitter.
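One common de-identification step that a curator might ask a submitter to document is replacing account identifiers with stable pseudonyms. The Python sketch below uses a keyed hash (HMAC-SHA256) from the standard library. It is an illustration only and is not a method described by UKAN, Dryad, or any dataset discussed in this paper; the function names and sample rows are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_key():
    """Generate a random secret key; it must be kept out of the published package."""
    return secrets.token_bytes(32)


def pseudonymize(user_id, key, length=16):
    """Map a user ID to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


key = make_key()
# Hypothetical rows of (user, hashtag) for illustration only.
rows = [("alice", "#occupy"), ("bob", "#ows"), ("alice", "#ows")]
published = [(pseudonymize(user, key), tag) for user, tag in rows]
```

The same user always receives the same pseudonym, so network structure is preserved, while the mapping cannot be recomputed without the key. Pseudonymizing identifiers does not by itself make a dataset safe: other variables, such as post text or exact timestamps, may still allow re-identification.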

Platform
Social media data is hosted by social media sites, each of which has unique privacy policies, terms of service, and developer agreements (Thomson, 2016). Some social media platforms' terms of service limit what content can be published. For example, Twitter's Developer Agreement and Policy states that developers who use their API "will only distribute or allow download of Tweet IDs and/or User IDs" (Twitter, 2016). Some researchers (Summers, 2014; Freelon, McIlwain, & Clark, 2016) have published only Tweet IDs, not only to align with Twitter's policy, but also as a strategy to honor the intent of Twitter users. Since Twitter allows users to adjust their privacy settings at any time, users may delete posts or adjust privacy settings to limit the accessibility of posts. Published Twitter data should reflect the privacy choices of Twitter users.
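The practice of publishing only Tweet IDs can be sketched in a few lines of Python. The column names (`tweet_id`, `user`, `text`) and the sample rows are assumptions made for illustration; a real collection will have its own layout, and this sketch is not the tooling used by the researchers cited above.

```python
import csv
import io


def extract_tweet_ids(source, id_column="tweet_id"):
    """Read CSV rows from a file-like object and return unique tweet IDs in order."""
    seen = set()
    ids = []
    for row in csv.DictReader(source):
        tweet_id = row[id_column].strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def write_id_list(ids, target):
    """Write one tweet ID per line, with no text, usernames, or other metadata."""
    for tweet_id in ids:
        target.write(tweet_id + "\n")


# Hypothetical collected data, held in memory for the example.
raw = io.StringIO(
    "tweet_id,user,text\n"
    "1001,alice,first post\n"
    "1002,bob,second post\n"
    "1001,alice,first post\n"
)
tweet_ids = extract_tweet_ids(raw)
out = io.StringIO()
write_id_list(tweet_ids, out)
print(out.getvalue())
```

Anyone reusing an ID-only file has to retrieve the Tweets again from the platform. Tweets that have since been deleted or protected would presumably not be returned, which is how this approach can keep a published dataset aligned with users' later privacy choices.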
Each social media platform's policies include different rules for data sharing. The Social Media Data Stewardship project's Application Programming Interface and Terms of Service wiki provides an overview of social media platform policies, and may be a helpful resource for curators. The wiki claims to provide "a one-stop-shop for finding relevant information about what you can and can't do with the social media data" (Social Media Data Stewardship, 2017). However, terms and policies change over time, and some academic researchers choose to sidestep platform policies if they consider the benefit of their research to be worth the risk of violating terms of service (Kelley, Sleeper, & Cranshaw, 2013). Weller and Kinder-Kurlanda (2016) suggest establishing a dialogue between social media companies, researchers, and data repositories, ultimately aiming to "establish feasible interpretations [of terms of service] that allow researchers to at least share data for the sake of quality control and reproducibility" (p. 169).
Ultimately, curators should aim to be aware of platform policies, but should take into consideration Principle 2 (Responsibility): researchers are ultimately responsible for the data they collect and publish.

Case Studies
Two case studies from Dryad, both of which deal with Twitter data, provide examples of using the STEP framework to review research data for publication.

Case Study 1 - Data from: The topology of a discussion: the #occupy case
Gargiulo, Bindi, and Apolloni (2015a) used Twitter hashtag data to study the evolution of political discussion during and after the Occupy Wall Street movement. The associated Dryad data package (Gargiulo, Bindi, & Apolloni, 2015b) includes one .csv file containing three variables: "user," "hashtag," and "time."
• Sensitivity: The research deals with active and public participation in a social movement and does not focus on a particular population.
• Transparency: No documentation is provided with the data package. Some information about data collection is included in the associated article, but details are insufficient and the content of the .csv file is not adequately explained. The method of analysis is laid out in detail in the article, which would hypothetically allow others to reproduce the results.
• Expectation of privacy: The use of hashtags on Twitter generally indicates one's desire to participate in a larger conversation and/or be identified with a concept or cause. The sample size is large (more than 37,000 users), and the risk of contributors being identified from the contents of the data file is low.
• Platform policy compliance: The .csv file contains a user ID which is described in the article as being "anonymous," but there is no explanation of how this was derived. The file also contains the actual hashtags used, and Twitter policies are unclear on whether this information can be distributed to third parties.
Conclusion: Given the low sensitivity of the research, the public nature of the discussion (and the platform) and the fact that an attempt was made to anonymize the data, the Dryad team concurred that this data package could be safely shared. Better documentation would have made the methods more reproducible and the data more useful; however, the article provides a good model for similar network analyses of social movements. It is unclear whether the publication of this data strictly follows Twitter's policies, but any non-compliance is the responsibility of the authors.

Case Study 2 - Data from: In the mood: The dynamics of collective sentiments on Twitter
This Dryad data package (Charlton, Singleton, & Greetham, 2016b) and its associated article (Charlton, Singleton, & Greetham, 2016a) present a study of the relationship between UK Twitter users' "sentiment levels" and the network structure created by @-mentions. Based on statistical analysis of Twitter data, the researchers selected 18 "communities" to monitor and used these to formulate a model for "reproducing measures of emotive response." The data package contains several dynamic mention networks split over 6 tables; variables include an anonymised tweet ID, anonymised user IDs, and timestamps of tweets.

• Sensitivity: Topics being discussed by the selected communities are wide-ranging, from "friends chatting" and "dogs" to "Islam versus atheism," "Gamergate" and "smoking/e-cigarettes."
• Transparency: The authors provided a ReadMe with the data package that explains the content of each file, and a section of the article describes in detail how the data were obtained.
• Expectation of privacy: The authors assert that tweets with @-mentions are public and may be read and commented on by any other user. They also argue that ethical approval was unnecessary for their research because "the human data … analysed is in the public domain." However, the use of @-mentioning indicates communications intended for specific people, and implies an expectation of discussion within the user's specific network. While the tweet IDs and user IDs provided in the data package were anonymized, exact timestamps present a potential (though low) risk for re-identification.
• Platform policy compliance: Twitter policies are unclear on whether timestamps may be distributed to third parties.
Conclusion: This case is an interesting one in terms of user expectations when engaged in what some might consider a "private" conversation on a public platform. Some of the topics being discussed are sensitive, and many of those participating probably did not consider that their comments would be 1) broadcast to an audience beyond their immediate network and 2) collected and analyzed by researchers. Taking a hard line on informed consent, this study would likely not pass muster. However, given the fact that IDs were anonymized and that the research was presented in a transparent and reproducible way, the benefits of data publication were deemed greater than the risks.
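The re-identification risk from exact timestamps noted in this case could, in principle, be lowered by coarsening timestamps before publication. The Python sketch below shows one way to do that. It is a hypothetical mitigation offered for illustration, not a step that the dataset's authors or the Dryad team took, and the function names and sample value are assumptions.

```python
from datetime import datetime


def coarsen_to_hour(timestamp):
    """Drop minutes, seconds, and microseconds from a timestamp."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


def coarsen_to_day(timestamp):
    """Keep only the calendar date, returned as an ISO 8601 string."""
    return timestamp.date().isoformat()


# Hypothetical tweet timestamp for illustration only.
exact = datetime(2015, 3, 14, 9, 26, 53)
print(coarsen_to_hour(exact).isoformat())  # prints 2015-03-14T09:00:00
print(coarsen_to_day(exact))               # prints 2015-03-14
```

Coarser timestamps make it harder to match a published record to a specific public post, but they also reduce the fidelity of a dynamic mention network. Choosing a granularity is therefore another instance of the value analysis described in Principle 1.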

Future Work
The STEP framework should evolve over time. Future work could expand upon the current discussion of the theory and concepts involved in evaluating social media data for publication. In addition, the STEP framework would be strengthened by additional case studies examining social media data from a wide variety of repositories and social media platforms; additional case studies will help demonstrate expanded applicability for the framework. The authors also see a need for additional guidance that can complement the STEP framework's focus on social media data. The data sharing community will benefit from expanded frameworks that apply to general big data research, including social science data journalism.

Conclusion
Sharing social media data helps ensure research reproducibility, advances science, and encourages research efficiency (Weller & Kinder-Kurlanda, 2016). Data sharing also facilitates equity of data access, narrowing the divide between the "big data rich" and the "big data poor" (boyd & Crawford, 2012; Metzler, Kim, Allum, & Denman, 2016). The STEP framework encourages open data for the public good by providing curators with guidelines for assessing data submissions according to Sensitivity, Transparency, Expectation of privacy, and Platform. Curators using the framework are encouraged to think critically and carefully when reviewing social media data for publication, taking into consideration the three guiding principles of the framework: Value Analysis, Responsibility, and Continual Inquiry. Curators must continue to stay informed about social media research practice, and should keep an active dialogue with researchers, other data curators, archivists and librarians, and ethicists. Due to the quickly evolving nature of the field, the authors envision the STEP framework as just one important "step" along the path to achieving safe, ethical, and reproducible social media research practice.