Locking the virtual filing cabinet: A researcher's guide to Internet data security

https://doi.org/10.1016/j.ijinfomgt.2012.01.005

Abstract

As the Internet has grown in popularity, the opportunity it provides for conducting research has become too large for researchers to ignore. Many have therefore put surveys, experiments, and other data collection measures online to gather empirical evidence in a variety of fields. While some choose a commercial provider to host surveys or experiments, others require the additional flexibility that comes with creating and maintaining a custom server. Herein lies a crucial problem: most researchers lack the skills necessary to design, implement, and manage a server end-to-end. To overcome this limitation, they often hire programmers and administrators who, while usually competent, are not ultimately accountable to the granting agencies funding the research or to the Institutional Review Boards overseeing it. The researcher therefore remains accountable for data security, confidentiality, and privacy. The goal of the present paper is to outline a typical server setup and to highlight issues pertaining to data security in language accessible to researchers. This paper also presents data from an anonymous survey, distributed to researchers, that asked about their management of research data. The responses confirmed the legitimacy of our concerns by demonstrating an evident need for stricter security measures in research settings. We hope that researchers who read this paper will become aware of the security practices they can adopt to avoid the severe consequences of data security breaches, and will gain a deeper understanding of the software they use to collect research data.

Highlights

► Researchers are often unaware of how data can be lost or stolen from Internet-connected servers.
► Many researchers do not have the time or ability to read complex technical manuals or documentation, and thus do not implement basic security practices (e.g., strong, non-shared passwords).
► Ignorance of data security that leads to a breach can have serious consequences, including removal of grant funding and IRB inquiries.

Introduction

The role of research within science has always been to provide empirical observations that drive both theory creation and subsequent investigation. Today, there are methods available to collect these observations that would have been just a dream ten years ago. For example, complex surveys and cognitive tasks can now be bundled together (informally called a ‘wave’) and administered electronically. Participants receive invitations through electronic mail or other media such as instant messaging or social networking websites. They complete the wave and are compensated electronically within 2–3 weeks. Participants often view this simply as a quick way to make a bit of extra money. To researchers, however, these data are considerably more valuable: what once took months to collect in person can now take hours or even minutes to collect, and what once may have taken a week of full-time work to code and enter into a spreadsheet (before analyses even began) can now be done concurrently with data collection. Truly, this use of Internet and computer technology to conduct research is revolutionizing social science, opening up access to more diverse samples than were typically used in the past.

However, with this new method of data collection and storage come new concerns regarding the security of those data. In the past, a principal investigator (PI) had very clear guidelines on the storage of data. Data, especially if they contained personally identifiable information, were kept literally under lock and key: data collection closets with locked filing cabinets were a common sight. Today, however, the need for physical security is dwarfed by the need for computer data security. Unfortunately, many researchers seem unclear on how exactly to achieve that security.

In this paper, we hope to achieve the following goals: (1) to demonstrate the need for stronger data security measures, drawing on recent events covered in the media and on our analyses of an anonymous online survey we conducted among a group of researchers; (2) to provide a detailed step-by-step overview of online behavioral research design and execution, from the initial planning stages of the study, through the collection of data, to the preservation of the finished analyses; and (3) to explain how security can be ensured at every stage of the scientific endeavor. After reading our paper, researchers should feel better equipped to conduct Internet-based research, placing their labs and themselves on the exciting new frontier of social science.

Section snippets

Evidence from recent events

Much recent media attention has been paid to the loss of confidential and sensitive information by various companies, often resulting in negative reputational and financial consequences for the businesses, not to mention the potential identity theft suffered by the individuals that had entrusted these firms with their information. For example, health insurance company Blue Cross Blue Shield recently revealed that it lost identifying information, including tax identification and Social Security

Overview: how are online experiments conducted?

To understand methods of securing data, one must first understand how data are collected. While there are a variety of software packages available to collect data online, the authors, intending this overview to serve as an introduction to researchers who may be relatively new to Internet-based research, will focus on general terms and the most common packages in use by social science researchers today.

In general, online research is conducted on a central computer system referred to as a server.

Building an online experiment

Using the tools listed above, a typical online experiment in the AMP model is designed in the following way:

  1. A programmer creates a series of files that will be served by Apache. This may include third-party software packages or be completely self-coded. The programmer also uses PHP to write responses from these files to the MySQL database. She then tests the software extensively on her “development” server (which may simply be her laptop running XAMPP).

  2. Once satisfied with the product, the
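The step in which responses are written to the database is a common place for security problems to enter. As a minimal sketch only, and not the authors' own code, the example below shows the general idea of a parameterized query, which keeps participant input separate from the SQL statement. It is written in Python with the built-in sqlite3 module standing in for the PHP and MySQL combination described above; the table name `responses`, its columns, and the function `save_response` are hypothetical names chosen for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical responses table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses (participant_id TEXT, item TEXT, answer TEXT)"
    )
    return conn


def save_response(conn, participant_id, item, answer):
    """Store one survey answer using a parameterized query."""
    # The ? placeholders pass values separately from the SQL text, so
    # whatever a participant types is stored as data and never run as SQL.
    conn.execute(
        "INSERT INTO responses (participant_id, item, answer) VALUES (?, ?, ?)",
        (participant_id, item, answer),
    )
    conn.commit()


conn = make_demo_db()
# A hostile-looking answer is stored verbatim instead of being executed.
hostile_answer = "x'); DROP TABLE responses; --"
save_response(conn, "p001", "q1", hostile_answer)
rows = conn.execute(
    "SELECT participant_id, item, answer FROM responses"
).fetchall()
```

The same principle applies in PHP through prepared statements; the point of the sketch is only that participant input should never be pasted directly into the text of a query.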

From a programmer's perspective: security during experiment creation

Analysis of our data indicated that many of the researchers we surveyed sought additional help on technical matters by employing programmers or by allotting some of the database control and access to departmental IT staff members (Table 1, Table 2). Thus, while discussing the experiment creation process in more technical detail, we will also talk about the recommended responsibilities of, as well as advice specific to, programmers working on executing a social science research

Security during testing

At the Center for Decision Sciences located within the Columbia Business School, the term ‘wave’ is informally used to refer to a variety of tasks (sometimes from different researchers) that have been chained together to be presented to participants in particular sequences. The wave typically includes several surveys, cognitive assessments, and decision-making games. Before a full wave is run after inviting many members of our online pool to participate, several smaller waves are typically done

Security during subject participation

While the hard work of creating and testing a script may be over once the invitations go out to participants, it is more important than ever to monitor the security of a server while multiple external users are accessing it. By broadcasting your server's address to others, you not only tell them that something lives there, you indicate it is something important.
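One simple way to monitor a server during participation is to review the web server's access log for clients that repeatedly request pages that fail. The following is a rough sketch under stated assumptions rather than a complete monitoring tool: it assumes log lines in the Common Log Format that Apache can write, and the function name `failed_requests_by_ip`, the threshold of three, and the sample lines (which use reserved documentation IP addresses) are all hypothetical.

```python
import re
from collections import Counter

# Assumed layout: host ident user [timestamp] "request" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')


def failed_requests_by_ip(lines, threshold=3):
    """Return {ip: count} for clients with at least `threshold` 4xx responses."""
    failures = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ip, status = match.group(1), match.group(2)
        if status.startswith("4"):
            failures[ip] += 1
    return {ip: count for ip, count in failures.items() if count >= threshold}


# Made-up sample lines for illustration only.
sample = [
    '203.0.113.9 - - [07/Dec/2009:10:00:01 -0500] "GET /admin.php HTTP/1.1" 404 512',
    '203.0.113.9 - - [07/Dec/2009:10:00:02 -0500] "GET /phpmyadmin/ HTTP/1.1" 404 512',
    '203.0.113.9 - - [07/Dec/2009:10:00:03 -0500] "POST /login.php HTTP/1.1" 401 128',
    '198.51.100.4 - - [07/Dec/2009:10:01:00 -0500] "GET /survey.php HTTP/1.1" 200 2048',
]
suspicious = failed_requests_by_ip(sample)
```

A client that probes for administrative pages it was never invited to see will tend to accumulate such failed requests, whereas an ordinary participant following the invitation link usually will not.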

To give a real-world example that occurs frequently, let us consider compensated surveys. Individuals are generally allowed to

Security during data analyses

Once data collection has been completed and the experiment has been taken offline (either by removing the files, turning off the web server, or redirecting people to a “Sorry, this survey has been completed” page), next comes the task of analyzing and reporting the data. During this step, security is much more focused on how the data you have collected is stored and maintained, and how archive copies are kept. Below we will cover database security in MySQL, and considerations for data security
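As one illustration of caring for archive copies, a researcher can confirm that a backup is byte-for-byte identical to the original export by comparing cryptographic checksums. This is a sketch of one possible check, not a procedure taken from the paper; the helper names `sha256_of_file` and `copies_match` and the file names in the demonstration are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for large data exports.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path, archive_path):
    """True when the archive copy has the same contents as the original."""
    return sha256_of_file(original_path) == sha256_of_file(archive_path)


# Demonstration with temporary files standing in for a data export.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "wave1_export.csv")
    good_copy = os.path.join(folder, "wave1_archive.csv")
    bad_copy = os.path.join(folder, "wave1_damaged.csv")
    for path, text in [
        (original, "id,answer\n1,4\n"),
        (good_copy, "id,answer\n1,4\n"),
        (bad_copy, "id,answer\n1,5\n"),
    ]:
        with open(path, "w", newline="") as handle:
            handle.write(text)
    intact = copies_match(original, good_copy)
    damaged = copies_match(original, bad_copy)
```

A matching checksum shows only that the copy is unaltered; it does not by itself protect the archive from unauthorized access, which still depends on where and how the copy is stored.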

Other general security advice

While many topics have been covered above as part of the normal operations of an online laboratory, a few items still warrant attention from security-minded researchers. The following security concerns are those that, though they may not come up during the study itself, are important for maintaining a secure environment for and ensuring the overall integrity of your research.

Summary

In this modern age, many social science researchers have become interested in Internet-based research, and some may use it as their primary means of data collection, as confirmed by analysis of our survey responses. Adopting this new method of data collection brings with it the need to learn and adopt security measures appropriate to each stage of implementation. In fact, it is the authors’ belief that a study is only as successful as its security measures.

To that

Acknowledgements

We would like to thank the attendees of the 2010 Annual Meeting of the Society for Judgment and Decision Making, the members of the Center for Decision Sciences, and the Preferences as Memories Lab Group for their feedback. We would also like to acknowledge the input of Janak Parekh, Jason Dunn, Vincent Ferrari, Eric J. Johnson, and Elke U. Weber. Finally we are indebted to the researchers who responded to the survey included in this paper, our reviewers for their helpful and informative

Jonathan Westfall is the Associate Director for Research & Technology of the Center for Decision Sciences at Columbia Business School.

Cindy Kim is the Research Coordinator of the Center for Decision Sciences.

Annie Ma is a research affiliate of the Center for Decision Sciences. She is a strategist at Google, where she develops enterprise productivity solutions.
