BIOMEDICAL DATA SHARING, SECURITY AND STANDARDS

The National Institutes of Health (NIH) implemented a policy on data sharing in 2003. The policy reaffirmed the principle that data should be made as widely and freely available as possible while safeguarding the privacy of research participants and protecting confidential and proprietary data. Restricted availability of unique resources upon which further studies are dependent can impede the advancement of research and the delivery of medical care. Therefore, research data supported with NIH funds should be made readily available for research purposes to qualified individuals within the scientific community. One approach to sharing data is to establish a network of databases. However, there are a number of barriers to creating successful networks, including fundamental differences in the informatics infrastructure and communication tools used at various research sites. Solutions will entail standards for data collection, processing, and archiving to allow interoperability among the databases and the ability to query data across databases. Open architectures for data collection, as well as software to facilitate communication across different databases, are needed. An important requirement for sharing data is protecting both the privacy of the individuals who participate in the research and the data themselves. Privacy protection hinges on maintaining the confidentiality of the data and the security of the databases. There must be clear policies for data security, which may include data encryption, coding, and establishing limited access or a tiered approach to data access.


INTRODUCTION
Progress in scientific research depends on the free flow of information and ideas. As a matter of policy, the NIH is explicit in stating that "restricted availability of unique resources upon which further studies are dependent can impede the advancement of research and the delivery of medical care". To ensure that future research can build on previous efforts and discoveries, the NIH has developed a data sharing policy that has been in effect since October 1, 2003. The policy expects final research data, especially unique data, from NIH-supported research efforts to be made available to other investigators. In implementing this policy, NIH is cognizant of the need to protect the privacy of individuals and thus data security becomes paramount as one considers means to share data. Successful implementation of this policy is also dependent on technology needs such as software tools and database architecture issues.

DATA SHARING: PRIVACY CONCERNS AND SHARING METHODS
Protecting the privacy of human participants in research studies should be a top priority for any researcher. Investigators, Institutional Review Boards (or bioethics review), and research institutions have an obligation to protect participants' rights and welfare, including individual privacy protection and confidentiality of data. Privacy and confidentiality are particularly important for studies with very small sample sizes. Steps should be taken to avoid inadvertent identification of participants through deductive approaches when the sample size is small. For example, in a study involving a small community, one might be able to identify a participant based on just a few personal characteristics or attributes, without even knowing the individual's name, address or telephone number. Similarly, there are caveats in sharing data from studies that collect sensitive data. However, even in these situations data sharing is possible without compromising confidentiality, provided that identifiers are removed from the data. In addition, data sharing agreements can be used to restrict the transfer of data to others and to specify the appropriate uses for shared data.
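The risk of deductive identification described above can be screened for before data are released. The following Python sketch is one possible illustration, not a method prescribed by the NIH policy: it counts how many records share each combination of characteristics and reports the combinations that occur fewer than k times. The function name `small_cells`, the field names, the sample records and the threshold k are all assumptions made for this example.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return attribute combinations shared by fewer than k records.

    A combination held by very few participants could allow someone to
    single out an individual even when names and addresses are absent.
    """
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical records from a small community (illustrative values only).
records = [
    {"age_group": "30-39", "occupation": "teacher", "village": "A"},
    {"age_group": "30-39", "occupation": "teacher", "village": "A"},
    {"age_group": "60-69", "occupation": "midwife", "village": "A"},
]

risky = small_cells(records, ["age_group", "occupation", "village"], k=2)
print(risky)
```

In this toy data set, only one record carries the combination of age group 60-69, occupation midwife and village A, so that combination is flagged. Flagged combinations could then be coarsened, suppressed, or moved to a more restricted access tier before sharing.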
Investigators should take into consideration possible restrictions from local, State, and Federal laws, such as the Privacy Rule, a U.S. Federal regulation under the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA Privacy Rule mandates that an individual's written authorization is required for the use or the disclosure of protected health information unless the requirement is waived by an institution's privacy board. Furthermore, a decedent's information is protected, even though, for obvious reasons, his or her authorization cannot be obtained. In these situations, the next-of-kin must be contacted for authorization. For research purposes, researchers may obtain authorization to use protected health information if the information is used for a specific research study; HIPAA does not allow for future unspecified research use. It should be noted that authorization may be given to create a repository or a database. Research use of protected health information without authorization is allowed if the protected information has been completely de-identified. De-identification is achieved either by removing all 18 identifiers defined by HIPAA or statistically, whereby a statistician certifies that there is a very small risk that the information could be used to identify an individual. HIPAA also affords some flexibility to use a limited data set for research purposes under a data use agreement. With such an agreement, limited types of identifiers, such as city, state, ZIP code, and elements of dates, can be released with protected health information, but no unique identifiers can be released.
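De-identification by removal of identifiers can be pictured as dropping identifying fields from each record before it is shared. The Python sketch below is a minimal illustration under stated assumptions: the field names are hypothetical, and the set of fields shown is only a small illustrative subset, not the list of 18 identifiers defined by HIPAA. An actual de-identification procedure must follow the regulation itself.

```python
# Hypothetical field names; an illustrative subset only, NOT the HIPAA list.
IDENTIFIER_FIELDS = frozenset(
    {"name", "street_address", "telephone", "email", "medical_record_number"}
)


def strip_identifiers(record, identifier_fields=IDENTIFIER_FIELDS):
    """Return a copy of the record without the listed identifier fields.

    The input record is left unchanged so the source database keeps
    its original content.
    """
    return {
        field: value
        for field, value in record.items()
        if field not in identifier_fields
    }


# Hypothetical record with made-up values.
record = {
    "name": "Jane Doe",
    "telephone": "555-0100",
    "age_group": "40-49",
    "diagnosis_code": "X00",
}

shared = strip_identifiers(record)
print(shared)
```

Removing fields in this way addresses only direct identifiers. Combinations of the remaining attributes may still permit deductive identification, which is why the statistical route and data use agreements are also available.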
Data sharing can be accomplished through a number of methods. The most common method is publishing articles in scientific publications. Researchers also share data through an informal channel, by responding directly to data requests. However, as the need for data confidentiality increases, the methods of sharing become more stringent. These may take the form of a data enclave involving controlled, secure environments in which eligible researchers can perform analyses using data resources. Alternatively, data archives can be used where machine-readable data are acquired, manipulated, documented and distributed. A combination of these methods can be used to share data, each providing a different level of access.
An example of sharing sensitive research data comes from an NIH-supported questionnaire-based survey, the National Longitudinal Study of Adolescent Health. The study involved more than 20,700 adolescents in grades 7-12, who were followed from 1994 to 2002, as well as their parents and the school administrators. Independent data were collected on the neighborhoods and communities where these schools were located. Measures of health, health-related behaviors such as sexual activity and drug use, as well as determinants of health at the individual, family, school, peer group and community level were included in the questionnaire. On the one hand, the challenges to data sharing from this study were the sensitive nature of the responses, the need to protect individual privacy and the danger of deductive disclosure. On the other hand, the benefits and rationale for sharing research data from this study were overwhelming, because the scope and substantial cost of such a large study made it unlikely to be replicated. The potential for learning much more, beyond the primary outcome assessment of adolescent behavior and strategies for interventions, was significant.
The solution to the risks of data sharing was to develop a multi-tiered system for data sharing: public-use data, contractual data sets and a "cold room" for on-site data use. Public-use data were made available through a data archive managed by a contractor. The public data set contained a small fraction of the entire data, approximately 6,500 cases from which identifying information was redacted. Data from small populations that were over-sampled, such as ethnic groups, were not made publicly available. At the next tier of data security, i.e., under a data-use contractual agreement, the full data set was made available to researchers if the IRB approved the data security plan. The data-use contract must remain active with a signed agreement, and the requesters must agree to cover the costs of providing the data. At the highest level of data security, a cold room was established by the NIH at the site of the grantee institution.
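The three tiers described above can be sketched as a simple access rule. The Python example below is only an illustration of the idea of tiered access, not the actual procedure used by the study: the class and field names are hypothetical, and the rule that cold-room access requires both the contractual conditions and on-site presence is an assumption, since the text does not state the eligibility conditions for the cold room.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Access tiers, ordered from least to most restricted data."""
    PUBLIC_USE = 1
    CONTRACTUAL = 2
    COLD_ROOM = 3


@dataclass
class AccessRequest:
    # All field names are hypothetical and chosen for this sketch.
    irb_approved_security_plan: bool = False
    signed_active_contract: bool = False
    agrees_to_cover_costs: bool = False
    on_site: bool = False


def highest_tier(request: AccessRequest) -> Tier:
    """Return the most restricted tier this request qualifies for."""
    contract_ok = (
        request.irb_approved_security_plan
        and request.signed_active_contract
        and request.agrees_to_cover_costs
    )
    # Assumption: cold-room use also requires the contractual conditions.
    if contract_ok and request.on_site:
        return Tier.COLD_ROOM
    if contract_ok:
        return Tier.CONTRACTUAL
    return Tier.PUBLIC_USE


anonymous = AccessRequest()
contracted = AccessRequest(True, True, True)
visiting = AccessRequest(True, True, True, on_site=True)
print(highest_tier(anonymous).name, highest_tier(contracted).name,
      highest_tier(visiting).name)
```

Ordering the tiers as integers makes it straightforward to compare the tier a data set requires with the tier a requester has been granted.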

INTEROPERABILITY
Sharing data between databases requires interoperability of data formats, standards, and data types. When these elements are standardized, researchers can access and use heterogeneous information, such as molecular biology, DNA and protein sequence, genomic, proteomic, micro-array, clinical and biomedical imaging data, to name just a few.

Data Science Journal, Volume 6, Open Data Issue, 17 June 2007
A major aim of bioinformatics is to establish agreed-upon standards for file formats and data exchange protocols.
Vocabulary and ontology provide a common set of words, together with a common context and specific meanings for those words, making it possible to integrate data across databases and to write general-purpose software for data processing. Ontology, as a computable, machine-interpretable language to represent biology, can facilitate the semantic interoperability of biological data in different domains such as genomic or clinical studies.
The NIH-funded Gene Ontology Consortium has stimulated the development and adaptation of tools for accessing databases and mapping gene products to ontology terms. These tools enable us to query the databases in a consistent manner, based on a common understanding of the definition of querying terms. Another example of ontology development and application at the NIH is the National Cancer Institute's cancer Biomedical Informatics Grid (caBIG). The caBIG vocabularies are based on the NCI thesaurus. Repositories of common data elements provide data standards, including the development, promotion and support of vocabularies and ontology, to ensure that the entire caBIG community is speaking the same "language". The caBIG infrastructure achieves data sharing and interoperability through a federated model, providing a platform for researchers to access a rich collection of biomedical data using informatics tools to integrate diverse data types. Thus caBIG serves as a model for sharing and accessing data in a federated database paradigm. Application programming interfaces (APIs) and messaging interfaces based on the standard vocabularies, ontology and common data elements permit sharing and exchange of data using a variety of software tools.
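The idea of querying federated databases through a shared vocabulary can be illustrated with a toy example. The Python sketch below does not use the caBIG or Gene Ontology software or identifiers: the site names, gene names, local labels and term identifiers such as TERM:0001 are all hypothetical placeholders. It shows only the principle that each site keeps its own data and local terminology, while a mapping to common term identifiers lets one query be answered consistently across sites.

```python
# Hypothetical mappings from each site's local labels to shared term IDs.
SITE_VOCABULARIES = {
    "site_a": {"programmed cell death": "TERM:0001", "DNA repair": "TERM:0002"},
    "site_b": {"apoptosis": "TERM:0001", "DNA damage repair": "TERM:0002"},
}

# Hypothetical local annotations: (gene product, local label) pairs.
SITE_DATA = {
    "site_a": [("geneA", "programmed cell death"), ("geneB", "DNA repair")],
    "site_b": [("geneC", "apoptosis")],
}


def federated_query(term_id):
    """Return (site, gene) pairs annotated with the shared term at any site.

    Each site is searched in place; local labels are translated to the
    shared term ID, so the caller never needs to know site terminology.
    """
    hits = []
    for site, rows in SITE_DATA.items():
        vocabulary = SITE_VOCABULARIES[site]
        for gene, local_label in rows:
            if vocabulary.get(local_label) == term_id:
                hits.append((site, gene))
    return hits


print(federated_query("TERM:0001"))
```

Here the two sites use different words for the same concept, yet a single query on the shared identifier retrieves matching records from both. In a real federated system the per-site lookups would be remote calls through standard interfaces rather than in-memory dictionaries.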

OPEN SOURCE AND OPEN ARCHITECTURE IN BIOINFORMATICS
Open source, as defined by the Open Source Initiative (http://www.opensource.org/docs/definition.php), is not limited to open access to the source code. The definition states that "the distribution license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources". The source code as well as the compiled program must be publicly available. One must be able to modify the open-source software, and the license must allow derivatives to be distributed under the same terms as the license of the original software. Open-source software libraries also provide fundamental building blocks for algorithms and applications. An example is the Visualization ToolKit (VTK), which is an open-source, freely available software system for 3D computer graphics, image processing and visualization. Thousands of researchers and developers around the world have utilized this tool for various purposes.
The open-source philosophy has become mainstream in bioinformatics. The examples described above share the feature of being well managed and executed, resulting in robust and high-quality bioinformatics products. A flexible open architecture and well-tested, harmonized software code are some of the critical factors that contribute to the success of such open-source projects.