DNA Data Bank of Japan: 30th anniversary

Abstract The DNA Data Bank of Japan (DDBJ) Center (http://www.ddbj.nig.ac.jp) has been providing public data services for 30 years since 1987. We are collecting nucleotide sequence data and associated biological information from researchers as a member of the International Nucleotide Sequence Database Collaboration (INSDC), in collaboration with the US National Center for Biotechnology Information and the European Bioinformatics Institute. The DDBJ Center also services the Japanese Genotype-phenotype Archive (JGA) with the National Bioscience Database Center to collect genotype and phenotype data of human individuals. Here, we outline our database activities for INSDC and JGA over the past year, and introduce submission, retrieval and analysis services running on our supercomputer system and their recent developments. Furthermore, we highlight our responses to the amended Japanese rules for the protection of personal information and the launch of the DDBJ Group Cloud service for sharing pre-publication data among research groups.


INTRODUCTION
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (1) is a public database of nucleotide sequences established at the National Institute of Genetics (NIG, https://www.nig.ac.jp/nig). Since 1987, the DDBJ has been collecting annotated nucleotide sequences as its traditional database service, and we held the NIG international symposium commemorating its 30th anniversary in May 2017 (http://www.ddbj.nig.ac.jp/ddbj30th/en). The content of the DDBJ is primarily accumulated via submissions of sequence data by researchers. In addition, the Japan Patent Office and the Korean Intellectual Property Office contribute sequences from published patent applications. This endeavor has been conducted in collaboration with GenBank (2) at the National Center for Biotechnology Information (NCBI) and with the European Nucleotide Archive (ENA) (3) at the European Bioinformatics Institute (EBI). The collaborative framework is called the International Nucleotide Sequence Database Collaboration (INSDC) (4) and the product database from this framework is called the International Nucleotide Sequence Database (INSD).
Within the INSDC framework, the DDBJ Center also services the DDBJ Sequence Read Archive (DRA) for raw sequencing data and alignment information from high-throughput sequencing platforms (5), BioProject for sequencing project metadata and BioSample for sample information (1,6). The comprehensive resource of nucleotide sequences and associated biological information complies with the INSDC policy that guarantees free and unrestricted access to data archives (7).
In addition to these unrestricted-access databases, the DDBJ Center services a controlled-access database, the Japanese Genotype-phenotype Archive (JGA, http://trace.ddbj.nig.ac.jp/jga), in collaboration with the National Bioscience Database Center (NBDC, https://biosciencedbc.jp/en) of the Japan Science and Technology Agency (1,8). The JGA stores genotype and phenotype data from individuals who have signed consent agreements authorizing data use only for specific research. Data access is strictly controlled, similar to the data access policy of the database of Genotypes and Phenotypes at the NCBI (9,10) and the European Genome-phenome Archive at the EBI (11). NBDC provides the guidelines and policies for sharing human-derived data (https://humandbs.biosciencedbc.jp/en/guidelines) and also reviews data submission and usage requests.
The DDBJ Center, a part of NIG, is funded as a supercomputing center. Our web services, including submission systems, data retrieval and analytical systems and backend databases, run on the NIG supercomputer system. The current commodity-based cluster was implemented in 2012 (12).
In the present article, we report the update of the above services at the DDBJ Center, highlight our responses to the amended Japanese rules for protection of personal information and describe the launch of the DDBJ Group Cloud (DGC) service for sharing pre-publication data among research groups. All resources described here are available at http://www.ddbj.nig.ac.jp and most of the archival data can be downloaded at ftp://ftp.ddbj.nig.ac.jp.
Starting with this report, the DDBJ periodical release includes not only conventional sequence data but also bulk sequence data, such as Whole Genome Shotgun (WGS) and Transcriptome Shotgun Assembly (TSA). Between June 2016 and May 2017, the DDBJ periodical release increased by 147 437 521 to 874 923 909 in terms of the number of entries and by 572 071 571 206 to 2 461 362 329 556 in terms of the number of base pairs. The periodical release does not include third party data (TPA) records (13). The DDBJ contributed 7.23% of the entries and 3.79% of the total base pairs in the nucleotide sequence data of INSD. A detailed statistical breakdown of the number of records is shown on the DDBJ website (http://www.ddbj.nig.ac.jp/breakdown_stats/prop_ent-e.html). Noteworthy large-scale data released from DDBJ are listed in Table 1.
In the period between June 2016 and May 2017, high-throughput sequencing data of 30 418 runs were registered to the DRA. Some of the RIKEN FANTOM5 transcript data (58 runs in total) used to generate a comprehensive atlas of 27 919 human long non-coding RNA genes and expression profiles across 1829 samples from the major human primary cell types and tissues (14) were released from the DRA (Table 1).
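The release figures above imply the size of the release one year earlier and its relative growth. The following is a minimal sketch of that arithmetic, using only the four numbers quoted in the text; the variable and function names are illustrative assumptions and are not part of any DDBJ tool.

```python
# Figures quoted in the text for the period June 2016 to May 2017.
ENTRIES_NOW = 874_923_909
ENTRIES_ADDED = 147_437_521
BASES_NOW = 2_461_362_329_556
BASES_ADDED = 572_071_571_206


def previous_total(now: int, added: int) -> int:
    """Size of the release before the reported increase."""
    return now - added


def relative_growth(now: int, added: int) -> float:
    """Increase expressed as a fraction of the earlier release size."""
    return added / previous_total(now, added)


entries_before = previous_total(ENTRIES_NOW, ENTRIES_ADDED)
bases_before = previous_total(BASES_NOW, BASES_ADDED)
entry_growth = relative_growth(ENTRIES_NOW, ENTRIES_ADDED)
base_growth = relative_growth(BASES_NOW, BASES_ADDED)

print(f"entries:    {entries_before:,} -> {ENTRIES_NOW:,} (+{entry_growth:.1%})")
print(f"base pairs: {bases_before:,} -> {BASES_NOW:,} (+{base_growth:.1%})")
```

Under these figures, the number of entries grew by roughly 20% and the number of base pairs by roughly 30% over the year.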

Data contents: the Japanese genotype-phenotype archive (JGA)
The JGA is a permanent archiving service for human genotype and phenotype data (8). Submitters must remove any direct personal identifiers from metadata to be submitted to the JGA. After encrypting the submitted data, the JGA team stores them in the secure database. As of 17 August 2017, the JGA had archived 104 studies (81 TB) of individual-level human datasets submitted by Japanese researchers. Submission of these studies was reviewed and approved by the Data Access Committee (DAC) at the NBDC. The summaries of 57 studies are available to the public on both the JGA (https://ddbj.nig.ac.jp/jga/viewer/view/studies) and the NBDC (https://humandbs.biosciencedbc.jp/en/data-use/all-researches) websites. Notable studies available for data access request include 'Standard epigenome mapping in human epithelial cells of the digestive and urogenital organs' (JGA study accession numbers JGAS00000000078-80) submitted by the Japanese team of the International Human Epigenome Consortium (http://crest-ihec.jp/english/index.html) and 'GWAS for atrial fibrillation in the Japanese population' (JGAS00000000114), which is part of the BioBank Japan project that conducted genome-wide association analyses of over 200 000 Japanese participants related to 47 common diseases (15). To access individual-level data of these public studies, users are required to make data access requests to the NBDC (https://humandbs.biosciencedbc.jp/en/data-use). The DAC at the NBDC ensures that the stated research purposes are compatible with participant consent and that the principal investigator and institution will abide by the NBDC guidelines and the specific terms and conditions imposed for a given dataset. Once access has been granted by the DAC, datasets with access permission can be downloaded with a secure software tool provided by the JGA. Users must establish a secure computing facility for local use of the downloaded data according to the NBDC security guidelines.
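Accession numbers in this paragraph are given in an abbreviated range form (JGAS00000000078-80). As a minimal sketch, the range can be expanded into individual accessions as follows, assuming that the digits after the hyphen replace the trailing digits of the first accession; the function name `expand_accession_range` is hypothetical and is not part of any JGA software.

```python
import re


def expand_accession_range(text: str) -> list[str]:
    """Expand an abbreviated range such as 'JGAS00000000078-80'.

    Assumption: the digits after the hyphen replace the trailing digits
    of the first accession. A single accession is returned unchanged.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)-(\d+)", text)
    if match is None:
        return [text]
    prefix, first, last_suffix = match.groups()
    width = len(first)
    if len(last_suffix) > width:
        raise ValueError(f"range end is longer than range start: {text}")
    # Rebuild the full-width final number from the abbreviated suffix.
    last = first[: width - len(last_suffix)] + last_suffix
    start, end = int(first), int(last)
    if end < start:
        raise ValueError(f"range end precedes range start: {text}")
    # Zero-pad every number back to the original width.
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


for accession in expand_accession_range("JGAS00000000078-80"):
    print(accession)
```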

Responses to the amended rules for protection of personal information
The DDBJ Center handles personal information in compliance with Japanese laws and guidelines. The Act on the Protection of Personal Information (PPI Act, https://www.ppc.go.jp/en/legal), first established in 2003, defines the categories of personal information that should be protected and how this should be achieved. Reflecting information and communication technology developments that have markedly expanded the nature and usage of personal information, the PPI Act was amended. The following two amendments have had a major influence on the sharing of personal genotype and phenotype information. (i) Personal whole-genome-level DNA sequence data are defined as an 'individual identification code.' Even if all personal identifiers have been removed from the metadata linked to the whole-genome-level DNA sequencing data, these data need to be handled as 'personal information' because the DNA sequences are inherently a code that could identify individuals. (ii) Personal information including the individual's race and medical history, which requires special consideration so as not to cause unfair discrimination or prejudice against the individual, is defined as 'sensitive personal information.' To acquire sensitive personal information and provide it to others, researchers are in principle required to obtain informed consent from research participants. In accordance with the PPI Act amendment, the relevant ministries' ethical guidelines for medical and health research involving human subjects have also been amended. Since the enforcement of these amended laws and guidelines on 30 May 2017, to submit whole-genome-level personal genomic DNA sequencing data to our unrestricted- or controlled-access databases, the submitter needs approval from the NBDC, which checks whether the submission complies with the amended laws and guidelines.

Submission services of biological data
For annotated sequence submission to the traditional DDBJ database, we provide two systems: the Nucleotide Sequence Submission System (NSSS) (16) and the Mass Submission System (MSS) (17). The NSSS is an interactive application to enter all items via a web-based form (http://www.ddbj.nig.ac.jp/sub/websub-e.html). The MSS involves a procedure to send large-scale data files directly (http://www.ddbj.nig.ac.jp/sub/mss_flow-e.html). Both systems were enhanced to comply with the new rules of feature and qualifier usages (see http://www.ddbj.nig.ac.jp/insdc/icm2016-e.html#ft). Submitters can register metadata to BioProject, BioSample and DRA by logging in and using the web interface (https://trace.ddbj.nig.ac.jp/D-way). Human genotype and phenotype data can be submitted to the JGA by using secure upload software.

Retrieval and analysis services of biological data
The DDBJ Center has provided the Web BLAST (18), ClustalW (19,20), vector sequences screening system VecScreen (http://ddbj.nig.ac.jp/vecscreen/vecscreen?lang=en) and Taxonomy browser TXSearch (http://ddbj.nig.ac.jp/tx_search) services, which receive requests from web interfaces. The DDBJ Center also provides the Web API for Bioinformatics (WABI) (21-23) for large-scale data analysis and the RESTful Web API service that can process requests from computer programs. The WABI service includes BLAST, VecScreen, ClustalW, MAFFT (24,25), the getentry data retrieval system via accession numbers and the ARSA keyword search system for the DDBJ flat files (12). We have semantically represented the DDBJ annotated sequence records in the Resource Description Framework (RDF) in collaboration with the Database Center for Life Science (DBCLS) (1,26,27). In collaboration with EBI ArrayExpress (28), we have also mirrored the public ArrayExpress experiment, array, and Expression Atlas data to our FTP site (ftp://ftp.ddbj.nig.ac.jp/mirror_database/arrayexpress) since December 2016.

DDBJ pipeline
The DDBJ Read Annotation Pipeline (DDBJ Pipeline, https://p.ddbj.nig.ac.jp) is a web service for annotation analysis of high-throughput DNA sequencing reads running on the NIG supercomputer (29). We provide basic analytical functions of de novo assembly and reference sequence alignment using a Graphical User Interface. A de novo assembler, Canu (30), has been added to the pipeline, which can be utilized only for long reads of Oxford Nanopore Technologies sequencers.

The NIG supercomputer
The NIG supercomputer is composed of calculation nodes for general-purpose (554 thin nodes, each with 64 GB memory) and memory-intensive tasks including de novo assembly of sequencing reads (10 medium nodes, each with 2 TB of memory and one fat node with 10 TB of memory). The calculation nodes are interconnected with InfiniBand and the total peak performance of CPUs is 372 Tflops. To support massive I/O in the big-data analysis, the NIG supercomputer is equipped with 7.1 PB of the Lustre parallel distributed file system (http://www.lustre.org). The 5.5 PB MAID (Massive Array of Idle Disks) system is used for archiving large-scale sequencing data of the JGA and INSD's Sequence Read Archive while lowering power consumption (12).
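The node counts above can be combined into an aggregate memory figure for the calculation nodes. The sketch below is illustrative arithmetic only; it assumes decimal units (1 TB = 1000 GB), which the text does not specify, and the names used are not taken from any NIG software.

```python
GB_PER_TB = 1000  # assumption: decimal units; the text does not say which

# (node class, number of nodes, memory per node in GB), as quoted in the text.
NODE_CLASSES = [
    ("thin", 554, 64),
    ("medium", 10, 2 * GB_PER_TB),
    ("fat", 1, 10 * GB_PER_TB),
]


def aggregate_memory_gb(node_classes) -> int:
    """Total memory across all calculation nodes, in GB."""
    return sum(count * memory_gb for _, count, memory_gb in node_classes)


total_gb = aggregate_memory_gb(NODE_CLASSES)
print(f"aggregate memory: {total_gb:,} GB ({total_gb / GB_PER_TB:.1f} TB)")
```

Under this assumption, the thin nodes together hold slightly more memory than the eleven memory-intensive nodes combined, although only the latter can serve a single job that needs terabytes of memory on one node.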
Between June 2016 and May 2017, the number of NIG supercomputer users increased from 2501 to 2951. The criteria for issuing a user login account are shown on the web page (https://sc.ddbj.nig.ac.jp/index.php/en/criteriafor-issuing-user-login-accounts). For the convenience of the users, many biological datasets (listed at https://sc.ddbj.nig.ac.jp/index.php/ja-availavle-dbs, Japanese only) and popular bioinformatics tools (listed at https://sc.ddbj.nig.ac.jp/index.php/ja-avail-oss, Japanese only) are installed on the NIG supercomputer system. In February 2017, we started a billing system to share costs with users who use large-volume storage and reserve the calculation nodes for new jobs. We expect that we can promote efficient use of our computer resources and increase the sustainability of our system by sharing operating costs with users (https://sc.ddbj.nig.ac.jp/index.php/billing-system, Japanese only).
Nucleic Acids Research, 2018, Vol. 46, Database issue, D34

DDBJ group cloud service for sharing pre-publication data
As sequencing technologies advance and the amount of genomic data generated grows, it becomes critical to store, analyze and share large-scale data with research collaborators efficiently. To facilitate the sharing and analysis of pre-publication data among research groups, the DDBJ Center has operated the DGC, a cloud-type service, on the NIG supercomputer since February 2017. In the DGC databases, users can upload and share their pre-publication data with their research collaborators in data models that are identical to those of the public databases. Upon publication, users can submit their data by simply transferring the data from the DGC database to the corresponding public database of the DDBJ Center. The DGC hosts the AMED Genome Group Sharing Database (AGD) (http://trace.ddbj.nig.ac.jp/agd/index_e.html) as the first use case. In the AGD, researchers funded by the Japan Agency for Medical Research and Development (AMED, http://www.amed.go.jp/en) upload and share their pre-publication raw personal genome sequencing data in the JGA's data model. Because the DGC is not a fully public service, the operating costs are shared with the DGC users.

FUTURE DIRECTION
The ever-increasing volume of personal sequencing data makes it difficult for researchers to prepare their own secure computer resources with sufficient storage and computing power and to transfer large amounts of data online from public databases. To solve these issues, the NBDC certifies qualified secure supercomputer systems as 'Trusted Servers' and allows users to analyze the approved JGA dataset in the Trusted Servers in addition to their own servers. The DDBJ Center will provide the secured NIG supercomputer as a Trusted Server that is connected with the JGA system by a high-speed network, so users can smoothly download the JGA dataset and analyze their own personal genomic data in the same supercomputer.
To increase the discoverability of the JGA-archived human genomes, the DDBJ Center and NBDC collaborate to provide the Global Alliance for Genomics and Health beacon web service (https://beacon-network.org) to accept queries of specific alleles on the human reference genome.
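A beacon answers a yes/no question: has a given allele been observed at a given position of the reference genome? The sketch below illustrates the idea with an in-memory lookup. It is not the DDBJ or NBDC implementation and makes no network request; the field names are loosely modeled on the beacon query parameters, the coordinate convention is an assumption, and the alleles in `TOY_ALLELES` are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleQuery:
    reference_name: str   # chromosome name, e.g. "1"
    start: int            # position on the reference (0-based is assumed here)
    reference_bases: str  # bases in the reference genome at that position
    alternate_bases: str  # bases observed in the archived genomes


# Invented toy data standing in for alleles found in an archive.
TOY_ALLELES = frozenset(
    {
        AlleleQuery("1", 12345, "A", "G"),
        AlleleQuery("7", 67890, "C", "T"),
    }
)


def beacon_exists(query: AlleleQuery, known=TOY_ALLELES) -> bool:
    """Return True if the queried allele is present in the known set."""
    # Normalize base case so that 'a' and 'A' are treated as the same allele.
    normalized = AlleleQuery(
        query.reference_name,
        query.start,
        query.reference_bases.upper(),
        query.alternate_bases.upper(),
    )
    return normalized in known


print(beacon_exists(AlleleQuery("1", 12345, "a", "g")))
print(beacon_exists(AlleleQuery("1", 12345, "A", "C")))
```

Because the response reveals only presence or absence, a service of this kind can make controlled-access genomes discoverable without exposing individual-level records.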
The DDBJ Center has launched the Japan Alliance for Bioscience Information portal site (http://jbioinfo.jp/index.html) in collaboration with NBDC, DBCLS and the Protein Data Bank Japan. We will develop this portal site as a one-stop service of databases and tools that are helpful in various fields of life science research.

ACKNOWLEDGEMENTS
We gratefully acknowledge the support of Koji Watanabe, Chiharu Kawagoe and all members of the DDBJ Center for their assistance in data collection, annotation, release and software development. We thank Masanori Arita for organizing the NIG symposium commemorating the 30th anniversary of DDBJ and for helpful discussions. We are also grateful to Mari T. Minowa, Minae Kawashima, Kazunori Miyazaki and Nobutaka Mitsuhashi of NBDC as collaborators of the JGA project; Yasuhiro Tanizawa, Takako Mochizuki and Shota Morizaki for the DDBJ Pipeline updates; Takatomo Fujisawa and Toshiaki Katayama for validation and semantic representation of INSDC data; Yoshihiro Okuda for taxonomy search; Tazro Ohta of DBCLS and Ryota Yamanaka of Oracle Corporation Japan for the virtual machine collaboration; and Hidemasa Bono of DBCLS, Amy Tang, Ugis Sarkans and Robert Petryszak of EBI for ArrayExpress data mirroring. We would also like to thank Kento Aida, Shigetoshi Yokoyama and Nobuyoshi Masatani of the National Institute of Informatics and Shinichi Miura and Satoshi Matsuoka of Tokyo Institute of Technology for establishing the computational infrastructure of the NIG supercomputer.