Replicate whole-genome next-generation sequencing data derived from Caucasian donor saliva samples

Next-generation sequencing (NGS) of whole genomes has become more accessible to biomedical researchers as the sequencing price continues to drop, and more laboratories have NGS facilities or have access to a core facility. However, the rapid and robust development of practical bioinformatics pipelines partly depends on convenient access to data for the testing of algorithms. Publicly available data sets constitute a part of this strategy. Here, we provide a triplicate whole-genome paired-end sequencing data set, consisting of 1.38 billion raw sequencing reads derived from saliva DNA from a single anonymous male Caucasian donor, with the average sequencing depths aimed at 30x for two of the samples and 4x for a low-coverage sample. The raw number of single nucleotide variants was 3.3–4 million, and the median variant read depths of GATK4-passed variants in the three samples were 22, 18, and 10. 81% of all variants were found in two or three of the samples, whereas 19% were singletons. The karyotype was evaluated as 46,XY with no apparent copy-number variation. The data set is provided without restrictions for research, educational or commercial purposes.



Value of the Data
• The data set provided here is relevant for the continued development and testing of bioinformatics pipelines as whole-genome sequencing becomes more important in biomedical research.
• Data access is provided by simple download and without restrictions. The triplicate sequencing of a Caucasian male may benefit bioinformaticians and biomedical researchers, either for testing or as control samples. The data may also be used for educational purposes.
• The raw sequencing data consist of biological replicates of low, medium, and higher coverage, which may thus be used for testing different workflow setups.

Data Description
Here, we provide a data collection of samples derived from saliva DNA from a single anonymous male Caucasian donor, consisting of triplicate whole-genome paired-end sequencing reads, with 1.38 billion raw reads in total (Fig. 1A), a mean quality of 36 (SAMtools stats), and approximately 93.4% paired and mappable reads (GRCh37). The combined theoretical mean coverage was estimated to be 63-67x, depending on whether the mapped percentage or the unadjusted read count was used, following the Lander and Waterman approach [1,2] for genomic mapping: C = L × N / G (C: coverage, L: read length, N: number of reads, G: genome size). Calculations were based on average read lengths of 144, 146, and 148 bp. The median GATK-passed variant read depths of the three samples were 22, 18, and 10, with 3.3-4 million variants (Fig. 1B), thus representing medium, low, and shallow depth from the perspective of contemporary WGS coverage. 81% of all GATK4-passed variants (see provided workflow), 4.2 million in total, were found in two or three of the samples and 19% were singletons (Fig. 1C). In agreement with previously reported results [3], the total number of unique single nucleotide variants was approximately 4 million. The karyotype was evaluated as 46,XY with no noticeable copy-number variation (CNV) detected (Fig. 1D). We note that the number of variants will vary according to the user-specified workflow.
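The coverage estimate can be illustrated with a minimal Python sketch of C = L × N / G. The read count, read lengths, and mapped percentage are taken from the description above; the genome size of 3.0 Gb and the use of a single mean read length across the three samples are assumptions made here for illustration, since per-sample read counts and the value of G are not stated. The function and variable names are hypothetical.

```python
def lander_waterman_coverage(read_length_bp, n_reads, genome_size_bp,
                             mapped_fraction=1.0):
    """Theoretical mean coverage C = L * N / G.

    mapped_fraction scales N to the reads that are paired and mappable;
    the default of 1.0 gives the unadjusted estimate.
    """
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return read_length_bp * n_reads * mapped_fraction / genome_size_bp


# Values reported in the text.
READ_LENGTHS_BP = (144, 146, 148)   # average read length per sample
TOTAL_READS = 1.38e9                # raw reads across the three samples
MAPPED_FRACTION = 0.934             # paired and mappable reads (GRCh37)

# Assumption: genome size is not given in the text; 3.0 Gb is illustrative.
ASSUMED_GENOME_SIZE_BP = 3.0e9

# Simplification: one mean read length for all reads, because the
# per-sample read counts are not stated.
mean_read_length = sum(READ_LENGTHS_BP) / len(READ_LENGTHS_BP)

unadjusted = lander_waterman_coverage(
    mean_read_length, TOTAL_READS, ASSUMED_GENOME_SIZE_BP)
mapped_only = lander_waterman_coverage(
    mean_read_length, TOTAL_READS, ASSUMED_GENOME_SIZE_BP, MAPPED_FRACTION)

print(f"unadjusted: {unadjusted:.1f}x, mapped only: {mapped_only:.1f}x")
```

Under these assumptions the sketch gives roughly 67x unadjusted and 63x for mapped reads only, close to the reported 63-67x range; a different choice of G would shift both values proportionally.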
The data set is provided without restrictions for research, educational or commercial purposes. Additional replicates may be added to the repository for future usage. Please cite appropriately.

Ethics Statement
Informed consent was obtained concerning the donation of biological material and genomic information. Sequencing was part of a technology assessment using anonymous donor material and did not involve any clinical evaluations or trials. The data are made freely available in order to contribute to the continued development of NGS bioinformatics and for educational purposes.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article. Funding was provided by the first author. Disclaimer: The provided data presentation is deliberately descriptive. It is not a regular scientific paper.