BGVD: An Integrated Database for Bovine Sequencing Variations and Selective Signatures

Next-generation sequencing has yielded a vast amount of cattle genomic data for global characterization of population genetic diversity and identification of genomic regions under natural and artificial selection. However, efficient storage, querying, and visualization of such large datasets remain challenging. Here, we developed a comprehensive database, the Bovine Genome Variation Database (BGVD). It provides six main functionalities: gene search, variation search, genomic signature search, Genome Browser, alignment search tools, and the genome coordinate conversion tool. BGVD contains information on genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,634 CNV regions, and signatures of selective sweeps in 432 samples from modern cattle worldwide. Users can quickly retrieve distribution patterns of these variations for 54 cattle breeds through an interactive source of breed origin map, using a given gene symbol or genomic region for any of the three versions of the bovine reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Signals of selection sweep are displayed as Manhattan plots and Genome Browser tracks. To further investigate and visualize the relationships between variants and signatures of selection, the Genome Browser integrates all variations, selection data, and resources, from NCBI, the UCSC Genome Browser, and Animal QTLdb. Collectively, all these features make the BGVD a useful archive for in-depth data mining and analyses of cattle biology and cattle breeding on a global scale. BGVD is publicly available at http://animal.nwsuaf.edu.cn/BosVar.


Introduction
Cattle are usually considered the most economically important livestock. The species numbers more than 1.4 billion on a global scale, constituting some 800 extant cattle breeds in 2016 according to the Food and Agriculture Organization (FAO; http://www.fao.org/home/en/). Cattle are now kept on all inhabited continents, in contrasting climatic zones and under very different conditions [1]. The different uses of cattle and the selection for desired traits have resulted in diverse populations distributed across the world. To meet projected global demands for food, initiatives [2][3][4][5] such as the 1000 Bull Genomes Project are generating resequencing data from breeds worldwide. The DNA-based selection tools built from these data are further accelerating rates of genetic gain and improving animal health and welfare [2]. However, the limited amount of variation data provided by dbSNP [6], restricted access to the 1000 Bull Genomes Project [7], and the existence of only sporadic cattle databases that are specialized in gene and quantitative trait locus (QTL) annotation [8][9][10] considerably hinder the utility of these data. Furthermore, accessing and integrating resequencing data in a highly interactive, user-friendly web interface, especially data for allele frequency resource and selection in natural populations, is a pre-requisite for identifying functional genes. Therefore, building a public data repository is vital for collecting a wide variety of cattle resequencing data and performing integrative, in-depth analyses within the research community.
Here, we develop the Bovine Genome Variation Database (BGVD), the first web-based public database for accessing dense and broadly representative bovine whole-genome variation data. BGVD is a data repository that focuses on single nucleotide polymorphisms (SNPs), indels, copy number variations (CNVs), as well as selective signatures underlying domestication and population bottleneck events. We have implemented a large number of summary statistics informative for the action of selection, such as nucleotide diversity (Pi) [11], heterozygosity (H p ) [12], integrated haplotype score (iHS) [13], Fixation index (F ST ) [14], cross-population extended haplotype homozygosity (XP-EHH) [15], and the cross-population composite likelihood ratio (XP-CLR) [16] (Table 1). Six early differentiated ancestral populations were used for selection analysis, including African taurine, European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine. The current version of BGVD contains 60,439,391 SNPs, 6,859,056 indels, and 76,634 CNV regions derived from 432 cattle samples representing 54 breeds. With its functional-ities for browsing variations and their selection scores, BGVD provides an important publicly accessible resource to the research community to facilitate breeding research and applications, and offers information on dominant functional loci and targets for genetic improvement through selection.

Database structure and content
The BGVD includes SNPs, indels, CNVs, genomic selection, and other database resources for cattle, such as NCBI, the UCSC Genome Browser, Animal QTLdb, KEGG, and AmiGO2. A detailed description is provided in the following sections and documents on the homepage.

Sample information
Our data set integrates genomes from previously published cattle genetic studies [3][4][5][17][18][19][20][21], providing a total of 432 samples representing 54 breeds. All raw sequence data were obtained from the Sequence Read Archive (SRA) of NCBI. The set of samples is grouped by location of breed origin and contains the following number of individuals: 108 West European, 83 Central-South European, 9 Middle East, 9 Tibetan, 28 Northeast Asian, 47 North-Central Chinese, 21 Northwest Chinese, 33 South Chinese, 24 Indo-Pakistan, and 70 African cattle. Geographic information and other detailed information for each sample is provided on the homepage and the corresponding ''Sample Table" page.

Variant information
Data were processed and loaded into the BGVD using the following pipeline according to previously published protocols [5] ( Figure 1A, see detailed description on the Documentation page at of the website). First, short, 250-bp paired-end Illumina reads were aligned to the Btau 5.0.1 genome assembly (GCF_000003205.7) using BWA [22], resulting in an average of~13Â coverage of the bovine genome among the cattle varieties. Duplicate reads were removed using Picard tools (http:// broadinstitute.github.io/picard/). The Genome Analysis Toolkit (GATK) was used to detect SNPs and indels [23]. A total of~60.4 million autosomal SNPs and~6.8 million autosomal indels were identified. Beagle was used to phase the identified SNPs [24]. Annotation of SNPs and indels was carried out using snpEff [25]. Minor allele frequency (MAF) for all cattle as well as allele frequencies for each breed and the ''core" cattle group (see Population structure section) were calculated with PLINK [26]. CNVcaller [27] was used to discover CNVs, and 76,634 CNV regions (CNVR) were detected in 432 cattle genomes. Then, the CNVs were annotated using Annovar [28]. Given that three versions of the bovine genome, Btau 5.0.1, UMD3.1.1, and the newly-released ARS-UCD1.2 (project accession: NKLS00000000), are commonly used, we produced liftOver chain files (Btau5.0.1ToUMD3.1.1.chain. gz and Btau5.0.1ToARS-UCD1.2.chain.gz) and converted variation coordinates to those of the other two genomes using liftOver [29].

Population structure
The population structure of all cattle was inferred using Eigensoft and ADMIXTURE [30,31], based on the genome-wide unlinked SNP dataset, according to previously published protocols [5]. All 432 individuals were used for principal component analysis, and the results were consistent with our previous results [6], except that the African taurine cattle were split from other taurine cattle ( Figure 1B). To reduce the bias due to sample size, 10 individuals were randomly selected for breeds that had more than 10 samples. A total of 317 cattle samples were selected for estimating ancestral populations by setting K = 2 through K = 8 in ADMIXTURE ( Figure 1C). Combining with our previous results [5], in addition to five geographically distributed ancestral groups (European taurine, Eurasian taurine, East Asian taurine, Chinese indicine, and Indian indicine), African taurine was added in this study ( Figure 1B).

Selection evaluation
BGVD provides signatures of selection for eight groups, six of which were the ''core" cattle groups that we identified as ancestral groups and the other two of which were directly divided into two categories based on sub-species: Bos indicus and Bos taurus. Here, selective signals were evaluated using six methods, namely, Pi, H p , iHS, F ST , XP-EHH, and XP-CLR (Table 1).

Database implementation
The web interface of BGVD was built by combining an Apache web server, PHP language, HTML, JavaScript, and the relational database managements system MySQL.
High-quality SNPs, indels, CNVs, selection scores and their corresponding annotations, classification, and threshold values, were processed with Perl scripts and stored in the MySQL database. The server application was written in PHP, and CodeIgniter was chosen as the model-viewcontroller (MVC) framework for the system. A client interface developed with HTML5 and JavaScript was used to implement search, data visualization, and download. Moreover, we introduced web-based software, such as BLAST, BLAT, liftOver, and the UCSC Genome Browser (hereafter referred to as 'Gbrowse') [29,32], into BGVD. Information including variations, selection scores, gene annotation, QTLs, and phastCons conserved elements of 20-way mammals and 100-way vertebrates was integrated into Gbrowse to facilitate global presentation.

Web interface and usage
BGVD uses a series of user-friendly interfaces to display results. All the parts in our browser are dynamic and interactive. We provided six main functionalities: (i) Gene Quick Search, (ii) Variation Search, (iii) Genomic Selection Search, (iv) Genome Browser, (v) Alignment Search Tools (BLAT/ BLAST), and (vi) Genome Coordinate Conversion Tool (liftOver). For ''Gene Quick Search", we integrated information from NCBI, AmiGO 2, and KEGG. Users can input a gene symbol to view all available information, including basic gene information (e.g., genomic location, transcript and protein profile, relevant Gene Ontology (GO) ID, GO terms, and KEGG pathways), gene variations (e.g., SNPs, indels, and CNVs), as well as selective signatures. We also provide links to Gbrowse and external databases (NCBI, AmiGO 2, and KEGG) to help users obtain more information, such as gene/mRNA/protein sequence, KEGG Orthology (KO), and motif.
For ''Variation Search", BGVD allows users to obtain information on SNPs, indels, and CNVs by searching for a specific gene or a genomic region in three versions of the bovine genome (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1) ( Figure 2A). Users can filter SNPs and indels further by ''Advanced Search", in which certain parameters ( Figure 2B), such as MAF and consequence type, can be set; this option enables users to narrow down the items of interest in an efficient and intuitive manner.
The results are presented in an interactive table and graph. For SNPs and indels, users can obtain related details including variant position, alleles, MAF, variant effect, rsID, and the allele frequency distribution pattern in 54 cattle breeds worldwide ( Figure 2C) or in six ''core" cattle groups ( Figure 2D), which could help users dynamically visualize breed-specific (exemplified with rs384881761, KRT27) [2] or ancestral groupspecific (exemplified with rs109815800, PLAG1) [33] variants and their global geographical distributions.
For CNVs, users can obtain information about CNVR, such as intersected genomic regions, CNV length, the closest gene, consequence type ( Figure 3A), and copy number distribution in 432 individuals representing 49 cattle populations. We provide three types of display formats of copy number distributions, in which the categories and haploid copy number of each individual can be viewed ( Figure 3B-D), such as the ''view" button, which produces a scatter plot (exemplified with MATN3); ''Gbrowse", which is linked to the ''CNVR Bar" track (exemplified with KIT); and the more detailed visualization ''cnvBar" track, which generates a box-whisker plot (exemplified with CIITA) [34].
In the genomic signature interface, users can select a specific gene symbol or genomic region, one of the statistical methods (Pi, H p , iHS, F ST , XP-CLR, and XP-EHH), and a specific ''core" cattle group to view the selection scores (Table 1 and Figure 4A). In our database, the selection scores are preprocessed by several algorithms (Z-transform and logarithm). The results are retrieved in a tabular format ( Figure 4B). When users click the ''show" button on the table, selective signals are displayed in Manhattan plots or common graphics, where the target region or gene is highlighted in red/blue color. In addition, the ''Gbrowse" button can locate the position of the selection and differentiation profiles of specific groups (Figure 4C). To demonstrate the function of our database, we extracted results for a number of putatively selected genes detected using different methods: OR2T33 [35], POFUT1 (Figure 4B and C), STOM, EPB42 [3], PLAG1 [33], MSRB3 [35], CDC42SE1 [36], R3HDM1 [37], and ASIP [5].
To further investigate the relationship between variations and signatures of selection, Gbrowse has been introduced to support our database. Currently, 57 tracks have been released for the Btau 5.0.1 assembly ( Figure 4E). Users can search with a gene symbol or genomic region to see SNPs, indels, CNVs, genomic signatures, QTLs, and conserved elements in the global view. All search pages in the BGVD allow quick access to Gbrowse to deepen the functional inference of the candidate gene or region by combining other tracks. Most noteworthy, the phased haplotypes from six ''core" cattle groups are displayed in ''SNPs&Hap" track. The 'squish' or 'pack' view highlights local patterns of genetic linkage between variants. In the haplotype sorting display, variants are presented as vertical bars with reference alleles in blue and alternate alleles in red so that local patterns of linkage can be easily discerned Displayed is an example for the KIT gene, which is related to coat color in Herefords. D. The more detailed visualization ''CNVR Bar" track in the format of a box-whisker plot, displaying copy number distribution in 49 cattle populations. Displayed is an example for CIITA, which lies within a high-frequency gain CNVR identified in multiple breeds that show nematode resistance. CNVR, copy number variation region; Gbrowse, UCSC Genome Browser. when clustering is used to visually group co-occurring allele sequences in haplotypes. We display different haplotypes of the Bos taurus and Bos indicus groups in Figure 4C. We highlight that the tracks of selection statistics from different populations are visualized in different colors ( Figure 4D).
We also introduced two sequence alignment tools, webBlat and NCBI BLAST, as well as a genome coordinate conversion tool (liftOver) [29] into the BGVD. The webBlat tool can be used to quickly search for homologous regions of a DNA or mRNA sequence, which can then be displayed in Gbrowse. BLAST can find regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between sequences. The liftOver tool is used to translate genomic coordinates from one assembly version into another. BGVD provides an online lift from Btau_5.0.1 to UMD_3.1.1 and from Btau_5.0.1 to ARS-UCD1.2.

Discussion
By applying summary statistics to a relatively extensive data set from cattle genomes, we provide a timely and expandable resource for the population genomics research community. An associated user-friendly genome browser gives a representation of the genetic variation in a genomic region of interest and offers functionality for an array of downstream analyses. We expect that BGVD will prove useful for genome mining through the large number of test statistics and the fine- grained character of resequencing data. We believe that this expandable resource will facilitate the interpretation of signals of selection at different temporal, geographical, and genomic scales.