Rapid storage and retrieval of genomic intervals from a relational database system using nested containment lists

Efficient storage and retrieval of genomic annotations based on range intervals is necessary, given the amount of data produced by next-generation sequencing studies. The indexing strategies of relational database systems (such as MySQL) greatly inhibit their use in genomic annotation tasks. This has led to the development of stand-alone applications that are dependent on flat-file libraries. In this work, we introduce MyNCList, an implementation of the NCList data structure within a MySQL database. MyNCList enables the storage, update and rapid retrieval of genomic annotations from the convenience of a relational database system. Range-based annotations of 1 million variants are retrieved in under a minute, making this approach feasible for whole-genome annotation tasks. Database URL: https://github.com/bushlab/mynclist


Introduction
A typical genomic annotation is represented in the form of an interval (i.e. a range of base-pair positions), such as the boundaries of a gene. Millions of interval-based genomic annotations are available from multiple online resources, most notably the UCSC genome browser, which provides genomic elements in BED file format (chromosome, start, stop and label). Although this format is convenient for browsing a single genomic region or base-pair position through an online genome browser, large collections of genomic intervals are difficult to rapidly search. A common task in genomics is to search interval data (such as a collection of BED files) to identify a set of intervalbased annotations that overlap with a target interval. For example, a user may want to identify genomic elements located within a deleted region. The main challenge for these types of interval-based queries is maintaining a sorted order of intervals to facilitate rapid searches-a task known as indexing within database management systems (DBMS).
Indexing is complicated by nested intervals, in which one interval occurs entirely within the boundaries of a second interval. If no nested intervals are present, sorting intervals by their start position also properly sorts their end positions. Relational database indexing strategies, which assume a single key, can thus properly order non-nested intervals. However, when a nested interval occurs, sorting based on start position no longer guarantees that the end positions will be properly ordered, as shown in Figure 1. To better understand the consequences of this problem, consider the more familiar task of sorting date intervals.
may stay for only a short time, whereas others spend their entire careers with the same team. As such, one player's career with that team may be entirely nested within another's, illustrated in Figure 1. Now consider the question: who played for this team in year 2008? If player careers were non-nested and ordered (as for players A-E), we could identify the last player to join in 2008, and then work our way backward through the sorted list until we found the first player (C) who quit before 2008. However, consider a second set of players (L-P), if the team membership of some players is nested within the membership of other players, there could be a player (M) who joined before player N but stayed with the team after 2008. Because we can draw no conclusions about the ordering of when players quit, we must scan all players who joined before year 2008. In the worst case, we ask about the current year, requiring us to scan everyone who ever played for the team.
In aggregate, table scans like these become very slow, reducing the feasibility of relational databases for genomic annotation tasks (1). An interval-based data structure called Nested Containment List (NCList) was developed to address this issue (2). The NCList data structure and associated algorithms were released as part of a Python graph database library, pygr. This implementation achieved 5-500-fold faster query times than other DBMS-style indexing methods available at the time. Since publication, the NCList structure has been used for sequence alignment using UCSC genome alignments (3,4), processing ChIP-seq and ChIP-chip data (5), and has been incorporated into the popular JBrowse genome browser (6).
The NCList data structure achieves better query performance by hierarchically organizing all nested intervals. This guarantees that each search space consists of contiguous non-nested intervals (i.e. the search space is ordered both on start and stop). Each interval points to a sublist of all completely contained intervals. The query algorithm follows a recursive path that returns all overlapping intervals within all contained sublists. The time complexity of this type of query is O(MlogN), where M is the depth of the tree and N is the database size.

Implementation
Construction of the NCList data structure consists of a Python script that accepts interval annotations in the BED file format. Intervals are processed and placed into a nested containment tree. After tree generation, the script prepares the MySQL data structure and stores the data into node, edge and masterkey tables. The masterkey table uses a numerical identifier to link each interval stored in the node table to its alphanumeric label from the original BED file. Because the intervals in the node table are hierarchical, the edge table is used to link parent intervals to their nested child intervals. The node table contains the id, start and stop positions of all intervals in the structure and a sublist (sub) value. All intervals contained by the same parent are assigned the same sub value, thus grouping intervals in a hierarchical relationship. The base intervals (i.e. those intervals that do not fit within any other interval) are assigned to sub 0, the root of the tree. When a range may fit within multiple partially overlapping intervals, the child interval is assigned based on a user-specified parameter; the default is to assign the interval to the first parent in the sorted order. The tree construction algorithm begins by sorting all intervals by their start positions. Construction of the nested containment tree is performed in a single pass.
In-place updates are accomplished using two stored procedures (addition and removal of intervals) in MySQL. To update the coordinates of an interval, it must be removed and reinserted with the new start and stop positions. To increase the speed of these procedures, we create two binary tree indices on the node table; one on the sub, start and stop position of each range, and one on the start and stop values only. The edge table is indexed by interval id and the sub values.
The deletion mechanism is less complex, as two interval types (A and D) can be deleted without any alterations. Interval E requires the entry in the edge table pointing from its parent be removed. The children of interval B are reassigned to sub 0 and the entry in the edge table removed. Interval C is similar, here the children are assigned the sub value of interval C, and the entry in the edge table is removed.

Results and discussion
We evaluated the performance of the MyNCList process using a combined set of multiple gene exons with 973 000 intervals. Construction of the Nested Containment Graph completes within seconds, but does require the BED file to be entirely loaded into memory. The insert mechanism for this database adds 400 intervals per minute, whereas the delete mechanism removes 2800 intervals per minute. The scalability of these operations is typical of other tree-based data structures.
We compared the query performance of MyNCList with two other popular indexing strategies; a partition-based index and a start/stop position multi-index. We generated tables containing random sets of single base-pair positions in increasing orders of magnitude (1000, 10 000, 100 000 and 1 000 000) drawn uniformly across the human genome to simulate variable sequence positions. We then annotated these positions using a test database derived from exons extracted from the Ensembl, Aceview and UCSC genome databases, equaling 973 000 intervals, reflective of a true annotation task. Simulated variant positions were then joined to the test database using the multi-index strategy, using the partitioning strategy and using the MyNCList stored procedure. All database operations were performed with MySQL query and key cache features disabled. The performance of these methods for our test database is shown in Figure 3. Notably, annotation time for the indexed tables scales linearly with the number of locations, whereas annotation time for MyNCList scales logarithmically. As such, a query of 100 000 positions completes in 9.8 s using MyNCList, but requires several hours to complete with other indexing strategies. A table of 4.7 million positions is annotated in 3.1 min using MyNCList, exceeding the performance of the ANNOVAR, making this a viable method for highthroughput annotation of variants discovered from wholegenome sequence data (1). We also expect these measures to be conservative, as optimization of the MySQL key cache and other features will improve performance.

Availability
The Python script and MySQL package are available for non-commercial research institutions. For full details see http://chgr.mc.vanderbilt.edu/bushlab/.