g_contacts: Fast contact search in bio-molecular ensemble data ✩ , ✩✩

Short-range interatomic interactions govern many bio-molecular processes. Therefore, identifying close interaction partners in ensemble data is an essential task in structural biology and computational biophysics. A contact search can be cast as a typical range search problem for which efficient algorithms have been developed. However, none of those has yet been adapted to the context of macromolecular ensembles, particularly in a molecular dynamics (MD) framework. Here a set-decomposition algorithm is implemented which detects all contacting atoms or residues in maximum O ( N log ( N )) run-time, in contrast to the O ( N 2 ) complexity of a brute-force approach


Introduction
Molecular dynamics (MD) integrators allow simulations of large bio-molecular systems comprising millions of atoms on nanosecond to millisecond time scales [1,2]. These simulations produce a substantial amount of trajectory data, which typically consist of 10 4 -10 6 structure ''snapshots'' (frames). The computational effort to generate the trajectory data scales with O(N log(N)), where N is the number of simulated particles. Efficient analysis tools to extract certain observables from these data are required that exhibit a comparable scaling behavior to the algorithms that generate the trajectory data.
Identifying all atoms of a solute molecule which interact with the solvent, or all close atoms from different subunits of a molecular complex, is a recurring task. From a computational perspective, these tasks require one to identify all pairs of atoms that are closer to one another than a defined minimum contact distance. This task has been described as a spherical range search problem [3]. Here, we describe the efficient implementation of an algorithm to obtain contacting atom pairs of two sets of atoms and respective trajectory contact frequencies. The modified k-dimensional tree approach employed has a worst-case run-time of ∝ O(N log(N)) for two sets of size N compared with a run-time ∝ O(N 2 ) of a bruteforce approach. This high efficiency is achieved by excluding sets of distant atoms from the distance calculation. Combined with the excellent scaling properties of the method on parallel machines, this advantage will be particularly pronounced in future exascale computing applications.
The routine is implemented within gromacs [4]. Due to the versatile implementation, it can also be applied to other threedimensional contact searches. Extension to higher dimensions is straightforward.

Task
Given two sets of labeled atoms, A = {a i } and B = {b j }, and a minimum contact distance d, the task of the algorithm described here is to identify all contacting atom pairs, i.e., all pairs of atom require the calculation of the Euclidean distance between all possible pairs of atoms. The set decomposition scheme implemented here drastically reduces the number of necessary distance calculations and therefore the run-time.

Algorithm
For simplicity of presentation, we first assume the special case where one of the two sets, B, contains only one atom b ( Fig. 1(a)). This case will subsequently be generalized to arbitrary sets A, B ( Fig. 1(b), (c)).
As a first step, the minimum bounding box (bbox, yellow) of set A with sides aligned to the x, y, and z axes is determined. If the distance of b to the box boundary along the direction of the three coordinates exceeds a given contact distance d, b is not in contact with A, and the contact search terminates. Otherwise, A is decomposed into two subsets A = A 1 ∪ A 2 [3]. If the distance of the two child bboxes (gray, yellow) of subset A 1 to b or A 2 to b exceeds d, the respective subset is discarded. Alternatively, the child bbox is further decomposed into two disjoint subsets (not shown), and so on. Decomposition is terminated when all subsets contain less than a given minimum number of atoms. As a final step, the distances of b to all atoms in the remaining subsets are determined, and the indices i are stored for which ∥a i − b∥ < d. Fig. 1(b) generalizes the above decomposition procedure to a set B comprising more than one atom. In this case, the bbox is also determined for B (blue), and B is also recursively split into subsets. For each subset B q , all sets A p overlapping with B q are stored. When the decomposition terminates, only the distances for atom pairs i, j in stored pairs of sets A p , B q need to be calculated.

Application to ensemble data
The algorithm is applied to each frame of a given trajectory. Atom pair contacts are counted each frame. From these, the contact frequency is calculated by dividing the contact count by the number of frames analyzed. In addition to atom contact frequencies, residue contact frequencies are determined by defining two residues to be in contact if any of their respective atoms are in contact.

Efficiency
Four particular properties of the implemented algorithm render it efficient. First, subsets are decomposed along the median atom coordinates, which allows for the application of the median sort algorithm [5] such that the number of atoms in each subset is balanced in minimum run-time. Second, the order of atoms is kept from the previously analyzed frame. Thus the sorting effort is reduced if similar frames are analyzed. Third, after splitting a set A p into subsets A p 1 , A p 2 , overlap with subsets B q ⊂ B only needs to be checked if B q overlapped with A p in the previous step, thus saving a large fraction of overlap checks for newly generated sets. Fourth, decomposition is stopped as soon as the brute-force approach to identify contacts between subsets A p , B q becomes on average more efficient than further decomposition at an empirically determined upper boundary for the minimum set size n min .

Software structure
The contact search algorithm described here is implemented in C99. It uses the gromacs application programming interface (API) provided with the MD package gromacs 4.6 [4].

Data input
Input arguments are trajectory file names (flagged -f), a gromacs index file name that contains two index groups specifying each set of atoms (-n), a floating-point number that holds the minimum contact distance in nm (-d, by default d = 0.3 nm), and the threshold for the largest number of atoms in any node (-bsize). If the option (-resndx) is chosen, a gromacs structure file (-s) is read.

Optional switches
The optional switch -nopbc ignores periodic boundary conditions, speeding up the calculation; -resndx calculates the contacts between two groups of residues instead of two groups of atoms.

Data output
Contact frequencies are written to an output file (name given in -o). If the flag -resndx is set, an additional index file (name given in -on) is written, which contains one index group for each contacting residue and its atom indices.

Example runs
We performed example runs on a typical test case as well as on a worst-case scenario.
A typical case is provided by a simulation of adenosine triphosphate (ATP) molecules in solution which bind to ribonucleic acid (RNA) [6]. We used an MD simulation of a solvated RNA molecule comprising 1166 atoms and two ATP molecules in solution comprising 86 atoms. The atom pairs and contact frequencies of RNA and ATP that are closer than d = 0.3 nm were determined for 20 000 frames of the simulation. The index-file reads: The command to analyze the given trajectory traj.xtc is: g_contacts -f traj.xtc -n index.ndx .
A worst-case scenario is provided by two highly overlapping sets of atoms, where many set decompositions are required, and only a few subsets can be excluded from the contact search. The example considered here is a contact search in trajectories of a 1 ns simulation of TIP3P water in a periodic cubic water box of 5, 6, 7, 8 and 9 nm length (i.e., 12 426, 21 483, 34 251, 51 393, and 72 768 atoms, respectively), which were screened for contacts between sets of N consecutively labeled atoms. Contacts were searched between N = 1, 21, 41, . . . , 981 atoms. The default distance cutoff of d = 0.3 nm was applied.
The respective index-file for N = 21 and a 12 426 atom simulation reads: The analysis of the trajectory stored in traj.xtc was performed issuing the following command: g_contacts -f traj.xtc -n index.ndx .
The residue-based contact search determines contacts between water molecules, and was performed using g_contacts -s traj.gro -f traj.xtc -resndx .

Comparison with other methods
To compare the set-decomposition algorithm and the bruteforce approach, the CPU clock cycles were counted that were required for the respective contact search and the storage of the contacts, excluding trajectory-file reading routines.
For the system containing ATP and RNA, the required CPU cycles for contact search and storage were recorded for 101 analyzed frames, which were analyzed every 0.2 ns in a 20 ns trajectory. A speed-up of 9.1-fold was obtained for the implementation of the set decomposition algorithm over the brute-force approach.
In the water box simulation, the required CPU cycles were averaged over the analysis of 45 frames each. Fig. 2 illustrates that our approach exhibits the expected N log(N) scaling, even in this worst-case scenario. In contrast, the brute-force approach scales quadratically. For smaller boxes, a deviation from the ideal scaling behavior is observed. We attribute this deviation from N log(N) scaling to the fact that the number of contacts increases with decreasing box-size. The sorted list of lists approach employed for book-keeping of the contact pairs found results in a slightly worse overall complexity than N log(N) when very many contacts are found. Further, set decomposition works more efficiently if the subsets are less likely to overlap, which is the case for larger water boxes, explaining the different scaling offsets.
In the current implementation, the ''cross-over'' in efficiency between the set-decomposition algorithm and the brute-force algorithm (Fig. 2) is seen at set sizes of ≈40 atoms each, where the set-decomposition algorithm becomes faster. The setdecomposition algorithm reaches a speed gain of about ten-fold at ≈900 atoms per set.