Parallel Algorithm for Discovering and Comparing Three-Dimensional Protein Patterns

Identifying conserved (similar) three-dimensional patterns among a set of proteins can be helpful for the rational design of polypharmacological drugs. Some available tools allow this identification only from a limited perspective, considering just the available information, such as known binding sites or previously annotated structural motifs. Thus, these approaches do not look for similarities among all putative orthosteric and/or allosteric binding sites between protein structures. To overcome this weakness, Geomfinder was developed: an algorithm for estimating the similarity between all pairs of three-dimensional amino acid patterns detected in any two given protein structures, which works without information about their known patterns. Although Geomfinder is a functional alternative for comparing small protein structures, it is computationally unfeasible for large proteins, and its performance needs improvement. This work presents several parallel versions of Geomfinder that exploit SMPs, distributed-memory systems, hybrid SMP/distributed-memory systems, and GPU-based systems. Results show significant performance improvements over the original version, achieving up to 24.5x speedup when analyzing proteins of average size and up to 95.4x with larger proteins.

I. INTRODUCTION
The identification of conserved three-dimensional (3D) patterns (i.e., groups of spatially related amino acids) that allows identifying new shared ligand-binding sites on a set of protein structures represents a promising opportunity for the rational design of drugs with a polypharmacological profile [1], [2]. Interestingly, these types of multi-target drugs might show better safety and efficacy profiles than more selective compounds; for example, multi-kinase inhibitors have shown a positive performance against diseases such as osteomyelitis or some types of cancer [2], [3]. Furthermore, recently FDA-approved multi-target antibiotics have been shown to be more effective than traditional single-target drugs [4]. Unfortunately, detecting similarities between ligand-binding sites (e.g., orthosteric and/or allosteric sites) in protein structures remains a difficult task. Technically, there are two basic methodologies for comparing amino acid structural patterns: alignment-free and alignment-based. The first, which is frequently less computationally expensive than alignment-based methods, computes similarity by comparing the shape and chemical properties of sites, measuring the relative changes or differences between descriptors. Alignment-based techniques, on the other hand, may employ algorithms based on graphs, hashes, or cliques to perform structural superimposition, compute local fragment similarities, and/or search for similarities within annotated pocket databases [5], [6]. Nonetheless, most current computational tools for detecting 3D pattern similarity, such as PDBeMotif [7], RASMOT-3D PRO [8], ASSAM [9], IMAAAGINE [10], or PatternQuery [11], focus on identifying similarities between amino acids located at orthosteric binding sites or in previously known patterns such as motifs, catalytic sites, or active sites [12]. A 3D pattern, as operationally defined in our previous report [13], is a group of amino acids that are separated by a user-defined distance and whose size limits are defined by two parameters: Near Threshold (Nt) and Far Threshold (Ft). These parameters establish that a set of amino acids constitutes a 3D pattern if each residue is at least Nt away from the same reference coordinate in the virtual grid and at most Ft away from any other. As a result, Nt and Ft can be interpreted as the dimensional limits of the 3D patterns being investigated. These definitions are fundamental for comparing patterns, pockets, motifs, or binding sites, which, unlike global protein alignments (of structures or sequences), reveal conserved features even in highly different and/or unrelated proteins. For these purposes, 3D pattern similarity measurement methods must incorporate into their calculation protocols several time-consuming features of amino acids, such as atomic coordinates, physicochemical properties, and topological conformation, among others [14]. Fortunately, advances such as parallel strategies (based on GPUs) and learning-based approaches have enabled much faster binding site similarity comparisons, which have been useful to identify functional amino acid patterns in protein structures [15], detect amino acid patterns associated with the occurrence of severe dengue infection [16], or identify amino acid motifs that predict new protein-drug interactions [17]. A few years ago, we developed Geomfinder, an alignment-free and ligand-independent tool for discovering similar 3D patterns between any pair of protein structures, as a novel strategy for the rational design of multi-target drugs [13]. Because the Geomfinder algorithm
does not require any information about ligands or previously known binding sites of the queried structures, several novel similar 3D patterns can be discovered. Even though Geomfinder detects more 3D patterns than other tools, these findings imply time-consuming algorithmic methods and functions. For example, comparing the crystal structures of Pim1 (PDBid:2O3P) and Phosphoinositide 3-Kinase (PDBid:1E8W) requires more than 230,000 measures across all of the descriptors defined for each 3D pattern found in the protein structures; with the same proteins but using the deepest input parameters, more than 500 million measures are required. In this context, a new tool for the discovery of conserved 3D patterns among several protein structures, taking the sequence component and the RMSD value into account, was developed using some simplified methods incorporated in Geomfinder. Even though this new tool allows for the simultaneous analysis of multiple protein structures, its less rigorous search methods do not provide enough confidence in the results to use them as input for designing multi-target drugs or establishing new evolutionary relationships between biological elements [18]. Geomfinder results would be adequate for these purposes, given its comparison methods that take into account the structural and physicochemical properties of the protein structures, but it is time-consuming. Thus, new Geomfinder features are required to achieve greater biological significance when analyzing a large number of protein structures. In the present work, we describe technological enhancements to the Geomfinder method, whose biological application was recently reported as the "Case of Tafluprost and its Effects on Cardiac Ion Channels" [19]. The new strategies were developed using various parallel programming models for shared memory architectures (OpenMP), distributed memory architectures (MPI and hybrid MPI-OpenMP), and GPU-based systems (CUDA). Geomfinder 2.0 can be freely accessed at https://geomfinder2.appsbio.utalca.cl/ and https://gitlab.com/amvaldesj/geomfinder

II. GEOMFINDER ALGORITHM
The original version of Geomfinder [13] is based on the detection and comparison of four energetic and structural descriptors among all sites (3D amino acid patterns) of any two protein structures: i) the list of distances between the geometric centers of the side chains of all the residues forming the site; ii) the sum of the short- and medium-range non-bonded energy of each residue forming the site; iii) the list of residues forming the site; and iv) the list of distances constituting the shortest pathway necessary to go over all the residues lining the site. Geomfinder processes each protein structure to discover sites (3D patterns) and calculate their descriptors; it then compares the features among all sites and determines the similarity between them.
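
As a concrete illustration, the four descriptors can be grouped into a single per-site record. The following C++ sketch is not code from Geomfinder; the field names and types are assumptions made only to fix the data layout used in the later sketches.

    // Per-site record holding the four descriptors described above.
    // A sketch: field names and types are assumptions, not Geomfinder's code.
    #include <string>
    #include <vector>

    struct SiteDescriptors {
        std::vector<float> distances;      // i) distances between the geometric
                                           //    centers of the side chains
        float nbe;                         // ii) sum of short- and medium-range
                                           //     non-bonded energies
        std::vector<std::string> residues; // iii) residues forming the site
        std::vector<float> tsp;            // iv) shortest pathway over all the
                                           //     residues lining the site
    };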

A. Discovering (Finding) Sites
For each protein structure, a KDTree [20] data structure containing all the atomic x, y, z coordinates is built. In addition, a virtual grid of coordinates is created using the user-defined distance (input data; radius in Å) (Fig. S1 in the Supplementary Data). Once the KDTree and the virtual grid have been created, all the 3D patterns can be found. The site search process (see Algorithm 1) consists of looking for nearby amino acids in the KDTree structure for each virtual coordinate of the previously created grid. The algorithm receives the list of virtual coordinates (vcoord), the kdtree structure, and the input parameters uMin and uMax, which are the minimum and maximum search ranges, respectively. For each virtual coordinate i, several measures are carried out depending on the range determined by the uMin and uMax input parameters. For a site to be valid, at least four amino acids must be close to the virtual coordinate and no residue may be repeated; duplicate sites are discarded as well. If the site is valid, the aforementioned descriptors are calculated for it. After that, the two lists of sites, one per protein structure, are compared in an all-to-all approach. The complexity of this algorithm is O(n × (uMax − uMin)), where n is the number of virtual coordinates and uMax − uMin is the search range.
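
A minimal C++ sketch of this search loop is shown below. To keep it self-contained, the KD-tree radius query is replaced by a brute-force scan; the names (Point, Site, findSites, radiusSearch) are illustrative, not Geomfinder's actual identifiers.

    // Sketch of Algorithm 1: for each virtual coordinate, search residues at
    // every radius in [uMin, uMax]; keep sets of >= 4 distinct residues and
    // discard duplicate sites.
    #include <cmath>
    #include <set>
    #include <vector>

    struct Point { float x, y, z; };
    using Site = std::vector<int>;   // indices of the residues forming a site

    // Stand-in for the KD-tree query: residues within 'radius' of 'c'.
    static std::vector<int> radiusSearch(const std::vector<Point>& residues,
                                         const Point& c, float radius) {
        std::vector<int> hits;
        for (size_t i = 0; i < residues.size(); ++i) {
            float dx = residues[i].x - c.x, dy = residues[i].y - c.y,
                  dz = residues[i].z - c.z;
            if (std::sqrt(dx*dx + dy*dy + dz*dz) <= radius)
                hits.push_back((int)i);
        }
        return hits;
    }

    std::vector<Site> findSites(const std::vector<Point>& vcoords,
                                const std::vector<Point>& residues,
                                int uMin, int uMax) {
        std::vector<Site> sites;
        std::set<Site> seen;                      // rejects repeated sites
        for (const Point& vc : vcoords)           // n virtual coordinates
            for (int r = uMin; r <= uMax; ++r) {  // (uMax - uMin) ranges
                Site s = radiusSearch(residues, vc, (float)r);
                if (s.size() >= 4 && seen.insert(s).second)
                    sites.push_back(s);           // valid, previously unseen
            }
        return sites;
    }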

B. Comparing Sites
All sites detected in each pair of protein structures are compared in an all-to-all approach. Algorithm 2 shows how these comparisons are made and how the scores are obtained. The algorithm receives the lists (s1 and s2) of the sites found in each protein and the input parameters distPCT, nbePCT, ascPCT, and tspPCT (the percentage contribution of each descriptor), as well as scoreMin (the minimum score value). At each iteration of the innermost loop, a pair of sites is compared and the scores for all four descriptors are calculated (s_dist, s_tsp, s_nbe, s_asc). A final score (s_f) is calculated considering the score and the contribution percentage of each descriptor. Finally, a list of the sites (and their calculated scores) that reach the minimum final score is created and returned. The number of sites found in each protein might not be the same, so this algorithm is O(n × m), where n and m are the numbers of sites found in each protein, respectively. Both Algorithms 1 and 2 have operations in their innermost loops that can be performed independently and do not depend on other calculations; therefore, they are embarrassingly parallel.
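
The comparison loop can be sketched as follows, reusing the Site type from the previous sketch. The per-descriptor score functions are placeholders standing in for Geomfinder's actual descriptor comparisons; only the weighted final score and the scoreMin filter follow the text directly.

    // All-to-all comparison of the two site lists (a sketch of Algorithm 2).
    struct Score { int i, j; float dist, nbe, asc, tsp, final_; };

    // Placeholders for the per-descriptor comparisons (hypothetical).
    static float distScore(const Site&, const Site&) { return 0.0f; }
    static float nbeScore (const Site&, const Site&) { return 0.0f; }
    static float ascScore (const Site&, const Site&) { return 0.0f; }
    static float tspScore (const Site&, const Site&) { return 0.0f; }

    std::vector<Score> compareSites(const std::vector<Site>& s1,
                                    const std::vector<Site>& s2,
                                    float distPCT, float nbePCT,
                                    float ascPCT, float tspPCT,
                                    float scoreMin) {
        std::vector<Score> scores;
        for (size_t i = 0; i < s1.size(); ++i)      // O(n x m) comparisons
            for (size_t j = 0; j < s2.size(); ++j) {
                Score sc{(int)i, (int)j,
                         distScore(s1[i], s2[j]), nbeScore(s1[i], s2[j]),
                         ascScore(s1[i], s2[j]),  tspScore(s1[i], s2[j]),
                         0.0f};
                // Weighted final score: each descriptor contributes according
                // to its user-defined percentage (e.g., 25% each by default).
                sc.final_ = (sc.dist * distPCT + sc.nbe * nbePCT +
                             sc.asc * ascPCT  + sc.tsp * tspPCT) / 100.0f;
                if (sc.final_ >= scoreMin) scores.push_back(sc);
            }
        return scores;
    }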

C. Sequential Version Profile
The original version of Geomfinder [13] is a sequential version developed in Python. The baseline Geomfinder version used in this paper is a C/C++ port of the original Geomfinder. A timing and profiling analysis of the baseline Geomfinder has been done for a uMin value of 3 Å and for uMax between 5 Å and 11 Å. All tests use chain A of the crystal structures of the Beta-1,4-Glycanase CEX-CD (PDBid:1EXP; 312 amino acids) and the Pim1 protein (PDBid:2O3P; 274 amino acids). A radius of 3 Å is used to construct the virtual grid of coordinates, generating grids of 316 and 276 virtual coordinates for the 1EXP and 2O3P proteins, respectively. The percentage contribution of each descriptor is set to 25%. For each range evaluated, Fig. 1 shows the percentage of the total elapsed time spent in the functions find_sites() and compare_sites() (Algorithms 1 and 2, respectively). The number of sites discovered in both proteins, the number of comparisons made, and the time (in seconds and as a percentage of the total elapsed time) invested in these tasks are shown in Table I. A large share of the time is dedicated to the comparison of sites (≈90%), and a significant increase in the total elapsed time is observed once one million comparisons are reached. This is because compare_sites() is a quadratic function. In addition, the broader the search range, the larger the number of sites, since the sites found for each value in the range are accumulated. At the same time, the function find_sites() is more stable in its time consumption, since it is a linear function (uMax − uMin can be considered a constant). One aspect to consider is that a protein could have a large number of amino acids; however, it is its shape (elongated rather than spherical, for example) that may determine the number of sites that can be found.

III. PARALLEL ALGORITHMS
Considering the previous analysis, as the number of sites to be compared increases, both the elapsed time and the percentage of time used in this task increase considerably, reaching levels that make it worthwhile to parallelize this function. In addition, the sequential site comparison function (Algorithm 2) shows that the iterations of its innermost loops are independent and therefore embarrassingly parallel. Table S9 (in the Supplementary Data) shows details of the ideal achievable speedups according to the percentage of parallelizable code using 25 parallel processes. Fine-grained parallelism at the loop level has been used as the parallelization strategy, in which each task processes a group of iterations, in this case, a group of comparisons. The different parallel versions are presented below.
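
These ideal speedups follow Amdahl's law: if p is the parallelizable fraction of the execution time and N the number of parallel processes, the achievable speedup is bounded by S(N) = 1 / ((1 − p) + p/N). As an illustrative example (not a figure from the paper), with N = 25, p = 0.90 gives S ≈ 7.4, while p = 0.99 yields S ≈ 20.2, which shows why the benefit of parallelization grows quickly with the fraction of time spent in compare_sites().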

A. OpenMP
OpenMP [21], [22] provides the directive #pragma omp for collapse(level), which collapses nested loops (defined by the level value) into a single loop, splits the iterations according to a schedule clause, and parallelizes them. Algorithm 3 shows how the compare_sites() function has been modified to incorporate the OpenMP directives. A temporary array thread_scores[nthreads] of size nthreads is defined; each position stores a list of the scores calculated in the corresponding thread's assigned iteration range. A parallel region of nthreads size is created, and each thread obtains its identifier within the parallel region. This id is used in the innermost loop to index the thread_scores[id] array. The two loops are collapsed into one using the directive #pragma omp for collapse(2) schedule(runtime), thus assigning enough work to all threads. The scheduling option has been set to runtime in order to perform a design space exploration and determine which schedule performs best. Both the number of threads and the loop schedule are defined at run time.
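
A condensed C++/OpenMP sketch of this scheme is shown below, reusing the types of the earlier sketches; comparePair() is a hypothetical helper wrapping the per-pair score computation of Algorithm 2.

    #include <omp.h>

    // Hypothetical per-pair helper (the body of Algorithm 2's inner loop);
    // a placeholder implementation keeps this sketch compilable.
    static Score comparePair(const Site& a, const Site& b, float distPCT,
                             float nbePCT, float ascPCT, float tspPCT) {
        Score sc{};  // real code computes the four descriptor scores here
        return sc;
    }

    // Sketch of Algorithm 3: per-thread score lists plus collapsed loops.
    std::vector<std::vector<Score>> compareSitesOMP(
            const std::vector<Site>& s1, const std::vector<Site>& s2,
            float distPCT, float nbePCT, float ascPCT, float tspPCT,
            float scoreMin, int nthreads) {
        // One private score list per thread; no locking on a shared list.
        std::vector<std::vector<Score>> thread_scores(nthreads);
        #pragma omp parallel num_threads(nthreads)
        {
            int id = omp_get_thread_num();   // index into thread_scores
            // Collapse both loops into a single iteration space; the actual
            // schedule is chosen at run time (OMP_SCHEDULE).
            #pragma omp for collapse(2) schedule(runtime)
            for (size_t i = 0; i < s1.size(); ++i)
                for (size_t j = 0; j < s2.size(); ++j) {
                    Score sc = comparePair(s1[i], s2[j], distPCT, nbePCT,
                                           ascPCT, tspPCT);
                    if (sc.final_ >= scoreMin) thread_scores[id].push_back(sc);
                }
        }
        return thread_scores;
    }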

B. MPI
In this version with MPI [23], the strategy consists of equitably distributing the number of comparisons to be made. Each node (master and slaves) processes a chunk of the total calculations. A struct Score {float nbe, float dist, float tsp, float asc, float final} is defined to store the scores, and an MPI_Datatype score_type is created to send/receive these scores. This new datatype contains five MPI_FLOAT values. Algorithm 4 shows how the master node communicates and works with the slave nodes. The arrays of sites s1 and s2, the number of nodes (nodes) to be used, and the parameters distPCT, nbePCT, ascPCT, tspPCT, and scoreMin are passed to the algorithm. First, the master node (rank 0) packs (with MPI_Pack) the arrays of sites s1 and s2 into the msg1[BUFSIZ] and msg2[BUFSIZ] buffers, respectively. These MPI_PACKED packets are then sent to each node (including the master node) using the MPI_Bcast function. After the packets are received on all nodes, the master node defines a scoring matrix scores[s1.size][s2.size] of struct Score to store all site comparisons made on the same node and on the slave nodes. A chunk value is also calculated and used to define how many sites of the first array are compared against all sites of the second array on each node (Fig. 2 shows how the comparisons are distributed to the nodes). In addition, an offset value is calculated to control the movement both in the array with the sites of the first protein and in the scoring matrix. After these calculations, the numbers of sites of both proteins (s1.size and s2.size), the chunk value, and the offset value are sent to the (nodes − 1) slaves. The corresponding chunk of comparisons is made in the function compare_chunk() (see Algorithm 5) and the result is copied into the scoring matrix; then the rest of the results are awaited from the slave nodes. Each slave returns its assigned offset value and the calculated scoring items, which are copied directly into their respective positions in the scoring matrix. On the other hand, the slave nodes (ranks greater than 0) receive the number of sites of each protein (N and M), the chunk, and the offset values. Then, the two MPI_PACKED packets received previously (via MPI_Bcast), which contain the sites of each protein, are unpacked (using the MPI_Unpack function) into two arrays, s1[N] and s2[M], respectively. A temporary scoring matrix scores_chunk[chunk][M] of struct Score is defined to store the local site comparisons. With the chunk and offset values, the start and stop indices of the range of the array s1[N] are calculated; thus, only this range of sites of the array s1[N] is compared against all sites of the array s2[M]. The function compare_chunk() (Algorithm 5) is called to perform the corresponding comparisons. After the comparisons are made, the offset value and the temporary scoring matrix scores_chunk are sent to the master node. The function compare_chunk() has been created to perform part of the total comparisons (see Algorithm 5). The inputs to this function are the range (start and stop) of the array of sites of the first protein, the matrix of temporary scores (s_chunk), the arrays with the sites (s1 and s2), and the user parameters (distPCT, nbePCT, ascPCT, tspPCT). The outer loop i controls the range of sites in the first array that are compared against all sites in the second array (inner loop j). The s_chunk matrix is indexed as a one-dimensional array, and an auxiliary variable k is used to calculate its indices.
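
The two most concrete pieces of this scheme, the score datatype and the chunk/offset arithmetic, can be sketched as follows; remainder handling, packing, and error checks are omitted, and the function names are illustrative.

    #include <mpi.h>

    // The five-float score record and its matching MPI datatype, as
    // described in the text.
    struct Score { float nbe, dist, tsp, asc, final_; };

    static MPI_Datatype makeScoreType() {
        MPI_Datatype score_type;
        MPI_Type_contiguous(5, MPI_FLOAT, &score_type);  // five MPI_FLOATs
        MPI_Type_commit(&score_type);
        return score_type;
    }

    // Illustrative chunk/offset arithmetic: node 'rank' compares 'chunk'
    // rows of s1 (starting at 'offset') against all M sites of s2, storing
    // them in scores_chunk[chunk][M]; the master later copies each slave's
    // block into the global scoring matrix at the same offset.
    static void chunkBounds(int rank, int nodes, int N,
                            int* chunk, int* offset) {
        *chunk  = N / nodes;        // comparisons per node (remainder
                                    // handling omitted in this sketch)
        *offset = rank * (*chunk);  // start row in s1 and in scores[N][M]
    }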

C. MPI-OpenMP
In the previous MPI version, parts of the total comparisons are distributed across the nodes (master and slaves). In this hybrid version, those parts are additionally parallelized locally on each node with OpenMP. Algorithm 6 shows the modified version of the function compare_chunk() (Algorithm 5). Two additional parameters are required: nthreads, to set the number of threads, and chunk, to ensure that the number of indices built does not exceed the number of comparisons assigned to the node. The loops are parallelized and collapsed using the #pragma omp for collapse(2) schedule(runtime) directive. The calculations are done in the same way as before, but now in parallel. Both the number of threads and the loop schedule are defined at run time.

D. CUDA
In the CUDA version, a one-dimensional score array sc_d[N×M] is created on the device to store the results of all comparisons (see Fig. 3); the arrays of sites s1_d[N] and s2_d[M] are also created on the device. To pin the data of each array of sites on the host, the cudaHostRegister function is used. A two-dimensional grid is defined as numBlocks(ceil(N/blockDim.x), ceil(M/blockDim.y)), where N and M are the numbers of sites of the first and second protein, respectively. Fig. 3 shows an example of how the work is decomposed. A kernel function is created to perform all the comparisons on the device (see Algorithm 7). This function receives the arrays, the number of sites found in each protein, and the percentage contribution of each site descriptor. The index idx of sc_d[idx] is calculated from the i, j, and M values. The kernel function is called only once. The functions Td(), Nbe(), and T() are implemented on the device as well.
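
A condensed sketch of such a kernel and its launch is shown below. The SiteDev layout and the placeholder descriptor values are assumptions; only the index arithmetic, the boundary guard, and the grid/block shapes follow the text and Fig. 3.

    // Sketch of the comparison kernel (Algorithm 7). One thread per pair
    // of sites; results go to the flattened device array sc_d[N*M].
    struct SiteDev { /* device-side site data; layout is an assumption */ };

    __global__ void compareKernel(const SiteDev* s1_d, const SiteDev* s2_d,
                                  float* sc_d, int N, int M,
                                  float distPCT, float nbePCT,
                                  float ascPCT, float tspPCT) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // site index in s1_d
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // site index in s2_d
        if (i >= N || j >= M) return;                   // guard partial blocks
        int idx = i * M + j;                            // flattened position
        // The device functions Td(), Nbe(), and T() would be called here;
        // placeholder values keep the sketch short.
        float dist = 0.f, nbe = 0.f, asc = 0.f, tsp = 0.f;
        sc_d[idx] = (dist * distPCT + nbe * nbePCT +
                     asc * ascPCT  + tsp * tspPCT) / 100.f;
    }

    // Host-side launch with the (8,8,1) block shape evaluated later:
    //   dim3 block(8, 8, 1);
    //   dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    //   compareKernel<<<grid, block>>>(s1_d, s2_d, sc_d, N, M,
    //                                  distPCT, nbePCT, ascPCT, tspPCT);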

IV. PERFORMANCE EVALUATION
This section presents the experimental setup that has been used, details of tests of various configurations for the new versions, the results of each new version using its best configurations, and finally additional tests using only the CUDA version on a set of proteins of different sizes.

A. Experimental Setup
All experiments were performed at the Barcelona Supercomputing Center, on a cluster with 25 compute nodes, each with 2 x IBM Power9 8335-GTH @ 2.4 GHz (3.0 GHz turbo, 20 cores and 4 threads/core, for a total of 160 threads per node), 512 GB of main memory, 4 x NVIDIA V100 (Volta) GPUs with 16 GB HBM2, and GPFS via one 10 Gbit fiber link. Geomfinder was developed in C/C++. The GNU C++ compiler 9.2 and the C++11 standard have been used. Compilations include the -O3 optimization option and the -mcpu=power9 and -mtune=power9 flags. For the versions using OpenMP, the -fopenmp flag has been added. The OpenMPI 4.0 library is used for the MPI version. The nvcc 10.2 and GNU C++ 7.3 compilers are used for the CUDA version. The Extrae 3.8.3 [25] and Paraver 4.8.2 [26] tools are used to trace and analyze some versions.

B. Testing Configurations
In the OpenMP version, static and dynamic schedules have been tested to determine the best schedule to use. Traces have been obtained using the 3 Å-11 Å range and 25 threads on one node. The running time with the static schedule is longer than with the dynamic schedule. With the static schedule, an imbalance in the distribution of iterations to the threads can be seen: many threads finish their work early and processor time is wasted (Fig. S2a in the Supplementary Data). This imbalance is due to the variable size of the sites to be compared. With the dynamic schedule, on the other hand, a much better distribution of iterations is achieved, keeping all threads fully occupied and thus reducing execution time (Fig. S2b in the Supplementary Data). Fig. S3 in the Supplementary Data shows details of these tests for the other ranges. In all cases, the dynamic schedule has the best performance. In the MPI implementation, two configurations have been tested. The first performs the comparisons only on the slave nodes and uses only the point-to-point communication primitives MPI_Send and MPI_Recv to send/receive the site data. In the second configuration, all nodes (including the master) perform comparisons, and the collective communication function MPI_Bcast is used to send the site data. The use of MPI_Bcast has been more effective for sending the site data to all the nodes. Fig. S4 in the Supplementary Data shows the traces obtained with these two configurations for the 3 Å-11 Å search range. This version also suffers from load imbalance due to the static distribution of comparisons to the nodes, which certainly affects its performance compared with the other versions. The MPI-OpenMP version incorporates the use of MPI_Bcast to send the site data to the slaves (and the master node) and the dynamic schedule to distribute the loop iterations. To determine the appropriate number of threads per process on each node, several tests have been carried out with different configurations. The configuration that achieves at least (or close to) 80% efficiency in the use of the processors has been selected; in this case, four threads per process (Fig. S3c in the Supplementary Data). The CUDA implementation has been tested using different configurations of threads per block for a two-dimensional grid on a single device. Fig. S5 in the Supplementary Data shows details of these tests. For larger numbers of comparisons, the (8,8,1) configuration achieves up to 3x more acceleration than the (32,32,1) configuration.

C. Results
To compare the performance of these implementations, we have considered executions in ranges where the parallelizable percentage of the compare_sites() function is near to or greater than 90% (see Table I); thus, uMin is fixed at 3 Å and uMax ranges from 8 Å to 11 Å. Figs. S6, S7, and S8 in the Supplementary Data show the speedup, performance, and efficiency comparisons using the best configurations of the OpenMP, MPI, and MPI-OpenMP versions, respectively. The results have been compared against the baseline C/C++ version of Geomfinder. The OpenMP implementation always performs best compared with the MPI and MPI-OpenMP implementations, particularly in the 3 Å-11 Å range, where ≈3.5 million comparisons are performed, achieving up to 17.6x acceleration. Although in all the evaluated ranges the acceleration is close to the theoretical value and the execution times decreased considerably, the OpenMP implementation is not very efficient when the number of comparisons decreases. For example, in the 3 Å-8 Å range (≈1 million comparisons), beyond 10 processors the efficiency drops to 50% and continues to decrease, whereas in the 3 Å-11 Å range with the same number of processors the efficiency is 90%. The MPI and MPI-OpenMP implementations also show good results, achieving up to 11.2x and 12.9x acceleration, respectively, in the last range, although with an efficiency close to 50% or lower. The CUDA version obtains the best accelerations of all (Fig. 4), reaching up to 24.5x with proteins of average size.

D. Other Tests With CUDA Version
Several other pairs of protein structures of different sizes have been tested using only the CUDA version with the (8,8,1) configuration. The input parameters used correspond to a detailed level of search. Table II shows details of these executions. The results show the advantage of CUDA in having thousands of threads on a single device to perform the comparisons: the more comparisons required, the better the performance achieved.

V. CONCLUSION
In this article, we present different parallel strategies that use shared memory (OpenMP), distributed memory (MPI), hybrid (MPI-OpenMP), and GPU-based (CUDA) architectures to parallelize an algorithm that discovers and compares three-dimensional patterns of proteins (Geomfinder). The best accelerations have been obtained with the CUDA version, achieving speedups of up to 24.5x with proteins of average size and up to 95.4x with larger proteins. With these improvements, it is possible to compare larger proteins (quaternary structures) and to increase the number of proteins compared without significantly increasing the execution time.

Fig. 1. Executions of the C/C++ sequential baseline version with representative search ranges. The range and the number of comparisons performed in the compare_sites() function are displayed, together with the elapsed times.

Fig. 2. Workload distribution with MPI. Each node (including the master) performs a chunk of comparisons and stores it temporarily in s_chunk[chunk][M]. A chunk of the sites in the first array s1[N] is compared to all sites in the second array s2[M].

Fig. 3. Example of work decomposition with CUDA. The device arrays s1_d[N] and s2_d[M], of sizes N and M, contain the sites found in each evaluated protein. With N=M=8, the one-dimensional device score array sc_d[N×M] contains 64 elements. The grid and block dimensions are defined as numBlocks(2,2) and threadsPerBlock(4,4,1), respectively. The image also shows the position of the thread (threadIdx.x=2, threadIdx.y=1) in the grid for the element idx=10 of sc_d.

Fig. 4. Accelerations obtained by each new version using its best configuration.

Alejandro Valdés-Jiménez received the degree in informatics engineering in 2001 and the MS degree in communication, networks, and content management from UNED in 2011. He is currently working toward the PhD degree in the Computer Architecture Department, UPC. He works as a professor in the Department of Information Systems, University of Bío-Bío. His research interests include parallel architectures, parallel and distributed programming, and bioinformatics.

Miguel Reyes-Parada received the PhD degree in neurobiology in 1996. He is a pharmacist and chemist. He is a full professor of pharmacology in the faculties of Medicine of the Universidad de Santiago de Chile and the Universidad Autónoma de Chile. He is a researcher with the CIBAP of the University of Santiago. His research interests cover areas such as rational and computer-assisted drug design for the treatment of neuropsychiatric conditions, as well as processes related to polypharmacology and drug repositioning.

Gabriel Núñez-Vivanco received the PhD degree in biotechnology in 2015. He is a bioinformatics engineer. He is an academic in the Natural Sciences and Technology Department, University of Aysén, Chile, and a co-founder and partner of the Bioinformatics Chilean Society. His research interests include computer science applied to computational biology, in particular processes related to polypharmacology and drug repositioning.

Daniel Jiménez-González received the MS and PhD degrees in computer science from UPC in 1997 and 2004, respectively. He currently holds a position as a tenured assistant professor in the Department of Computer Architecture at UPC and is an associated researcher with BSC. His research interests cover the areas of parallel architectures, runtime systems, and reconfigurable solutions for high-performance computing systems. He has participated in the HiPEAC Network of Excellence and the SARC, ACOTES, TERAFLUX, AXIOM, PRACE, and EUROEXA European projects.

TABLE II. Details of the executions of a list of protein pairs.