Measuring the shapes of macromolecules – and why it matters

The molecular basis of life rests on the activity of biological macromolecules, mostly nucleic acids and proteins. A perhaps surprising finding that crystallized over the last handful of decades is that geometric reasoning plays a major role in our attempt to understand these activities. In this paper, we address this connection between geometry and biology, focusing on methods for measuring and characterizing the shapes of macromolecules. We briefly review existing numerical and analytical approaches that solve these problems. We cover in more details our own work in this field, focusing on the alpha shape theory as it provides a unifying mathematical framework that enable the analytical calculations of the surface area and volume of a macromolecule represented as a union of balls, the detection of pockets and cavities in the molecule, and the quantification of contacts between the atomic balls. We have shown that each of these quantities can be related to physical properties of the molecule under study and ultimately provides insight on its activity. We conclude with a brief description of new challenges for the alpha shape theory in modern structural biology.


. Introduction
The advent of high-throughput technologies and the concurrent advances in information sciences have led to a data revolution in biology. This revolution is most significant in molecular biology, with an increase in the number and scale of "omics" projects over the last decade. Genomics projects for example have produced impressive advances in our knowledge of genes and their encoded protein structures, proteomics initiatives help to decipher the role of posttranslation modifications on these structures and provide maps of protein-protein interactions, and functional genomics is the field that attempts to make use of the data produced by these projects to understand protein functions. However, the biggest challenge today is to assimilate this wealth of information into a conceptual framework that will help us decipher life. For example, the current views of the relationship between protein structure and function remain fragmented. We know of their sequences, more and more about their structures, and we have information on their biological activities, but we have difficulties connecting these dots into a knowledgeable whole. We currently lack the experimental and computational tools for directly studying protein structure, function, and dynamics at the molecular and supra-molecular levels. In this paper, we review some of the current developments in building the computational tools that are needed, focusing on the role that geometry plays in these efforts.
It is worth mentioning first that geometric reasoning has been known to play a major role in chemistry and biology for a few decades now. Indeed, molecular structure or shape and chemical reactivity are highly correlated as the latter depends on the positions of the nuclei and electrons within the molecule. Chemists have long used threedimensional plastic and metal models to understand the many subtle effects of structure on reactivity and have invested in experimentally determining the structure of important molecules. The same applies to biochemistry, where structural genomics projects are based on the premise that the structure of macromolecules implies their function. Physical properties of these molecules are then often expressed in terms of their geometry. For example, potential active sites are often assimilated with cavities [6,7] while interactions with the environment are quantified through the surface area and/or volume of their shapes . This link between solvation and geometry has led to the development of implicit solvent models that play an essential role in improving simulations of molecular dynamics.
Protein dynamics is multi-scale: from the jiggling of atoms (picoseconds), the domain reorganizations in proteins (micro to milliseconds), protein folding and diffusion (milli-second to seconds), binding and translocation (seconds to minutes). Connecting these different scales is a central problem in polymer physics that remains unsolved, despite numerous theoretical and computational developments (for review, see [ 3,4]). Computer simulations play an essential role in all corresponding multi-scale methods, as they provide information at the different scales. Usually, computer simulations of protein dynamics start with a large system containing the protein and many water molecules to mimic physiological conditions. Given a model for the physical interactions between these molecules, their space-time trajectories are computed by numerically solving the equations of motion. These trajectories however are limited in scope. Current computing technologies limit the range of time scales that can be simulated to the microsecond level, for systems that contain up to hundred thousands of atoms [ 5]. There are many directions that are currently explored to extend these limits, from hardware solutions including the development of specialized computers [ 6] or by harnessing the power of graphics processor CSBJ Abstract: The molecular basis of life rests on the activity of biological macromolecules, mostly nucleic acids and proteins. A perhaps surprising finding that crystallized over the last handful of decades is that geometric reasoning plays a major role in our attempt to understand these activities. In this paper, we address this connection between geometry and biology, focusing on methods for measuring and characterizing the shapes of macromolecules. We briefly review existing numerical and analytical approaches that solve these problems. We cover in more details our own work in this field, focusing on the alpha shape theory as it provides a unifying mathematical framework that enable the analytical calculations of the surface area and volume of a macromolecule represented as a union of balls, the detection of pockets and cavities in the molecule, and the quantification of contacts between the atomic balls. We have shown that each of these quantities can be related to physical properties of the molecule under study and ultimately provides insight on its activity. We conclude with a brief description of new challenges for the alpha shape theory in modern structural biology. units [ 7] to the development of simplified models that are computationally tractable and remain physically accurate. Among such models are those that treat the solvent implicitly, reducing the solute-solvent interactions to their mean-field characteristics. These so-called implicit solvent models are often applied to estimate free energy of solute-solvent interactions in structural and chemical processes, folding or conformational transitions of proteins and nucleic acids, association of biological macromolecules with ligands, or transport of drugs across biological membranes [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. The main advantage of these models is that they express solute-solvent interactions as a function of the solute degrees of freedom alone, more specifically on its volume and surface area. In this review, we will discuss how these geometric measures are usually computed for macromolecules.
The paper is organized as follows. The next section provides a brief description of the representations of macromolecules and the mathematical definitions of their boundaries or surfaces. The following section reviews popular methods for computing the geometric measures of macromolecules using their most common representation, i.e. a union of balls. The following section covers our work on the alpha shape theory and its application to measuring macromolecules. The result section provides a small review of recent applications of the alpha shape theory to analyze the structures macromolecules, as well as examples of application for characterizing atomic environments with protein and detecting putative drug target sites in RNA. We then conclude with a discussion of future research directions.
2. The geometry of macromolecules Molecular structure and chemical reactivity are highly correlated as the latter depends on the positions of the nuclei and electrons within the molecule: this correlation is the rationale for high resolution studies of the structures of bio-molecules. Early crystallographers who studied proteins and nucleic acids could not rely-as it is common nowadays-on computers and computer graphics programs for representation and analysis of their structures. They had developed a large array of finely crafted physical models that allowed them to have a feeling for these molecules. These models, usually made out of painted wood, plastic, rubber and/or metal were designed to highlight different properties of the molecule under study. In the space-filling models, such as those of Corey-Pauling-Koltun (CPK) [28,29], atoms are represented as balls, whose radii are the atoms' van der Waals radii. The CPK model has now become standard in the field of macromolecular modeling: a bio-molecule is represented as the union of a set of balls, whose centers match with the atomic centers and radii defined by van der Waals parameters. The structure of a biomolecule is then fully defined by the coordinates of these centers, and the radii values. The macromolecular surface is the geometric surface or boundary of these unions of balls. Note that other definitions are possible; this will be discussed in more details below.
As described above, there is no consensus in computational biology as to which surface of the union of balls best relates to the physical properties of the molecule. Three models are widely used; namely, the van der Waals surface, the solvent accessible surface, and the molecular surface (see Figure for a 2D illustration).
The van der Waals surface, vdWB, is defined as the boundary of the union of balls . It consists of a number of spherical patches meeting at common circular arcs. Lee and Richards [8] defined the solvent accessible surface SASB of a molecule as the locii of the center of a probe sphere with radius Rw as it rolls over the van der Waals surface vdWB. The value of Rw is usually set to .4 Å as it approximates the size of a water molecule. It can be shown that SASB is also the boundary of the union of balls , where Bw are "hydrated" balls representing the atoms, i.e. the vdW balls whose radii have been increased by Rw.
The molecular surface, MSB, was introduced by Richards [ 0] as an alternate to the van der Waal's surface and the solvent accessible surface. It is defined as the surface traced out by the front of the probe sphere while it rolls over vdWB (see left panel in Figure for a two dimensional example). The molecular surface consists of three types of patches, namely, spherical patches, toroidal patches and inverse spherical patches.
While geometric models (such as the union of balls discussed above) for the molecular surface provide a deterministic description of the boundary for the shape of a macromolecule, surface models using implicit or parametric surfaces may be favorable for certain applications [30,3 ].
The implicit molecular surface models use a level set of a scalar function f: 3  that maps each point from the three dimensional space to a real value [32][33][34]. The most common scalar function used for macromolecular surfaces is a summation of Gaussian functions [35]. Other scalar functions such as polynomial and Fermi-Dirac switching function have been used as well [36]. Bates et al. [37] proposed the Minimal Molecular Surface as a level set of a scalar function that is the output from a numerical minimization procedure. Parametric surface models specify each point on the macromolecular surface by a pair of real value variables. Piecewise polynomials such as Non-Uniform Rational B-spline (NURBS) and Bernstein-Bézier have been proposed to generate parametric representations for molecular surfaces [30,38]. Spherical harmonics and their extensions parameterize the macromolecular surface using spherical coordinates and provide a compact analytical representation of macromolecular shapes [39,40,4 ].
We note that both implicit and parametric macromolecular surface models are not independent from the geometric models based on union of balls, as they usually have a set of parameters that are tuned such that they provide a reasonable approximation of the surface of the latter. We restrict this section to the description of the macromolecular surface models based on spherical harmonics functions.
Spherical harmonics are single valued complex functions defined on a unit sphere using spherical coordinates (, ), that is, the associated Legendre polynomials. Any surface F that is topologically equivalent to a sphere can be approximated by a linear combination of spherical harmonics basis function in which clm is the expansion coefficient. Since the spherical harmonics form a complete orthonormal basis, the parameterization of ( ) is unique and the coefficients are independent [39]. It is possible to build spherical harmonics representations for a macromolecular surface S by truncating the infinite series in l of the basis functions to a value L that is chosen according to a desired level of approximation.
The coefficients clm are then evaluated based on a representation of S in spherical coordinates [39].
The spherical harmonics representation provides a complete analytical formula for the macromolecular surface. It facilitates multiresolution approximations of molecular shapes and efficient shape comparison algorithm by taking the expansion coefficients clm as shape descriptors [4 ,42].
It should be noted that the spherical harmonics representation can only be applied to a macromolecule whose boundary is star like, that is, the radial function ( ) is single valued. This restriction has limited the application of spherical harmonics based macromolecular surface as many of the macromolecular surfaces have non-zero genera due to the presence of tunnels and overhangs that lead to radial functions ( ) that are not single valued. To circumvent this problem, an extension of the spherical harmonics called 3D Zernike functions has been proposed for modeling macromolecular surfaces [43][44][45].

Measuring macromolecules
A common concrete model representing a molecular shape is a union of balls, in which each ball corresponds to an atom, with its center set at the position of the nucleus of the atom and its radius set to the vdW radius of the atom. In what follows, we discuss the geometric properties of such union of balls, more specifically how we can measure their volume and surface area, how we can detect their pockets and cavities, and how we can quantify interactions between the balls.
Computing the surface area and/or volume of a union of overlapping balls is not a trivial task. The original approach of Lee and Richards [8] computed the surface area by first cutting the union of balls with a set of parallel planes. The intersection of a plane with a ball, if it exists, is a circle that can be partitioned into accessible arcs on the boundary and occluded arcs in the interior of the union. The accessible surface area of an atom i is then the sum of the contributions of all its accessible arcs, computed approximately as the product of the arc length and the spacing between the planes defining the arc. This method was originally implemented in the program ACCESS [8]. Shrake and Rupley [46] refined Lee and Richards' method and proposed a Monte Carlo numerical integration of the accessible surface area. Their method placed 92 points on each atomic sphere, and determined which points were accessible to solvent (not inside any other sphere). Efficient implementations of this method include applications of look-up tables [47], vectorized algorithms [48] and parallel algorithms [49]. Similar numerical methods have been developed for computing the volume of a union of balls [50][51][52][53]. It is also worth mentioning MSMS, a program that allows for computing very efficiently an approximation of the surface area of a macromolecule by generating a triangulated version of its surface [54].
The surface area and/or volume computed by numerical integration over a set of points, even if closely spaced, is not accurate and cannot be readily differentiated. To improve upon the numerical methods, analytical approximations to the accessible surface area have been developed, which either treat multiple overlapping balls probabilistically [55][56][57] or ignore them altogether [58,59]. While these approaches are approximative, they are fast and lead to differentiable geometric measures. In addition, they are well suited for hardware acceleration on graphics processing units [60].
Even better analytical methods describe the molecule as a union of pieces of balls, each defined by their center, radius, and arcs forming their boundary, and subsequently apply analytical geometry to compute the surface area and volume [6 -65]. For example, Pavani and Ranghino [5 ] proposed a method for computing the volume of a molecule by inclusion-exclusion. In their implementation, only intersections of up to three balls were considered. Petitjean however noticed that practical situations for proteins frequently involve simultaneous overlaps of up to six balls [64]. Subsequently, Pavani and Ranghino's idea was generalized to any number of simultaneous overlaps by Gibson and Scheraga [4] and by Petitjean [64], applying a theorem that states that higher-order overlaps can always be reduced to lower-order overlaps [66]. Doing the reduction correctly remains, however, computationally difficult and expensive. The alpha shape theory solves this problem using Delaunay triangulations and their filtrations, as described by Edelsbrunner [67]. It will be discussed in the next section.
The distinction between approximate and exact computation also applies to existing methods for computing the derivatives of the volume and surface area of a molecule with respect to its atomic coordinates [68][69][70][71][72][73]. In the case of the derivatives of the surface area, computationally efficient methods were implemented in the MSEED software by Perrot et al. [74] and in the SASAD software by Sridharan et al. [75]. All these methods introduce approximations to deal with singularities caused by numerical errors or by discontinuities in the derivatives [70].
The problem of detecting and measuring internal cavities of macromolecules is very popular as these cavities correspond to putative binding sites for drugs and thus represent attractive leads for the design of therapeutic drugs. Most solutions to this problem rely heavily on geometry. They can be divided into three categories: (i) the grid-based methods, (ii) the probe sphere detection methods, and (iii) the analytical methods.
In the grid-based method, the molecule is positioned on a threedimensional Cartesian grid whose vertices are then sorted into two groups: those that are covered by a protein atom and those that are not. The latter are further characterized as being inside a pocket if they satisfy some geometric conditions (such as being inside and at a distance greater than the radius of a water molecule from the convex hull of the macromolecule). The measures of these pockets (volume and surface area) are then computed by Monte Carlo integration over their corresponding grid points. POCKET  The probe sphere method proceeds by placing probe spheres that are tangent to the surfaces of two atoms of the biomolecules and then Measuring molecular shapes reducing their radii to eliminate overlaps with neighboring atoms; all remaining spheres whose radii exceed a minimal cutoff value (usually Å) are used to define the pockets and cavities. This method was originally implemented in the program SURFNET [80] and later modified in the programs PASS [8 ] and PHECOM [82]. Interestingly, the grid-based and probe sphere methods were recently combined in the program POCASA [83].
The alpha shape theory combined with the discrete flow concept was the first analytical method proposed for detecting and measuring inaccessible cavities [84] as well as pockets [6,85] in macromolecules. It has been extended since to detect channels between inner cavities and the outside [86].
The program CAVE implements a complementary approach in which the boundaries of the pockets are directly triangulated, forming the so-called enveloping triangulation [87].
While exact theories for computing the surface area and volume of a union of balls exist, the computations of contact areas between balls are more ambiguous as there is no unique definition of what a "contact" is. Three overlapping balls provide a simple illustration of this problem. The regions of the balls that are covered by exactly two balls can be easily partitioned between the corresponding balls. Partitioning the region that is covered by all three balls, however, is more ambiguous. Most methods that compute the contact areas between atoms in a molecule rely on a Voronoi partitioning of such overlapping regions; the contact between two atoms is then defined as the area of the face that separates their Voronoi regions (see for example [88][89][90][91][92]). We note that these methods require special care for atoms on the surface of the molecule of interest, as the corresponding Voronoi cells are unbounded; this is usually resolved by adding water molecules based on molecular dynamics simulations [88,92]. Finally we mention that Apollonius diagrams (also called additive Voronoi diagrams) have also been used to provide an alternate definition of contacts [93,94]. 4. The alpha shape theory: a general framework to characterize the geometry of macromolecules Given a collection B={Bi} of N three-dimensional balls, the volume and the surface area of the union of B can be computed using the principle of inclusion-exclusion. That is, the volume and surface area of the union can be expressed as an alternating sum of volumes and surface areas of the common intersections of the subsets of B, where  stands for either the volume V of the union of balls or the area of its boundary A. There are two issues that need to be solved to make this equation computationally tractable. Firstly, we need to have a consistent way to reduce significantly the number of terms in the inclusion-exclusion formula; brute force application would lead to an algorithm with exponential running time, as the total number of terms is 2 N -, with each term corresponding to the measure of the intersection of at most N balls. Secondly, we need analytical formula for computing the non-empty intersections of balls. The first requirement was elegantly solved with the alpha shape theory. It is based on the concept of Voronoi decompositions and Delaunay triangulations and their filtrations, as proposed by Edelsbrunner [67]. We illustrate its application to measuring the shape of a protein in Figure 2 and describe briefly its major components below. For a more comprehensive description, we refer the reader to the original paper of Edelsbrunner and to some application papers [7,84,95,96]. It is noteworthy however that Naiman and Wynn had introduced the concept of using the Voronoi decomposition and Delaunay triangulation to simplify the inclusionexclusion formula from a statistical perspective a little earlier [97].
Voronoi decomposition and dual complex Let us consider a finite set of spheres Si with centers ci and radii ri and let Bi be the ball bounded by Si. We define the square distance between a point x and a sphere Si as ( ) ‖ ‖ . This distance definition allows for varying radii for the spheres.
The Voronoi region Vi of the sphere Si consists of all points x that are at least as close to Si as to any other sphere, Si and Sj if the two corresponding Voronoi regions share a common face, called a Voronoi plane. Furthermore, we draw a triangle connecting ci, cj and ck if Vi, Vj and Vk intersect in a common line segment, called a Voronoi edge, and similarly we draw a tetrahedron between four centers if their Voronoi regions meet at a common point, called a Voronoi point. Assuming general position of the spheres, there are no other cases to be considered: this is a central property of the Delaunay triangulation that will lead to a significant simplification of the inclusion-exclusion formula (see below).
Let us now limit the construction of the weighted Delaunay triangulation to within the union of balls. In other words, we draw a dual edge between the two vertices ci and cj only if and share a common face, and similarly for triangles and tetrahedra. The result is a sub-complex of the Delaunay triangulation, which is referred to as the dual complex K of the set of spheres.
It is often useful to alter the spheres by increasing or decreasing their radii (we will see one application in the result section to study pockets in a large RNA molecule). We do this in a way that leaves the Voronoi diagram invariant. Let us model growth with a positive real number denoted α 2 . For each i let Si(α) be the sphere with center ci and radius . The alpha complex K  of the spheres Si is the dual complex of the spheres Si(α).
Measuring the volume and surface of the union of spheres As proved in [67], the inclusion-exclusion formula that corresponds to the dual complex gives the correct volume of a union of balls, as well as the correct area of its boundary, the union of spheres. Here we state the corresponding theorem for the volume. Let si be the simplex corresponding to the ball Bi, sij the simplex formed by the edge between the centers of the balls Bi and Bj, sijk the triangle corresponding to the three balls Bi, Bj, and Bk, and finally sijkl the tetrahedron defined by the four balls Bi, Bj, Bk, and Bl.
Volume Theorem: Here V(Bi) is the volume of the ball Bi, Vi:j is the contribution of Bi to the volume of the intersection of the balls Bi and Bj, etc. A similar theorem is used to compute the surface area A. They overcome the exponential complexity of the inclusion-exclusion formula by implicitly reducing higher-order to lower-order overlaps. In addition, we note that the balls in each term form a unique geometric configuration and that the analytic calculations of the volume and surface area can be done without case analysis [67].
Several formulas have been developed for computing the volumes and surface areas of the intersection of two, three and four balls with unequal radii (see for example [4,98,99]). Of particular interest to macromolecule structure modeling, we have recently derived new formulas that satisfy a specific constraint, namely that the volume and surface area intersections are only expressed as functions of the radii of the balls and the distances between their centers [96]. . To compute its geometric properties, we proceed in three steps: (B) first, we compute the weighted Delaunay triangulation (shown as blue edges) of all the atomic balls representing the protein (not including the inhibitor); the Delaunay triangulation is then filtered, to yield the dual complex (C) and a set of pockets (D). The dual complex (in red) is the subset of the Delaunay triangulation that is limited to simplices whose corresponding balls have a non-empty intersection. The largest pocket, shown in green is found to align with the position of KC32 in the protein structure. Three alternate pockets are shown in purple, magenta, and red (D).

Detecting pockets in a union of spheres
A full description of how to detect and measure pockets in a union of balls based on the alpha shape theory is available in [6]. Briefly, the concept of pockets is ultimately connected to the notion of a continuous flow field defined on the Delaunay triangulation of these balls. Let be the set of tetrahedra in the Delaunay triangulation and where is a dummy element representing the complement of the triangulation in . The flow relation ' ' with is defined by: (i) and share a common triangle , and (ii) The interior of and the orthogonal center lie on different sides of the plane defined by . The orthogonal center is the center of the smallest ball that is orthogonal to all four balls whose centers are the vertices of . If , is said to be a predecessor of and is then a successor of .
is a sink if it has no successors; in other words, a tetrahedron is a sink if and only if it contains its orthogonal center. Sinks are important since they are responsible for the formation of voids: if H is a void of the union of balls then at least one tetrahedron in H is a sink.
By definition, pockets consist of the Delaunay tetrahedra that do not belong to the dual complex K and are not ancestors of . The voids are the only pockets without connection to the outside. All other pockets connect to the outside at one or more places, called mouth. Figure 2 illustrates these concepts for the HIV-protease.
The tetrahedra that form the four major pockets detected by this method are shown overlaid with the structure of the protein.
Interestingly, we find that the main pocket (shown in green) matches with the position of the inhibitor detected in the X-ray structure (see Figure 2, panel D) [3].
The surface area and volume of a pocket are easily computed by first identifying their tetrahedra and their faces that belong to the dual complex followed by the application of simplified inclusion-exclusion formulas similar to those used for measuring the dual complex (see [6,7] for details).
The computation of contact areas between balls is ambiguous as there is no unique definition of what a contact is [5]. Here we follow the framework of the alpha shape theory described above. The key step when applying this theory to measure a union of spheres is to derive the dual complex K of their centers (see above). Two spheres Si and Sj that are connected by an edge in K overlap, i.e. the distance between their centers is smaller than the sum of their respective radii. Based on this observation, we proposed the following definition of contacts between balls [5]: Definition: Two spheres and in a union of spheres are said to be in contact if and only if their centers and are connected by an edge in the dual complex of . The intersection between these two spheres is the union of two caps Ci:j and Cj:i these two caps are connected at the level of the plane that separates the Voronoi cells of Si and Sj. When the sphere Si is in contact with more than one sphere, say with spheres Sj and Sk, there is a possibility that the corresponding caps Ci:j and Cj:i overlap: this occurs when the triangle cicjck is part of the dual complex K. Figure   3A illustrates this problem. To remove the ambiguity in assigning the corresponding overlap region Ci:j;k to either the contact between Si and Sj or the contact between Si and Sk, we use the Laguerre Voronoi diagram on the surface of Si. Sugihara [2] extended the concept of Laguerre diagram in the plane to a Laguerre Voronoi diagram on the surface of a sphere. In his approach, the Laguerre distance from a point P to a circle Ci on the sphere is defined as the geodesic length of the tangent line segment from the point to the circle. Similar to the Voronoi diagram described above, this distance function creates Laguerre Voronoi regions for each cap and separates them with geodesic arcs (see Figure  3B for an example of the Laguerre Voronoi diagram of ten circles on a sphere). We note that many of the properties of the weighted Voronoi diagram remain true in its spherical version. For example, if two circles intersect in two points, their Voronoi edge contains these two points.
The definition of contacts based on the alpha shape theory given above leads to the following additive property for all contact areas associated with a sphere Si: where A(Si) is the surface area of Si not covered by any other sphere, Ai:j is the contact area between the spheres Si and Sj, and the summation extends to all spheres Sj such that cicj is an edge in the dual complex K.
There is a trivial correspondence between the Laguerre Voronoi diagram of the caps on the surface of sphere Si and the set of simplices in the dual complex K that are associated with Si. For example, the two caps Ci:j and Ci:k overlap and share an edge in the spherical diagram if and only if the simplex sijk corresponding to the triangle formed by the centers of the three spheres Si, Sj, and Sk, belongs to K. This leads to the following inclusion-exclusion formula for the contact areas between a sphere Si and its neighbors: Here is the area of the contribution of Ci:k to the intersection of Ci:j and Ci:k, and is the common contribution of Ci:k and Ci:l to the intersection of Ci:j and Ci:k. The computations of the different types of terms on the right side of this equation involve simple spherical geometry [5]. In Figure 3D, we illustrate the computation of . We note that the definition of contacts between spheres given here is different from the standard definition based on local geometric proximity. Indeed, two spheres may overlap (i.e. be close in space) without being connected by an edge in K and therefore would not be considered in contact according to our definition (see Figure 4 for an illustration of this point). Our approach however is similar to the methods that define contacts in proteins based on the Voronoi diagram [88,89,92, 00].

Implementation
The theory described above provides a framework for measuring a union of spheres, i.e. computing its accessible surface area and enclosed volume, detecting its cavities and pockets, as well as for locating neighboring spheres in the union and defining their contacts. The implementation of this theory involves five steps: (i) compute the Delaunay triangulation, (ii) generate the dual complex, (iii) compute the surface area and volume using the Volume Theorem given above and the corresponding Area Theorem, (iv) detect pockets and cavities using the concept of flows described above, and (v) calculate individual contact areas using the contact definition described above. Several implementations of step (i) to (iv) are available, such as AlphaShape, CASTp [6, 0 ], and AlphaVol [7]. We have recently developed a new implementation of the same four steps that enables the analysis of very large molecular systems with millions of atoms, such as viral envelopes, available in the program UnionBall [96]. The addition of step (v) within the alpha shape theory is new and currently available in just one software package, BallContact [5].

Applications
The alpha shape theory provides an accurate and robust method for computing the geometric measures of a macromolecule. Among these measures, surface area and volume are used to quantify the interactions between such a molecule and the water surrounding it in implicit solvent models. The detection of pockets within a macromolecule and the determination of their sizes serve as a starting point for predictive studies of macromolecule-ligand interactions. In addition, the determination of internal atomic contacts allows for better characterization of atomic interaction and better definitions of solvation energies (see for example [ 02]. We provide illustrations of two of these applications of the alpha shape theory to study macromolecules, namely the characterization of pockets in ribosomes and the quantification of residue environment in protein structures. We then review recent applications of the alpha shape to study the geometry of large biomolecules and its relationship to function.

A(C i: j;k )
A(C i: j;kl )

A(C i: j;k )
The high-resolution structures of bacterial ribosomes, and those of their complexes with antibiotics, have significantly advanced our understanding of drug-RNA interactions, and paved the way for new antibacterial drug discovery and design, with the ribosome as a target. A prerequisite to drug design is the determination of the sites where the ligand may interact with its receptor. Binding sites of small molecule ligands are usually located in pockets (also referred to as clefts, or grooves) or cavities (i.e. pockets fully inaccessible to solvent) in the target macromolecule. As described in the previous section, the alpha shape theory provides the theoretical background that allows us to detect and measure these pockets. We have tested the performance of our own implementation, UnionBall [96], by checking if it is able to detect geometric pockets in the 30S subunit of the ribosome of Thermus thermophilus that are biologically relevant. The small ribosomal subunit is extensively studied as an antibiotic target, and there are at least eight structures of their complexes known [ 03]. We use the structure of the complex hygromycin B -30S as a reference (PDB code HNZ). Figure 5 shows the results of the application of UnionBall on the 30S ribosome. Note that all calculations were performed in the absence of an antibiotic molecule. We found that the deepest pocket, i.e. the largest pocket identified with a large alpha value, matches with the position of the antibiotics that binds to the 30S subunit of the ribosome.
It is common to characterize the structural environment of a residue in a protein from the secondary structure element it belongs to and its accessible surface area [ 04]. The former characterizes the local conformation of the residue, while the latter is used to quantify the surface area that was buried upon folding, as it is expected to differ for hydrophilic and hydrophobic residues. This has led to a quantification of the hydrophobic effect using the concept of a waterimplicit solvation free energy that is computed as a weighted sum of the accessible surface areas of all residues in a protein [ ]. We have extended this idea by accounting for the nature, and extent of, the inter-atomic contacts that are formed in the core of the protein as it folds [ 02,05]. Here we show why the nature of the inter-atomic contacts matters.
The fraction of the surface area of any atom that is in contact with solvent is called the solvent accessible surface area (ASA). In parallel, we define the polar contact surface area, or PCA, and the non-polar contact area, or NPCA, of an atom as its area in contact with (or occluded by) polar and non-polar atoms, respectively. In all analyses presented below, carbon and sulfur atoms were classified as non-polar atoms, while nitrogen and oxygen (neutral or charged) were classified as polar atoms. Note that PCA and NPCA should not be confused with the polar surface area and non-polar surface area, which commonly correspond to the accessible surface area of polar and nonpolar atoms, respectively. All surface areas mentioned above (i.e., ASA, PCA, and NPCA) were computed based on the alpha shape theory and its definition of contacts.
The calculation is performed with the program BallContact as follows. Each atom of the protein is represented as a ball, centered at the position of the atoms in the minimized structure for the protein, with a radius equal to RvdW+RH2O, where RvdW is the vdW parameter for the atom in AMBER94 and RH2O is the radius of the solvent probe, set to .4 Å. For an atom i in the protein, the program outputs its accessible surface area, ASA, as well as the list of atoms that are in contact and the corresponding contact areas. These atoms are then divided into two groups, those that are "near" (following the terminology of Shrake and Rupley [46]), i.e. that belongs to the same residue as i or to the backbone of the two flanking residues, and the others, named "long". Atoms that are "near" account for the stereochemistry of the residue to which atom i belongs and are not included in the subsequent calculations. Contact atoms that are "long" are further subdivided into polar and non-polar atoms, according to the definition above; the PCA and NPCA surface areas are then the sum of the corresponding contact areas.
We define the environment of a residue in a protein as the union of the accessible areas of its atom and of all their "long" contacts. This environment is then divided into an ASA, PCA, and NPCA. These three values correspond to sums of areas on spheres, given in Å 2 ; they are independent of each other. We define corresponding normalized values, XASA, XPCA, and XNPCA, according to: These three fractions of surface areas, expressed in percent, are no longer independent, as their sum is 00.
We collected data on the environments (accessible to solvent, polar, or non-polar) of 305604 residues in a database of highresolution protein structures [5]. The corresponding average results are shown in Figure 6.
We found that the non-polar environments of all twenty types of amino acids are weakly correlated to their accessible environments. This weak correlation illustrates that accessible surface area and contact areas provide complementary information that is relevant to the native conformations of the proteins. We note that the plot of non-polar contact area versus accessible surface area partitions the amino acids into two groups, those with low ASA and high NPCA, namely C, V, I, L, M, Y, F, W, and the others; this partitioning parallels the groupings of amino acids as being either hydrophobic or hydrophilic.
The alpha shape theory was originally developed in the early 990s by Edelsbrunner and co-workers to characterize the shapes of sets of points (weighted or not) in 2D and 3D [67, 06, 07]. As weighted points can be seen as balls, and as molecules are usually represented as union of balls, it was not surprising to see alpha shapes being adapted to characterize the shapes of molecules. The first applications focused on measuring molecular shapes (i.e. computing their volume and surface area) [95] as well as on characterizing the "empty spaces" enclosed within the boundary of a molecule, namely cavities [84] and pockets [6,85]. While these applications of the alpha shape theory remain popular in structural biology with new and improved software implementations being released regularly, such as

Statistics of protein structure geometry
Proteins are essential tools that perform a wide variety of biological functions inside the cell. Just like in the case of macroscopic tools, it is the shape and dynamics of a protein that define its function. Recent structural genomics initiatives have undertaken the vast challenge of finding the structures of all known proteins, in hopes of unraveling this relationship between geometry and function. The experimental determination of a protein structure at the atomic level remains, however, a difficult problem. There is hope however that theoretical and computational techniques will supplement experimental methods and enable protein structure prediction at the near atomic level [ 09, 0]. Many of these techniques rely on the knowledge derived from the analysis of the geometry of known protein structures. Such an analysis requires an objective definition of atomic packing within a molecular structure. The alpha shape theory has proved a useful approach for deriving such a definition. Singh et al for example used the Delaunay complex to define nearest-neighbors in protein structures and to derive a four-body statistical potential [ ]. This potential has been used successfully for fold recognition, decoy structure determination, mutant analysis, and other studies (for a full review, see [ 2]). The potentials considered in these studies rely on the tetrahedra defined by the Delaunay triangulation of the points representing the atoms. In parallel, Zomorodian and colleagues have shown that it is possible to use the alpha shape theory to filter the list of pairwise interactions to generate a much smaller subset of pairs that retains most of the structural information contained in a proteins [ 3]. The alpha shape theory has also been used to characterize the shapes [ 4] and surfaces [5][6][7].

Protein structure alignment
The alpha shape theory allows for the detection of independent simplices characterizing the geometry of a protein structure. It is worth mentioning that it is possible to use this information to compare two protein structures and even to derive a structural alignment between these structures [ 8,9]. Characterizing and predicting bio-molecular interactions As the function of a protein is related to its geometry and as function usually involves binding to a partner protein, significant efforts have been put into charactering the geometry of protein-ligand interactions, where ligands include small molecules, nucleic acids, and other proteins. Among these efforts, a few relate to the applications of the alpha shape theory. As described for example in Figure 5, the latter has been used extensively for detecting pockets and cavities within molecules that are putative binding sites [6,0 ]. It has been recently extended to characterize binding sites at the surface of proteins [ 5-7, 20, 2 ]. The alpha shape theory has also been used to characterize the interfaces in protein-protein complexes [ 22] as well as protein-DNA interactions [ 23]. For a complete review of the applications of the alpha shape theory to characterize protein interactions, the reader is referred to [ 24].
It is worth mentioning a geometric parallel between finding a structural alignment between two proteins and predicting the structure of their interactions. While the former is based on the identification of similar geometric patterns between the two structures, the latter is based on the identification of complementary patterns between the surfaces of the two structures. As mentioned above, geometric patterns based on the Delaunay triangulation have been used for structural alignment. In parallel, similar patterns have recently been used to predict protein-protein interactions [ 25].
Alpha shapes as a tool to characterize dynamics All the applications described above relate to the static geometry of molecules. Bio-molecules however are dynamics. A molecular dynamics simulation is designed to capture this dynamics: it follows the Newtonian dynamics of the molecule as a function of time, generating millions of snapshots over the course of the trajectory [ 26]. The alpha shape theory has proved useful to characterize the geometric changes that occur during such a trajectory. For example, using the concept of topological persistence [ 27], Kasson et al characterized structural changes in membrane fusion over the course of a simulation [ 28]. More recently, Lindow et al proposed a a Voronoi-based algorithm to extract the geometry and the dynamics of cavities and channels from a molecular dynamics trajectory [ 29].

Summary and Outlook
The Alpha Shape Theory provides a fast, accurate, and robust method for characterizing the geometry of a macromolecule represented as a union of balls. In this paper, we have presented the mathematical foundations of this theory and described its applications to measuring the shape of a molecule. We have shown how it can be used to compute the volume and surface area of a union of balls, to detect and measure cavities and pockets inside the outer envelope of such a union of balls, and to compute the surface areas of the contacts between the balls. We have reviewed how these measures are related to properties of the molecule of interest, as well as recent applications of the alpha shape theory that go beyond studying the geometry of a single molecule. We conclude this paper with a description of one new challenge in biology in which the alpha shape theory is expected to prove useful.
Recent advances in structural biology have produced an abundance of data on large macro-molecular complexes such as the RNA polymerase transcription complexes, the ribosome complexes, as well as large viral particles with more than sixteen million atoms. Modeling the dynamics of such large systems is as important as modeling smaller proteins. It becomes impractical, however, to consider all atoms of such molecular machinery and we need to introduce approximations that consider the system at coarser levels of detail. One possible approach is to represent the macro-molecular complex with a small number of spheres, supplemented with a model for their interactions that captures the physics of the underlying atomic model. This model will include a potential energy function for internal interactions and a potential energy function to account for the solvent environment of the system. We expect the latter to resemble the solvation potentials described in these papers that relate geometry and energy. We also expect the alpha shape theory, which provides full characterization of union of balls or spheres, to play an important role in both characterizing the coarse-grained representations of these molecular machines and in developing the models for their interactions.