GEO-Nav: a geometric dataset of voltage-gated sodium channels

Voltage-gated sodium (Nav) channels constitute a prime target for drug design and discovery, given their implication in various diseases such as epilepsy, migraine and ataxia, to name a few. In this regard, performing morphological analysis is a crucial step in comprehensively understanding their biological function, as well as in uncovering subtle details of their mechanism that may elude experimental observation. Despite their tremendous therapeutic potential, drug design resources are deficient, particularly in terms of accurate and comprehensive geometric information. This paper presents a geometric dataset of molecular surfaces that are representative of Nav channels in mammals. For each structure we provide three representations and a number of geometric measures, including length, volume and straightness of the recognized channels. To demonstrate the effective use of GEO-Nav, we have used it to test two methods belonging to two different categories of approaches: a sphere-based and a tessellation-based method.


Introduction
Voltage-gated sodium (Nav) channels are transmembrane proteins consisting of a pore-forming pseudotetrameric α subunit and, in most vertebrate cells, one or more smaller (non-pore-forming) β subunits. Nav channels play an essential role in electrically excitable cells such as neurons, myocytes and endocrine cells: they respond to changes in membrane potential (the difference in electric potential between the interior and the exterior of a cell) by opening and closing ion-selective pores, allowing the flow of positively charged sodium ions from the region of higher concentration to the region of lower concentration (usually, from the exterior to the interior of the cell) [1]. It is worth noting that the word channel is used with deliberate ambiguity. Not only does the term refer to the whole transmembrane protein, but it also denotes a cavity (i.e., a connected component of the complement space of the protein inside its convex hull) with two entrances on opposite sides of a molecule (see, for example, [2]).
Biological background and motivation There are three primary states in Nav channels, see [3] and Figure 1:
• Closed. When the cell is at rest, activation gates are closed, sodium ions are prevented from going through, and the membrane potential is maintained around some fixed negative value, e.g., around -70 mV in most neurons.
• Open. Because of an approaching action potential, the membrane potential rises (in jargon, it depolarizes), e.g., to about -55 mV in neurons; as a response, activation gates open, allowing Na+ ions to flow into the cell and the membrane potential to rise further.
• A number of routines to study the centerlines and maximal inscribed ball radius values along them: centerline length and straightness, and channel volume. Two comparison measures are also provided.
• Quantitative and qualitative analyses of the output, together with a comparison that highlights the strengths and weaknesses of the two approaches.
Organization The remainder of the paper is organized as follows. Section 2 briefly discusses existing databases and details how this dataset differs from those presented in the previous editions of the Symposium of 3D Object Retrieval. Section 3 describes how the dataset was generated and what files are available. Two geometric algorithms, summarized in Section 5, are evaluated and compared on the GEO-Nav dataset on the basis of measures presented in Section 4; results and considerations derived from such an analysis are presented in Section 6. Concluding remarks end the paper.

Related datasets and databases
Datasets presented at the previous editions of the Symposium of 3D Object Retrieval [18,19,20,21,22,23] focused on the retrieval, classification and segmentation of small molecules with no restriction on their function. Here we consider voltage-gated sodium channels, macromolecules of approximately 2000 amino acid residues organized in four homologous domains, and provide a detailed description of the geometry of each channel, including centerline, volume, length and straightness.
Although some comprehensive databases, such as PubChem [24], ChEMBL [25], BindingDB [26] and VGSC-DB [27], collect relevant data on Nav channels, they are either general-purpose (i.e., they do not consider just channels but the more general case of cavities) or restricted to physico-chemical information, neglecting the geometry of the problem. Moreover, such databases do not provide any measure to test and compare the capability of computational approaches to study channels.

The dataset
Recognizing and characterizing voltage-gated sodium channels are key steps toward unravelling their intricate mechanisms and their involvement in various physiological and pathological conditions. The construction of a reliable and comprehensive dataset of sodium channels necessitates the input of domain experts, which is here incorporated by following some ground properties of what such channels are supposed to look like. The dataset here presented consists of:
• 3 benchmarked (BM) models, synthetically created by placing carbons around straight segments so that the cavity resembles a cylinder of fixed radius (see Figure 2). While unrealistic if compared to voltage-gated sodium channels, they can be used as a sanity check given that their centerline and maximal inscribed ball radius values are known a priori (by construction). These 3 models are meant to differ from Nav channels, for which some crystallographic structure (or molecular dynamics) may be known but only an experimental estimate of the position of the atoms at a given instant is available.
• 21 PDB entries deposited in the Protein Data Bank archive [28], each one containing a Nav channel in different conformations experimentally determined via X-ray diffraction and cryo-electron microscopy. The structures refer to eight Nav channels: Nav1.1-Nav1.8. These channels are differentially expressed throughout the human body, and their mutations are associated with migraine (Nav1.1), epilepsy (Nav1.1-Nav1.3, Nav1.6), cardiac (Nav1.5), pain (Nav1.7-Nav1.8), and muscle paralysis (Nav1.4) syndromes [29]. A complete list is provided in Table 1, together with details about the specific structure and its resolution (a measure of the quality of the structure, which quantifies how much detail can be observed in the experimental data). The existence of a cavity that qualifies as a channel is certified by the metadata contained in the PDB repository as well as in the papers that introduced and analyzed these structures, which are listed in Table 1.
• 63 additional structures, three for each of the PDB entries of the previous point. Each structure is obtained by applying a uniform random perturbation in [−1, 1]^3 Å to the coordinates of those atoms that are farther away from the centerline than a given threshold. Since we cannot determine the centerline's exact location without making use of computational methods, we proceed manually by temporarily roto-translating the channel (here intended as the whole protein) so that the centerline is approximately aligned with the z-axis; we then consider a threshold distance of 10Å from the z-axis. Although this procedure might sound subjective and prone to human error, a threshold distance of 10Å is sufficiently large to leave atoms determining the cavity untouched. It is indeed known that in Nav channels (and, more generally, in transmembrane proteins) channels have preferably straight centerlines [30]; additionally, it has been observed that Navs have maximal inscribed ball radius values roughly in the range 3.6-6Å, with average bottleneck radii ranging from 1.62 to 2.20Å [31,32].
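The perturbation described in the last item can be sketched as follows. This is a minimal illustration and not the script used to build GEO-Nav: the function name `perturb_far_atoms`, its parameters and the random seed are hypothetical, and the input coordinates are assumed to be already roto-translated so that the centerline roughly follows the z-axis.

```python
import numpy as np

def perturb_far_atoms(coords, threshold=10.0, amplitude=1.0, seed=0):
    """Add uniform noise in [-amplitude, amplitude]^3 to atoms far from the z-axis.

    coords: (n, 3) array of atomic centers in Å, with the channel centerline
    assumed to be roughly aligned with the z-axis.
    Returns the perturbed coordinates and the boolean mask of moved atoms.
    """
    coords = np.asarray(coords, dtype=float)
    rng = np.random.default_rng(seed)
    # The distance from the z-axis only depends on the x and y coordinates.
    dist_to_axis = np.hypot(coords[:, 0], coords[:, 1])
    far = dist_to_axis > threshold
    noise = rng.uniform(-amplitude, amplitude, size=coords.shape)
    perturbed = coords.copy()
    perturbed[far] += noise[far]  # atoms lining the cavity stay untouched
    return perturbed, far

# Three toy atoms: one close to the axis, two beyond the 10 Å threshold.
atoms = np.array([[2.0, 1.0, 5.0], [12.0, 0.0, -3.0], [0.0, -15.0, 8.0]])
perturbed, far = perturb_far_atoms(atoms)
```

Only the atoms flagged in `far` are displaced, by at most `amplitude` along each coordinate, so the atoms that determine the cavity keep their experimental positions.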
This leads to a total of 84 Nav channel structures (21 experimental and 63 perturbed), in addition to the 3 benchmarked models. The structures are made available at the link https://github.com/rea1991/GEO-Nav in three formats: PDB files, XYZR files, and OFF files. PDB files are required by many of the tools from the bioinformatics community, e.g., by HOLE [7]. Before use, PDB files need to be preprocessed using a custom, freely distributed Python script to remove ligands and heteroatoms (HETATM records).
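As an illustration of this kind of preprocessing, the sketch below filters PDB records by their record name (the first six columns of each line). It is not the script distributed with the dataset: the function name `strip_hetatm` and the choice of keeping only ATOM, TER and END records are assumptions made for the example.

```python
KEPT_RECORDS = {"ATOM", "TER", "END"}  # assumption: everything else is dropped

def strip_hetatm(pdb_lines):
    """Return only the lines whose record name is in KEPT_RECORDS.

    HETATM records (ligands, ions, waters) are discarded, together with any
    other record type that is not needed to describe the protein atoms.
    """
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()  # the record name occupies columns 1-6
        if record in KEPT_RECORDS:
            kept.append(line)
    return kept

# Shortened, illustrative lines: only the record name matters here.
sample = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  CA  MET A   1",
    "HETATM 9001 NA    NA A 901",
    "TER",
    "END",
]
cleaned = strip_hetatm(sample)
```

With a real file, the lines would be read with `open(path).read().splitlines()` and the result written back to a new PDB file.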
OFF files contain the molecular surfaces (MS) calculated and triangulated with the C++ software NanoShaper [17], choosing the Connolly Solvent Excluded Surface (SES) model [33,34]; default parameters are used, with the sole exception of the probe radius, which is set to 0.8Å. To produce OFF files, NanoShaper requires XYZR files: an XYZR file lists one atom per row, with columns containing the coordinates of the atomic center and the atomic radius. To convert PDB files to XYZR files we have used a Python script based on the open-source package ProDy for protein structural dynamics analysis and on the Amber99 force field [35]. Examples of protein surfaces produced by NanoShaper are provided in Figure 3.
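The XYZR layout can be illustrated with a small sketch. It does not reproduce the ProDy-based conversion script: the helper `to_xyzr_rows` is hypothetical, and the radius table below contains generic van der Waals values chosen for illustration, not the Amber99 radii used for GEO-Nav.

```python
# Illustrative van der Waals radii in Å; NOT the Amber99 radii used for GEO-Nav.
ILLUSTRATIVE_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def to_xyzr_rows(atoms, radii=ILLUSTRATIVE_RADII, default_radius=1.70):
    """Format atoms as XYZR rows: one atom per row, columns x, y, z, radius.

    atoms: iterable of (element, x, y, z) tuples with coordinates in Å.
    Elements missing from the table fall back to default_radius.
    """
    rows = []
    for element, x, y, z in atoms:
        r = radii.get(element.upper(), default_radius)
        rows.append(f"{x:.3f} {y:.3f} {z:.3f} {r:.2f}")
    return rows

atoms = [("N", 27.340, 24.430, 2.614), ("C", 26.266, 25.413, 2.842)]
rows = to_xyzr_rows(atoms)
# rows[0] is "27.340 24.430 2.614 1.55"
```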

Evaluation measures
When applying a geometric method to a PDB or OFF file containing a channel, the expected output consists of a collection of points representing the centerline of the Nav channel, together with the maximal inscribed ball radius values at such points. Specifically, each point p of the centerline C is endowed with its coordinates and a value ρ_p representing the radius of the largest sphere centred at p and completely contained in the Nav channel.
We develop a collection of elementary tools which, given a molecular structure, aim at comparing the protein channels recognized by different computational methods in terms of centerlines and maximal inscribed ball radius values.

Measures for single-channel analysis
We compute some elementary measures of the retrieved channels: the length, the straightness, and the volume.
Assuming that the points of a centerline C are sorted from one entrance p_0 of the channel to the other entrance p_N, the length value is simply retrieved by summing the distances between consecutive points of C:
\[ \mathrm{length}(C) = \sum_{i=0}^{N-1} \lVert p_{i+1} - p_i \rVert . \]
The straightness of a centerline C is measured as the reciprocal of the tortuousness of C, i.e.,
\[ \mathrm{straightness}(C) = \frac{1}{\mathrm{tortuousness}(C)}, \qquad \mathrm{tortuousness}(C) = \frac{1}{N+1} \sum_{i=0}^{N} d(p_i, \ell), \]
where the tortuousness is defined as the average distance of the points of C to the straight line ℓ passing through the first and the last point of C; the tortuousness of the centerline of a channel is a non-negative real number close to zero when C is nearly rectilinear and assumes greater values as the path loses its rectilinear behaviour. As a consequence, high straightness values denote almost rectilinear centerlines, while low values indicate a more curvilinear behaviour. Lastly, the volume of a channel can be simply retrieved by computing the volume of the region of space obtained as the union, for p varying in C, of the balls B(p, ρ_p) of radius ρ_p centred at p. In formulas, this corresponds to
\[ \mathrm{volume}(C) = \iiint_{\bigcup_{p \in C} B(p, \rho_p)} 1 \, dx \, dy \, dz . \]
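The three measures can be sketched in a few lines of NumPy. The function names are illustrative and not those of the tools released with the dataset. Two choices are assumptions of this example: the volume of the union of balls is estimated by Monte Carlo sampling (the text does not say how the integral is evaluated), and the straightness of a perfectly rectilinear centerline is reported as infinity.

```python
import numpy as np

def centerline_length(points):
    """Sum of the distances between consecutive (sorted) centerline points."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

def tortuousness(points):
    """Average distance of the points from the line through first and last point."""
    points = np.asarray(points, dtype=float)
    direction = points[-1] - points[0]
    direction = direction / np.linalg.norm(direction)  # assumes distinct endpoints
    offsets = points - points[0]
    # Subtract the component along the line; the remainder is orthogonal to it.
    orthogonal = offsets - np.outer(offsets @ direction, direction)
    return float(np.linalg.norm(orthogonal, axis=1).mean())

def straightness(points):
    """Reciprocal of the tortuousness (infinite for a rectilinear centerline)."""
    t = tortuousness(points)
    return float("inf") if t == 0.0 else 1.0 / t

def channel_volume(points, radii, samples=200_000, seed=0):
    """Monte Carlo estimate of the volume of the union of the balls B(p, rho_p)."""
    points = np.asarray(points, dtype=float)
    radii = np.asarray(radii, dtype=float)
    rng = np.random.default_rng(seed)
    lo = (points - radii[:, None]).min(axis=0)  # bounding box of all balls
    hi = (points + radii[:, None]).max(axis=0)
    x = rng.uniform(lo, hi, size=(samples, 3))
    inside = np.zeros(samples, dtype=bool)
    for p, r in zip(points, radii):
        inside |= np.sum((x - p) ** 2, axis=1) <= r * r
    return float(np.prod(hi - lo) * inside.mean())

bent = [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
bent_length = centerline_length(bent)      # 2 * sqrt(2)
bent_straightness = straightness(bent)     # tortuousness is 1/3
unit_ball = channel_volume([[0.0, 0.0, 0.0]], [1.0])  # close to 4/3 * pi
```

The Monte Carlo estimate is simple but noisy; a voxel grid or an exact union-of-balls algorithm would be natural replacements when higher accuracy is needed.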

Comparison measures
Interesting information arises from comparing the output of different computational methods. Such an analysis can involve channels of different molecular structures or, as discussed in more detail in Section 6, channels recognized on the same molecule by different software tools.
A preliminary comparison between centerlines can be achieved by considering some of the previously introduced measures such as the length, the straightness, and the volume.
To make the comparison between channel centerlines more effective, we are interested in identifying portions of the centerlines common to both channels. This identification is measured by the function match. Given two centerlines C and C′, we consider a point p ∈ C matched with C′ if there exists a point p′ ∈ C′ such that the Euclidean distance between p and p′ is below a certain threshold value. We define match(C, C′) as the percentage of points of C matched with C′.
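A direct transcription of this definition is sketched below. The threshold value is not stated in the text, so the default of 1Å used here is an assumption, as is the function signature.

```python
import numpy as np

def match(C, C_prime, threshold=1.0):
    """Percentage of points of C lying within `threshold` of some point of C_prime."""
    C = np.asarray(C, dtype=float)
    C_prime = np.asarray(C_prime, dtype=float)
    # Pairwise Euclidean distances, shape (len(C), len(C_prime)).
    dist = np.linalg.norm(C[:, None, :] - C_prime[None, :, :], axis=2)
    matched = dist.min(axis=1) < threshold
    return 100.0 * float(matched.mean())

long_line = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
short_line = [[0.1, 0.0, 1.0], [0.1, 0.0, 2.0]]
forward = match(long_line, short_line)   # only the two central points match
backward = match(short_line, long_line)  # every point of the short line matches
```

The measure is not symmetric: when one centerline is a portion of the other, the shorter one matches completely while the longer one does not.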
Another comparison measure between two centerlines evaluates the difference between the radius values assumed by the centerlines on their matched portions. Specifically, the matching between two centerlines C and C′ defines a correspondence γ : M → M′ between two portions M and M′ of C and C′, respectively. The distance d_ρ(C, C′) between the radius values of C and C′ is defined, analogously to the L^2-norm, as
\[ d_\rho(C, C') = \Big( \sum_{p \in M} \big( \rho_p - \rho_{\gamma(p)} \big)^2 \Big)^{1/2}, \]
where we recall that ρ_p denotes the radius of the largest sphere centered at p completely contained in the considered channel.
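The radius distance can be sketched under two assumptions that the text leaves open: the correspondence γ sends each matched point of C to its nearest point of C′, and the distance is the plain (unnormalized) square root of the sum of squared radius differences over the matched portion. The threshold of 1Å and the function name are likewise illustrative.

```python
import numpy as np

def radius_distance(C, rho, C_prime, rho_prime, threshold=1.0):
    """L2-like distance between radius values on the matched portion of C.

    Assumption: gamma maps each matched point of C to its nearest point of C_prime.
    """
    C = np.asarray(C, dtype=float)
    C_prime = np.asarray(C_prime, dtype=float)
    rho = np.asarray(rho, dtype=float)
    rho_prime = np.asarray(rho_prime, dtype=float)
    dist = np.linalg.norm(C[:, None, :] - C_prime[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)                      # gamma(p) for every p in C
    matched = dist[np.arange(len(C)), nearest] < threshold
    diff = rho[matched] - rho_prime[nearest[matched]]
    return float(np.sqrt(np.sum(diff ** 2)))

C = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.0, 0.0, 3.0]]
rho = [1.0, 2.0, 2.0, 1.0]
C_prime = [[0.1, 0.0, 1.0], [0.1, 0.0, 2.0]]
rho_prime = [2.0, 2.5]
d_rho = radius_distance(C, rho, C_prime, rho_prime)  # differences 0 and -0.5
```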

Methods
The proposed GEO-Nav dataset has been used to evaluate and compare two methods for the geometric study of voltage-gated sodium channels: HOLE and Chanalyzer.

HOLE
HOLE [7], implemented in FORTRAN-77, is a de facto standard for the analysis of static transmembrane channel proteins. HOLE relies on user-assisted cavity localization (UACL), i.e., it requires an initial point p that lies anywhere within the channel and a vector v that is approximately in the direction of the channel.
The initial point is first adjusted by a Metropolis Monte Carlo simulated annealing procedure to find the sphere satisfying the following two requirements: (i) its center lies on the plane through p orthogonal to v; (ii) it is the largest that does not overlap with the atoms bordering the channel. Here, an atom is regarded as a sphere of given center and with radius equal to the element's van der Waals radius [52].
Such a sphere, regarded as a probe sphere, is then rolled through the channel, while its radius is adjusted by the Metropolis Monte Carlo simulated annealing procedure. In practice, this is repeated by taking a small displacement in the direction of the vector v until the end of the channel has been reached: this happens when the accommodated sphere radius exceeds a user-defined value, whose default value is 5Å.
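The optimization carried out on a single plane can be illustrated with a toy Metropolis simulated annealing loop. This is a sketch of the general technique, not HOLE's implementation: the step size, the cooling schedule, the number of iterations and all function names are assumptions chosen for the example.

```python
import numpy as np

def clearance(c, centers, radii):
    """Radius of the largest sphere centred at c that does not overlap any atom."""
    return float(np.min(np.linalg.norm(centers - c, axis=1) - radii))

def plane_basis(v):
    """Two orthonormal vectors spanning the plane orthogonal to v."""
    v = v / np.linalg.norm(v)
    helper = np.array([1.0, 0.0, 0.0]) if abs(v[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(v, helper)
    u = u / np.linalg.norm(u)
    return u, np.cross(v, u)

def anneal_in_plane(p, v, centers, radii, steps=2000, step_size=0.1,
                    t0=0.05, cooling=0.998, seed=0):
    """Search the plane through p orthogonal to v for the point of largest clearance."""
    rng = np.random.default_rng(seed)
    u, w = plane_basis(np.asarray(v, dtype=float))
    current = np.asarray(p, dtype=float).copy()
    current_r = clearance(current, centers, radii)
    best, best_r = current.copy(), current_r
    temperature = t0
    for _ in range(steps):
        du, dw = rng.uniform(-step_size, step_size, size=2)
        candidate = current + du * u + dw * w          # in-plane displacement
        candidate_r = clearance(candidate, centers, radii)
        delta = candidate_r - current_r
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < np.exp(delta / temperature):
            current, current_r = candidate, candidate_r
            if current_r > best_r:
                best, best_r = current.copy(), current_r
        temperature *= cooling                          # geometric cooling
    return best, best_r

# Toy pore: 12 atoms of radius 1.5 Å on a ring of radius 5 Å around the z-axis.
angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
centers = np.column_stack((5.0 * np.cos(angles), 5.0 * np.sin(angles), np.zeros(12)))
radii = np.full(12, 1.5)
best, best_r = anneal_in_plane([1.0, 0.5, 0.0], [0.0, 0.0, 1.0], centers, radii)
```

In this toy pore the largest sphere inside the ring is centred on the axis and has radius 3.5Å, so the search should drift from the off-axis starting point toward the axis. Rolling the probe along the channel would amount to repeating the search on successive planes displaced along v.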
The process is restarted from p using the vector −v as direction. To properly sort the points of the centerline C, we developed and applied an algorithm based on graph theory that interprets C as a path from one entry of the channel to the other.
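The text does not detail the graph-based sorting, so the following is only one plausible sketch: build the Euclidean minimum spanning tree of the points with Prim's algorithm and read the ordering off the longest path of that tree. All names are hypothetical, and points lying on side branches of the tree are dropped by this simple version.

```python
import numpy as np

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns adjacency lists."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_cost = dist[0].copy()             # cheapest connection of each node to the tree
    best_parent = np.zeros(n, dtype=int)   # tree node realizing that connection
    adjacency = {i: [] for i in range(n)}
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best_cost)))
        i = int(best_parent[j])
        adjacency[i].append(j)
        adjacency[j].append(i)
        in_tree[j] = True
        closer = dist[j] < best_cost
        best_cost = np.where(closer, dist[j], best_cost)
        best_parent = np.where(closer, j, best_parent)
    return adjacency, dist

def farthest(adjacency, dist, start):
    """Tree node farthest from start (by path length) and the traversal parent map."""
    parent = {start: None}
    length = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                length[nxt] = length[node] + dist[node, nxt]
                stack.append(nxt)
    return max(length, key=length.get), parent

def sort_centerline(points):
    """Indices of the points along the longest path of the spanning tree."""
    points = np.asarray(points, dtype=float)
    adjacency, dist = minimum_spanning_tree(points)
    a, _ = farthest(adjacency, dist, 0)        # one end of the longest path
    b, parent = farthest(adjacency, dist, a)   # the opposite end
    order = [b]
    while parent[order[-1]] is not None:
        order.append(parent[order[-1]])
    return order[::-1]                          # from entrance a to entrance b

shuffled = np.array([[0, 0, 3], [0, 0, 0], [0, 0, 2], [0, 0, 1], [0, 0, 4]], dtype=float)
order = sort_centerline(shuffled)
z_sorted = [float(shuffled[i][2]) for i in order]
```

For a densely sampled centerline the spanning tree is itself a path, so the longest path visits every point once, from one entrance to the other.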

Chanalyzer
Chanalyzer [13] is a tool designed to detect and geometrically describe ion channels in molecular dynamics trajectories through methods from computational geometry. Chanalyzer proceeds in four steps, which are here summarized to give a general idea of the working principles: (1) extraction of the tetrahedral representation of the channel via the alpha shape theory; (2) projection of the tetrahedral representation of the channel onto the SES generated via NanoShaper; (3) skeletonization and extraction of source and target points; (4) centerline computation. The four steps are applied in sequence without the need for manual interaction.
Step 1: extraction of the tetrahedral representation of the channel Starting from a molecule described as a collection of three-dimensional balls, Chanalyzer builds its weighted Delaunay triangulation; then, it filters the simplices to discard those tetrahedra that (i) belong to the dual complex, i.e., the set of simplices that are completely inside the molecule, or (ii) are ancestors of the complement of the convex hull with respect to a discrete flow, which works according to the principle of a fluid flowing into a sink. The connected component having the largest volume corresponds to the volumetric approximation of the channel.
Step 2: projection of the tetrahedral representation of the channel onto the molecular surface After computing a triangular mesh approximation of the Solvent Excluded Surface via the software NanoShaper, a portion of the SES that roughly corresponds to the channel is extracted by discarding all those triangles having their barycenter outside the tetrahedral approximation.
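The barycenter test of Step 2 can be sketched with barycentric coordinates. This is an illustration of the idea rather than Chanalyzer's code: the function names are hypothetical, the loop is brute force (a practical implementation would likely use a spatial index), and degenerate tetrahedra are assumed not to occur.

```python
import numpy as np

def point_in_tetrahedron(q, tet, eps=1e-12):
    """True if q lies inside (or on the boundary of) the tetrahedron tet (4x3 array)."""
    a, b, c, d = tet
    # Solve q - a = l1 (b - a) + l2 (c - a) + l3 (d - a) for the coordinates l.
    T = np.column_stack((b - a, c - a, d - a))
    lam = np.linalg.solve(T, q - a)
    return bool(np.all(lam >= -eps) and lam.sum() <= 1.0 + eps)

def keep_triangles_inside(vertices, triangles, tetrahedra):
    """Keep the triangles whose barycenter falls inside at least one tetrahedron."""
    kept = []
    for tri in triangles:
        barycenter = vertices[list(tri)].mean(axis=0)
        if any(point_in_tetrahedron(barycenter, tet) for tet in tetrahedra):
            kept.append(tri)
    return kept

# One tetrahedron standing in for the volumetric approximation of the channel.
tetrahedra = [np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])]
vertices = np.array([
    [0.1, 0.1, 0.1], [0.3, 0.1, 0.1], [0.1, 0.3, 0.1],   # triangle inside
    [2.0, 2.0, 2.0], [3.0, 2.0, 2.0], [2.0, 3.0, 2.0],   # triangle outside
])
triangles = [(0, 1, 2), (3, 4, 5)]
kept = keep_triangles_inside(vertices, triangles, tetrahedra)
```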
Step 3: skeletonization and extraction of source and target points The mean curvature flow skeleton of the submesh, which corresponds to an approximation of the medial axis of the channel, is computed via the Computational Geometry Algorithms Library (CGAL [53]). Chanalyzer then extracts an approximated centerline of the channel by pruning the just-computed skeleton; it proceeds by maximizing a score that takes into account both the length of a candidate centerline and its rectilinear behaviour.
Step 4: centerline computation The two ends of the approximated centerline serve as source and target points for the Vascular Modeling Toolkit (VMTK [54]), a standard software package that can produce a more accurate version of the centerline. For a given molecular surface, the output of Chanalyzer includes the centerline as a list of three-dimensional points and the list of maximal inscribed ball radius values along them.

Experimental analysis
As a guiding example of the use of the dataset described in Section 3 and evaluation measures introduced in Section 4, we run here the tools HOLE and Chanalyzer described in Section 5 and compare their output.
As previously mentioned, the process of user-assisted cavity localization adopted by HOLE requires the user to provide the initial point and direction of the channel, which are then adjusted by a Metropolis Monte Carlo simulated annealing procedure. In order to better compare the two software tools, we have run HOLE twice for each structure of the dataset providing as initial information the two initial points and directions retrieved as the entrances of the channel by Chanalyzer; this has, additionally, avoided any need for manual interaction.
First, we run the two tools on the three benchmarked models introduced in Section 3, on which we can make quantitative estimates with respect to ground truth. Figure 4 visually depicts the results obtained by running Chanalyzer and HOLE (with respect to two different initial positions). Table 2 reports the average, minimum, and maximum values of the radius of the retrieved channels obtained by running Chanalyzer and HOLE on the benchmarked models. Given that for the benchmarked models the channel radius (i.e., the radius of the maximal inscribed ball) has been set to 1.4Å, we can observe that Chanalyzer achieves accurate average results and a small difference between the minimum and the maximum values. In contrast, as shown in Figure 4, the channels returned by HOLE protrude outward from the model surface. This affects the obtained radius values. As confirmed by Figure 5, the radius values achieved by HOLE are correct for the central portion of the retrieved channel while they grow significantly at the channel extremities. A complete quantitative validation is provided in Table 3. Chanalyzer returns shorter centerlines and lower volumes but with higher straightness values. Table 4 compares centerlines and radii estimated by the two tools in each of the three models: from these numbers we can conclude that centerlines provided by Chanalyzer correspond to a subpart of those extracted by HOLE, with just small differences in the radius values. A similar conclusion is reached from Figure 5.
Figure 5: Graphs of the radius functions of the centerlines of the models considered in Figure 4. For each structure, the centerline retrieved by Chanalyzer is depicted in blue while we represent in orange the one among the two produced by HOLE obtaining the higher matching score. Moreover, vertical dashed lines denote the extrema of the interval in which the two centerlines are identified as matched.
We then proceed with the remaining structures. Figure 6 visually depicts, for some selected structures, the results obtained by running Chanalyzer and HOLE. A complete validation is offered in Table 5, which collects the measure values of the channels obtained by applying the two tools on the 21 PDB entries of the proposed dataset. The values in Table 5 provide useful information for characterizing the behaviour of the tools Chanalyzer and HOLE. In the vast majority of cases, Chanalyzer returns centerlines with lower length and volume values. On the other hand, this tool typically ensures higher values for the straightness parameter. Figure 6 gives a visual confirmation of this trend, revealing that HOLE usually produces longer centerlines protruding outward more than the ones retrieved by Chanalyzer which, however, can identify centerlines following a more rectilinear path. In Figure 6 it is also possible to note a marked similarity between the radius values associated with the points of the centerlines returned by the two tools; this trend is confirmed by Table 6 and by Figure 7. Table 6 collects the values of the comparison measures introduced in Section 4 for the channels obtained by running Chanalyzer and HOLE on the 21 PDB entries of the proposed dataset. Figure 7 depicts the graphs of the functions associating each point p of a centerline with the radius ρ_p.
The shown examples concern the models considered in Figure 6. For each model, the considered centerlines are the one retrieved by Chanalyzer and the one among the two produced by HOLE obtaining the higher matching score. Focusing on Table 6, the percentage of matching between the centerlines returned by Chanalyzer and HOLE varies widely (from a perfect matching for the structure 7K48 to a 0% matching for 7W9K). Apart from a few specific cases, the two methods produce output with good matching percentages. The values of match(Ch., ·) are almost always greater than the ones of match(·, Ch.). This is due to the fact that the centerlines returned by Chanalyzer are typically a portion of the ones retrieved by HOLE. Moreover, columns d_ρ of Table 6 and Figure 7 reveal that, on the portion they have in common, the centerlines obtained by the two methods are characterized by quite similar radius values. Table 5 and Table 6 enable us to describe more rigorously the examples depicted in Figure 6. For structure 7K48, the channel identified by the two methods is the same, except for the length. On the contrary, the channel retrieved by Chanalyzer for model 7W9K does not correspond to either of the two returned by HOLE. For structure 5EK0, there is a good match between the output of Chanalyzer and the channel produced by the second run of HOLE. Finally, even if on their common part the radius values are pretty similar, the channels identified for model 8FHD have a quite limited matching percentage.
Table 6: Comparison measure values between the channels obtained by running Chanalyzer (Ch.) and HOLE (H.1 and H.2, with respect to two different initial positions) on the proposed dataset. Specifically, columns match report the percentage of points of a centerline matched with a different centerline. In columns d_ρ, we collect the distance between the radius functions of two centerlines on their portion identified as matched.
Table 7 summarizes the robustness of Chanalyzer and HOLE to local noise. Indeed, for each of the 21 structures retrieved from the Protein Data Bank repository, GEO-Nav is enriched with three additional structures that were created by adding uniform noise locally (see Section 3) and, therefore, it is also possible to assess the geometric robustness of algorithms. In this comparison, the table reports the smallest match values and the largest distances computed between the exact structures and each of the synthetic ones. This test highlights that Chanalyzer is markedly more robust than HOLE, even though some structures point to potential challenges.

Conclusions
In this paper we have presented GEO-Nav, a dataset of 84 voltage-gated sodium channels: 21 structures originate from the Protein Data Bank Repository, while the remaining 63 are produced by introducing some uniform random perturbation. To also yield a quantitative estimation of the quality of a channel recognition method, we synthetically created three channels modelled as carbon atoms arranged around an axis to form cylinders of known radius.
Figure 7 (panels 5EK0, 7K48, 7W9K, 8FHD): Graphs of the radius functions of the centerlines of the models considered in Figure 6. For each structure, the centerline retrieved by Chanalyzer is depicted in blue while we represent in orange the one among the two produced by HOLE obtaining the higher matching score. Moreover, vertical dashed lines denote the extrema of the interval in which the two centerlines are identified as matched.
To construct the dataset, we started from Nav channel structures for which a PDB entry is available (sometimes there are multiple PDB entries related to the same structure). As a contribution of this dataset, we provide for a good number of Nav channels a molecular surface (namely the SES), offer perturbed surfaces to test the robustness of tools with respect to small perturbations in the atom displacement and provide measurements to evaluate the properties of the recognized channels. The dataset comes together with several geometric measures that are designed to quantitatively analyse the morphology of the channels, e.g., to study and compare the centerline of each channel and the maximal inscribed ball radius values along it. GEO-Nav has been tested using two methods that are publicly available (a sphere-based approach and a tessellation-based tool), showcasing its potential utility for further research and drug design efforts targeting these channels.
To the best of our knowledge, GEO-Nav is the first dataset that is specifically designed to study the geometric structure of Nav channels. The availability of a geometry-aware dataset provides valuable resources for advancing the understanding of protein channels in general and, consequently, of their biological function. As a future perspective, we plan to continue to enrich the study with other channels and pores, as well as to consider dynamic features.