MartiTracks: a geometrical approach for identifying geographical patterns of distribution.

Panbiogeography represents an evolutionary approach to biogeography, using rational, cost-efficient methods to reduce the initial complexity of locality data and depict general distribution patterns. However, few quantitative and automated panbiogeographic methods exist. In this study, we propose a new algorithm, within a quantitative, geometrical framework, to perform panbiogeographical analyses as an alternative to more traditional methods. The algorithm first calculates a minimum spanning tree, which is an individual track for each species in a panbiogeographic context. The spatial congruence among segments of the minimum spanning trees is then calculated using five congruence parameters, producing a general distribution pattern. In addition, the algorithm removes the ambiguity and subjectivity often present in a manual panbiogeographic analysis. Results from two empirical examples, using 61 species of the genus Bomarea (2340 records) and 1031 genera of both plants and animals (100118 records) distributed across the Northern Andes, demonstrated that a geometrical approach to panbiogeography is a feasible quantitative method to determine general distribution patterns for taxa, reducing complexity and the time needed for managing large data sets.


Introduction
The geographic distribution of species has been considered an important source for documenting and conserving biodiversity [1]. Given the exponential growth of distributional data [2,3], the necessity for procedures and bioinformatics tools to facilitate data management, reduce data complexity, and find general patterns from distributional point records has increased.
In this context, different biogeographic approaches make use of tools to manage and analyze this kind of data. Among these approaches, panbiogeography is considered an important tool for the primary management of distributional data [4], because it focuses on the spatial or geographical component as a fundamental precondition to any analysis of the patterns and processes of evolutionary change [5-7]. This evolutionary approach to biogeography was developed by Croizat [8-10] as a response to Darwin's biogeographic ideas on means of dispersal in geographic distribution [11].
Panbiogeography delimits distributional patterns for multiple species through a method known as track analysis. This method is based on three graphic elements: individual tracks, generalized tracks, and nodes [5,7,12,13]. An individual track is made up of lines drawn on a map that connect the different localities or distribution points of a particular taxon, such that the sum of the segment lengths connecting all distribution points is the smallest possible. In graph theory, an individual track is a minimum spanning tree (hereafter MST) [5,14,15]. Generalized tracks, or standard tracks, are lines on a map resulting from overlapping individual tracks; as such, they are considered repetitive patterns summarizing the distributions of diverse individual taxa [16]. These patterns reflect an ancestral biota that has been fragmented by tectonic or climatic events [17]. Finally, nodes are areas where two or more generalized tracks overlap. These are complex areas or tectonic and biotic convergence zones [12,14,15,17]. Thus, these three elements (individual tracks, generalized tracks, and nodes) define the main steps of track analysis [14]. First, two or more individual tracks are calculated from geographic locality records; then, generalized tracks are delimited through geographic congruence of individual tracks; and finally, nodes are identified as the intersection area(s) between generalized tracks. Different approaches exist within panbiogeographic methods, for example Croizat's manual reconstruction [9,10], Page's spanning graphs [15], Craw's track compatibility [18], and PAE ("Parsimony Analysis of Endemicity") [5,19-22]. Nevertheless, there are few quantitative and automated approaches for mapping generalized tracks with software implementations (e.g. Craw's compatibility track analysis [18,23]).
Considering that individual and generalized tracks are lines in a geometrical context, and congruence of individual tracks is a geometric property, in this study, we describe new software, named MartiTracks, based on a new algorithm to perform a panbiogeographic track analysis using a geometrical approach. The algorithm includes geometric functions and processes, which makes this approach a feasible quantitative alternative to the traditional track analysis. Finally, this approach is a unique and useful technique to capture distributional patterns or structures in studies employing spatial data.

Results
The general framework
For a new MartiTracks project, distribution point records (latitude and longitude data) of a particular set of taxa must be compiled. A typical MartiTracks input file is a text file with the following structure: taxon-name, latitude, and longitude. These data points are used to build an individual track for each species. The spatial congruence of the individual tracks is then evaluated by the congruence algorithm to determine whether there are generalized tracks representing general patterns of distribution. Finally, the individual tracks of each species and the generalized patterns of distribution are written to a KML (Keyhole Markup Language) file that can be visualized using any Geographic Information System (GIS) program, such as Google Earth or QGIS (Figure 1).
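The input and output ends of this framework can be illustrated with a minimal Python sketch. It is not MartiTracks' own code (the program is written in Free Pascal), and several details are assumptions: the fields are taken to be comma-separated, the taxon labels are placeholders, and the function names `read_records` and `track_to_kml` are hypothetical.

```python
import csv
import io
from collections import defaultdict
from xml.sax.saxutils import escape


def read_records(text):
    """Group (lat, lon) points by taxon from 'taxon,lat,lon' lines.

    Assumption: fields are comma-separated and there is no header row.
    """
    points = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        taxon, lat, lon = row[0].strip(), float(row[1]), float(row[2])
        points[taxon].append((lat, lon))
    return dict(points)


def track_to_kml(name, edges):
    """Render a track, given as ((lat, lon), (lat, lon)) edges, as a KML Placemark."""
    lines = []
    for (lat1, lon1), (lat2, lon2) in edges:
        # KML writes each coordinate as lon,lat,alt
        lines.append(
            "<LineString><coordinates>"
            f"{lon1},{lat1},0 {lon2},{lat2},0"
            "</coordinates></LineString>"
        )
    return (
        f"<Placemark><name>{escape(name)}</name>"
        f"<MultiGeometry>{''.join(lines)}</MultiGeometry></Placemark>"
    )
```

A complete KML file would additionally wrap such placemarks in a `<kml>` root element and a `<Document>`; that wrapper is omitted here for brevity.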

First step: Minimum spanning trees (MST)
Like most panbiogeographic software, for example Croizat [23] or Trazos2004 [24], MartiTracks initially creates an MST representing an individual track. When two or more points are found at the same place, or are close enough to be considered the same sampling point, they are reduced to a single point using a minimum Euclidean distance parameter that we call the cut value. This parameter reduces initial redundancy in the data sets, speeding up the calculation of MSTs.

Figure 1. MartiTracks' framework. The user specifies an input file containing species distributional data (latitude-longitude). Then, these geographic points are used to calculate a minimum spanning tree (MST) for each species. Finally, the MSTs are analyzed by the congruence algorithm in order to delimit general patterns of distribution. The output is a KML file. doi:10.1371/journal.pone.0018460.g001

Second step: Spatial congruence among species
Spatial congruence between two MSTs. Once the individual tracks are defined, the panbiogeographic method determines the spatial congruence of the individual tracks in order to delimit generalized tracks representing general patterns of distribution. The geometrical approach of MartiTracks considers each MST segment, or edge, as the basic unit of congruence between two species. Thus, given an individual track or MST defined as MST = (v, e), consisting of a set v of vertices together with a set e of edges, a segment s_i belonging to MST_a is defined as the edge e_i connecting two endpoint vertices v_i (Figure 2).
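The first step, which yields the vertex and edge sets just described, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the MartiTracks implementation: it uses planar Euclidean distance on raw coordinates, treats two points as the same sampling point when they lie within the cut value of each other, and builds the tree with a naive version of Prim's algorithm. The function names are hypothetical.

```python
import math


def dist(p, q):
    """Planar Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def collapse_points(points, cut_value):
    """Keep a point only if it lies farther than cut_value from every kept point.

    Assumption: points at exactly cut_value apart are also merged.
    """
    kept = []
    for p in points:
        if all(dist(p, q) > cut_value for q in kept):
            kept.append(p)
    return kept


def minimum_spanning_tree(points):
    """Naive Prim's algorithm; returns the MST as a list of (point, point) edges."""
    if len(points) < 2:
        return []
    in_tree = [points[0]]
    remaining = list(points[1:])
    edges = []
    while remaining:
        # pick the shortest edge joining the tree to a point outside it
        u, v = min(
            ((a, b) for a in in_tree for b in remaining),
            key=lambda e: dist(e[0], e[1]),
        )
        edges.append((u, v))
        in_tree.append(v)
        remaining.remove(v)
    return edges
```

The naive edge search is cubic in the number of points, which is why collapsing redundant points first shortens the MST calculation.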
The core of MartiTracks' geometrical approach is the function that calculates the shortest distance from a point to a segment. This function was developed by Paul Bourke and can be found at http://local.wasp.uwa.edu.au/pbourke/geometry/pointline/. Given segment P1-P2 and point P3 (Figure 3), the distance d from point P3 to segment P1-P2 is defined as the distance between P3 and the intersecting point P obtained by dropping a perpendicular from P3 onto segment P1-P2. If the perpendicular from P3 does not intersect the segment, the function takes the shortest distance from P3 to either endpoint of segment P1-P2.
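A point-to-segment distance of this kind can be written in a few lines of Python. The sketch below follows the description above using the standard projection-parameter formulation. It is an illustration rather than the code shipped with MartiTracks, and the function name and the returned `on_segment` flag are assumptions made for this example.

```python
import math


def point_segment_distance(p3, p1, p2):
    """Return (distance, on_segment) from point p3 to segment p1-p2.

    on_segment is True when the perpendicular from p3 meets the segment
    itself rather than its extension.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # degenerate segment: p1 and p2 coincide
        return math.hypot(x3 - x1, y3 - y1), False
    # u is the position of the perpendicular foot along p1 -> p2
    u = ((x3 - x1) * dx + (y3 - y1) * dy) / length_sq
    if 0 <= u <= 1:
        px, py = x1 + u * dx, y1 + u * dy
        return math.hypot(x3 - px, y3 - py), True
    # no intersecting point: fall back to the nearer endpoint
    d1 = math.hypot(x3 - x1, y3 - y1)
    d2 = math.hypot(x3 - x2, y3 - y2)
    return min(d1, d2), False
```

Returning the flag alongside the distance lets a caller distinguish a true perpendicular intersection from the endpoint fallback.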
Given two segments s_a and s_b belonging to species a and b, respectively, we consider these two segments congruent if any of the vertices v_i in segment s_a has an intersecting point P_a on e_j, or if any of the vertices v_j in segment s_b has an intersecting point P_b on e_i (Figure 4A); or if both vertices v_i in segment s_a intersect on e_j, or if both vertices v_j in segment s_b intersect on e_i (Figure 4B). If there are no intersecting points P_a or P_b on edges e_j and e_i, respectively, then segments s_a and s_b are not congruent (Figure 4C).
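The three situations of Figure 4 can be told apart with a small Python sketch. It assumes that a vertex "has an intersecting point" when its perpendicular foot falls within the other segment, endpoints included. The labels 'any', 'both', and 'none' and the function names are hypothetical, chosen only for this illustration.

```python
def projects_onto(p, a, b):
    """True if the perpendicular from p meets segment a-b (not its extension)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return False  # a degenerate segment offers nothing to project onto
    u = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    return 0 <= u <= 1


def projection_case(seg_a, seg_b):
    """Classify a segment pair as 'both' (Fig. 4B), 'any' (Fig. 4A) or 'none' (Fig. 4C)."""
    a_hits = [projects_onto(v, *seg_b) for v in seg_a]
    b_hits = [projects_onto(v, *seg_a) for v in seg_b]
    if all(a_hits) or all(b_hits):
        return "both"
    if any(a_hits) or any(b_hits):
        return "any"
    return "none"
```

For instance, a short segment lying alongside a longer one is classified as 'both', whereas two collinear segments placed end to end with a gap between them are classified as 'none'.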
As congruence also depends on the Euclidean distances between segments and points, the maximum and minimum distances between segments are calculated in order to define two decision rules of congruence. Under these rules, two segments are congruent if the minimum and maximum distances between them do not exceed predefined limits.
For the first rule, let s_a and s_b be segments belonging to species a and b, respectively, and let dmin be the minimum distance, dmax the maximum distance, lmin the boundary of the minimum distance, and lmax the boundary of the maximum distance. The two segments are congruent if the first congruence condition is fulfilled (Figure 4A) and if (0 ≤ dmin ≤ lmin) and (0 ≤ dmax ≤ lmax) are true (see Figure 5A).
The second rule is defined by the maximum distance within the spatial range. Let s_a and s_b be segments belonging to species a and b, respectively, and let dmax.line be the maximum distance within the line segment and lmax.line the boundary of that maximum distance. The two segments are congruent if both v_i in segment s_a have intersecting points on e_j, or if both v_j in segment s_b have intersecting points on e_i (Figure 4B), and if (0 ≤ dmax.line ≤ lmax.line) is true (see Figure 5B).
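The two decision rules can be expressed as simple predicates. The Python sketch below takes the distances as inputs and does not prescribe how dmin, dmax, and dmax.line are measured, since the text does not spell that out. It also rests on two assumptions: the projection situation is passed in as a hypothetical label ('any' for Figure 4A, 'both' for Figure 4B, 'none' for Figure 4C), and satisfying either rule is taken to be sufficient for congruence.

```python
def rule_one(case, dmin, dmax, lmin, lmax):
    """First rule: at least one vertex projects onto the other segment
    and both distances stay within their limits."""
    return case in ("any", "both") and 0 <= dmin <= lmin and 0 <= dmax <= lmax


def rule_two(case, dmax_line, lmax_line):
    """Second rule: both vertices of one segment project onto the other
    and the maximum distance within the line stays within its limit."""
    return case == "both" and 0 <= dmax_line <= lmax_line


def segments_congruent(case, dmin, dmax, dmax_line, lmin, lmax, lmax_line):
    """Assumption: two segments are congruent when either rule holds."""
    return rule_one(case, dmin, dmax, lmin, lmax) or rule_two(
        case, dmax_line, lmax_line
    )
```

Under this reading, a pair in which both vertices project can still be accepted by the second rule when its distances exceed lmin or lmax, provided dmax.line does not exceed lmax.line.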
Finally, if two segments are found to be congruent, their points will be connected through a new MST. Then, each segment of species a is compared to all other segments of species b until the whole MST of species a has been compared. The same procedure is carried out from species b to a. If the congruence between two species is null, no tracks or new MSTs will be created.
Spatial congruence among MSTs. The spatial congruence among MSTs is the criterion that defines whether a generalized track exists; if a species is not congruent with the remaining species, no generalized tracks are generated. Once all species have been compared and some level of congruence is detected, a generalized track is created. When the analysis is complete, some repeated tracks may result, which can be reduced to a unique solution by means of a similarity index (SI). This index measures the similarity between tracks (either individual or generalized tracks) and, depending on a pre-established threshold, determines whether two tracks can be considered the same element. Given two MSTs a and b (Figure 6), the similarity index between them is calculated from the length of their congruent segments and the total lengths of MST_a and MST_b.
SI_ab = (length of congruent segments MST_ab) / (total length of MST_a)
SI_ba = (length of congruent segments MST_ba) / (total length of MST_b)
It is important to emphasize that this is an asymmetrical index, because it depends on the lengths of the MSTs. Thus, in Figure 6, SI_ab differs from SI_ba because MST_a is longer than MST_b. Given i as the higher of the values SI_ab and SI_ba, and min-SI as the predefined threshold value of SI, if (i ≥ min-SI) the geographical points of the MSTs of species a and b are joined and become part of the same MST.
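As a worked illustration of the index and the merge decision, consider the following Python sketch. The function names are hypothetical, and treating a zero-length track as having a similarity of 0 is an assumption made only to avoid a division by zero.

```python
def similarity_indices(congruent_len_ab, congruent_len_ba, total_len_a, total_len_b):
    """Return (SI_ab, SI_ba): congruent length divided by each track's total length."""
    # Assumption: a zero-length track (single point) gets a similarity of 0.
    si_ab = congruent_len_ab / total_len_a if total_len_a > 0 else 0.0
    si_ba = congruent_len_ba / total_len_b if total_len_b > 0 else 0.0
    return si_ab, si_ba


def should_merge(si_ab, si_ba, min_si):
    """Two tracks are joined when the higher of the two indices reaches min-SI."""
    return max(si_ab, si_ba) >= min_si
```

With illustrative values of 3 length units of congruent segments in each direction, a total length of 10 for track a and 4 for track b, the indices are 0.3 and 0.75. The tracks would be joined for a min-SI of 0.7 but not for a min-SI of 0.8.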
Finally, the parameters cut value, lmin, lmax, lmax.line, and min-SI can be predefined according to the user's required level of congruence. It is important to consider that the value of each congruence parameter depends on the values of the other parameters. Accordingly, these values are subject to a constraining rule: cut value < lmin < lmax < lmax.line.

Figure 6. Similarity index (SI). MartiTracks calculates the similarity among tracks through a similarity index (SI). Given two tracks a, b (either individual or generalized tracks), the similarity index SI_ab is the length of the congruent segments from a to b divided by the total length of MST_a. In the same way, the similarity index SI_ba is the length of the congruent segments from b to a divided by the total length of MST_b. doi:10.1371/journal.pone.0018460.g006
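The five parameters and their ordering constraint can be grouped in a small container that rejects inconsistent settings. The Python sketch below is illustrative only: the class and field names are hypothetical, the constraint is read as a chain of strict inequalities, and the example values in the usage note are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CongruenceParameters:
    cut_value: float   # minimum distance below which points are merged
    lmin: float        # boundary of the minimum distance
    lmax: float        # boundary of the maximum distance
    lmax_line: float   # boundary of the maximum distance within the line segment
    min_si: float      # similarity index threshold

    def __post_init__(self):
        # Assumed reading of the rule: cut value < lmin < lmax < lmax.line
        if not (self.cut_value < self.lmin < self.lmax < self.lmax_line):
            raise ValueError("expected cut value < lmin < lmax < lmax.line")
```

For example, `CongruenceParameters(0.1, 0.5, 1.0, 2.0, 0.8)` is accepted, whereas swapping the first two values raises a `ValueError`.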

Empirical analyses
Panbiogeographical analysis of the genus Bomarea (Alstroemeriaceae). An empirical analysis was conducted with 2340 records belonging to 61 species of the genus Bomarea, obtained from the Global Biodiversity Information Facility, GBIF (http://www.gbif.org/datasets/resources/ 24/07/2010). We used three different sets of parameter values in order to calculate the general distributional patterns of Bomarea at different levels of congruence. The generalized tracks obtained with one of the sets are shown in Figure 7A.
To run the program, we used a PC-compatible computer with an Intel Core 2 Quad Q6600 at 2.40 GHz and 4 GB of RAM, running 64-bit Ubuntu 9.04. The panbiogeographical analyses of the genus required 30 to 60 seconds.
These results were compared to those of a previous panbiogeographic work on Bomarea by Alzate et al. [25], which used a traditional panbiogeographic analysis. In contrast to our analysis, Alzate et al. used 2205 records belonging to 101 species of the genus Bomarea. Although the number of species evaluated differs between the two analyses, similar patterns of distribution were found (Figure 7B).

Panbiogeographical analysis from the Northern Andes. We analyzed 100118 georeferenced records belonging to 1031 genera of plants and animals, distributed across the Northern Andes, in order to evaluate MartiTracks' efficiency with large data sets. This data set was obtained from the Global Biodiversity Information Facility, GBIF (http://www.gbif.org/datasets/resources/ 26/06/2009), and was not filtered for errors in distributions or taxonomy, thereby mimicking an exploratory analysis of a very large data set. Four parameter sets were employed to visualize general patterns of distribution at different levels of congruence.
Depending on the parameters used, analyses of the Northern Andes data generated several patterns including 3 to 27 generalized tracks. Figure 8 shows the three general patterns found with one of the parameter sets evaluated. The analyses required 15 to 30 minutes. These results indicate that MartiTracks can reduce data complexity and find common distribution patterns in large data sets within a reasonable processing time.

Discussion
As the amount of geographical information rapidly grows, the need for bioinformatic tools able to deal with this kind of data has increased. For panbiogeographical analyses, MartiTracks is a feasible quantitative alternative to traditional track analysis (e.g. manual reconstruction or Craw's compatibility track analysis). Because the analysis is automated, the ambiguity and subjectivity that arise when overcrowded geographical points are evaluated [26,27] are eliminated. Another significant advantage of MartiTracks is that the geometrical approach greatly reduces the time needed for analyzing large data sets, as shown in the Northern Andes analysis. Thus, a single computer can easily deal with data sets involving tens of thousands of geographical records. Finally, by setting different distance parameters, which define the level of congruence, users can explore several levels of resolution for analyzing their data sets according to their requirements.

Materials and Methods
MartiTracks was written in Free Pascal under the Linux operating system (Ubuntu 10.04, 64-bit). Compiled versions of the program for Windows and Linux platforms, along with the source code, are freely available under a GNU General