Linear indices of the "macromolecular graph's nucleotides adjacency matrix" as a novel approach in bioinformatics studies. 1. Prediction of paromomycin's affinity constant with HIV-1 Ψ-RNA packaging region

The design of novel anti-HIV compounds has become a crucial area of research worldwide. This paper presents a new set of macromolecular descriptors of relevance to nucleic acid QSAR/QSPR studies, the nucleic acid linear indices, which are calculated from the macromolecular graph's nucleotide adjacency matrix. A study of the interaction of the antibiotic paromomycin with the packaging region of the HIV-1 Ψ-RNA was performed as an example of this approach. A multiple linear regression model predicted the local binding affinity constants [log K (10M)] between a specific nucleotide and the aforementioned antibiotic. The linear model explains more than 87% of the variance of the experimental log K (R = 0.93 and s = 0.102x10 M), and leave-one-out press statistics evidenced its predictive ability (q = 0.82 and scv = 0.108x10M). The comparison with other approaches (macromolecular quadratic indices, Markovian negentropies and 'stochastic' spectral moments) reveals a good performance of our method.


Introduction
The number of newly discovered genomes has increased dramatically in recent years, which has once again highlighted the problem of protein and nucleic acid function. 1,2 The complete sequencing of the genomes of various species will undoubtedly contribute to a better understanding of their evolution. Public databases such as GenBank are growing at an exponential rate. 1 A significant proportion of the data corresponds to genomic sequences containing the structures not only of many genes but also of RNA. 3,4 The study of the interactions of drugs with biomolecules is now a major topic in modern bioinformatics, and this kind of study constitutes a significant step towards rational drug design. In this sense, footprinting techniques have proven to be an important experimental method for the discovery of significant processes in molecular biology, specifically in the field of genomics. 5-9 The interactions between aminoglycosides and the packaging region of type-1 HIV (Human Immunodeficiency Virus) appear to represent a promising route for antiviral discoveries. 10 Aminoglycoside drugs are cationic natural products that interact with RNA. 11 The bactericidal effects of these compounds stem from their ability to block protein synthesis by binding to the A-site on ribosomal RNA. 12
Recently, our group introduced a novel scheme for rational in silico molecular design (or selection/identification of chemicals) and for QSAR/QSPR studies, the so-called TOpological MOlecular COMputer Design (TOMOCOMD). 13 This method generates molecular fingerprints based on discrete mathematics and linear algebra theory. In this sense, atom, atom-type and total quadratic and linear molecular fingerprints have been defined in analogy to the quadratic and linear mathematical maps. 14,15 This approach has been successfully employed in QSPR and QSAR studies, 14-24 including studies related to nucleic acid-drug interactions. 25 The TOMOCOMD-CARDD (acronym of Computed-Aided 'Rational' Drug Design) strategy is very useful for the selection of novel subsystems of compounds having a desired property/activity, 22-24 which can be further optimized by using some of the many molecular modeling methods available to medicinal chemists. The method has also demonstrated flexibility in relation to many different problems. For instance, the TOMOCOMD-CARDD approach has been applied to the fast-track experimental discovery of novel anthelmintic compounds. 22,24 The prediction of the physical, physicochemical and chemical properties of organic compounds is a problem that can also be addressed using this approach. 14,19,21 Codification of chirality and other 3D structural features constitutes another advantage of this method. 20 The approach also allows the interpretation of the descriptors' significance and their comparison with other molecular descriptors. 15,19 Additionally, promising results have been found in the modeling of the interaction between drugs and HIV packaging-region RNA in the field of bioinformatics using the TOMOCOMD-CANAR (Computed-Aided Nucleic Acid Research) approach. 25 Finally, an alternative formulation of our approach for the structural characterization of proteins was carried out recently. 26 This extended methodology [TOMOCOMD-CAMPS (Computed-Aided Modelling in Protein Science)] was used to study protein stability, specifically how an alanine scan of the Arc repressor wild-type protein affects protein stability, by means of a combination of protein quadratic indices (macromolecular fingerprints) and statistical (linear and non-linear) methods. 26 Therefore, describing an extended TOMOCOMD-CANAR approach to account for RNA structure constitutes the main aim of this paper.
In the present study, we propose total and local definitions of nucleic acid linear indices of the "macromolecular graph's nucleotides adjacency matrix". In addition, the present work focuses on developing quantitative structure-property relationships to predict the affinity with which paromomycin binds to the HIV-1 Ψ-RNA packaging region, and on comparing our results with those of other cheminformatic methods previously reported.

Computational methods
A nucleic acid is a long, unbranched polynucleotide, i.e., a polymer consisting of nucleotides. Each nucleotide has the following three components: 1) a cyclic five-carbon sugar, 2) a purine or pyrimidine base attached to the 1'-carbon atom of the sugar by an N-glycosidic bond, and 3) a phosphate attached to the 5'-carbon of the sugar by a phosphoester linkage. The nucleotides in nucleic acids are covalently linked by a second phosphoester bond that joins the 5'-phosphate of one nucleotide and the 3'-OH group of the adjacent nucleotide. The purine and pyrimidine bases are not engaged in any covalent bonds to each other. Thus, a polynucleotide consists of an alternating sugar-phosphate backbone, and each nucleotide is characterized by the base attached to it, which can be either adenine (A), cytosine (C), guanine (G) or thymine (T) [RNA molecules contain the base uracil (U) instead of T]. Consequently, an RNA molecule is uniquely determined by the sequence of bases along its chain, and it has a definite orientation. 27-30
In particular, a typical RNA is a single-stranded polyribonucleotide. This macromolecule has a folded 3D conformation that is held together in part by non-covalent base-pairing interactions like those that hold together the two strands of the DNA helix. In the single-stranded RNA molecule, however, the complementary base pairs form between nucleotide residues in the same chain, which causes the RNA molecule to fold up in a unique way that is important for its biochemical activity. In this sense, the RNA structure also contains several sets of unpaired nucleotide residues. Most of the weak interactions (hydrogen bonds) form between Watson-Crick complementary bases (between pairs of non-consecutive bases), i.e., between A and U and between C and G, but a far from negligible number of bonds also form between other pairs of bases, for example the G·U wobble pairs. 27-30
On the other hand, the general principles of the molecular linear indices of the "molecular pseudograph's atom adjacency matrix" for small-to-medium sized organic compounds have been explained in some detail elsewhere. 14-17,20 Nevertheless, this work gives an extended overview of this approach.
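First, in analogy to the molecular vector X used to represent organic molecules, we introduce the macromolecular vector $X_m$, whose components are the values of a base property that characterizes each nucleotide in the nucleic acid molecule. Table 1 lists the properties of the DNA-RNA bases used as nucleotide (base) descriptors.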
This approach allows us to encode RNA sequences such as AGUCACGUA through the macromolecular vector $X_m$ = [0.28, 0.20, 0.18, 0.13, 0.28, 0.13, 0.20, 0.18, 0.28], in the $f_1$-scale (see Table 1). This vector belongs to the product space $\Re^{9}$. The use of other DNA-RNA base properties defines alternative macromolecular vectors.
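The encoding described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `F1_SCALE` holds the four values that can be read off the example vector in the text (Table 1 itself is not reproduced here, so no value for T is given), and the function name `macromolecular_vector` is a hypothetical one chosen for this sketch.

```python
# f1-scale values for A, G, U and C, read off the example vector in the text.
# Assumption: values for other bases (e.g. T) would come from Table 1.
F1_SCALE = {"A": 0.28, "G": 0.20, "U": 0.18, "C": 0.13}


def macromolecular_vector(sequence, scale=F1_SCALE):
    """Map each base of a sequence to its property value (one coordinate per nucleotide)."""
    vector = []
    for base in sequence.upper():
        if base not in scale:
            raise ValueError(f"no property value available for base {base!r}")
        vector.append(scale[base])
    return vector


x_m = macromolecular_vector("AGUCACGUA")
print(len(x_m), x_m)
```

Passing a different dictionary as `scale` would produce one of the alternative macromolecular vectors mentioned in the text.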

Local (nucleotide) nucleic acid's linear indices of the "macromolecular graph's nucleotide adjacency matrix"
If a nucleic acid consists of $n$ nucleotides (vector of $\Re^{n}$), then the $k$th nucleic acid's linear indices, $f_k(x_{mi})$, are calculated as linear maps on $\Re^{n}$ in the canonical basis, as shown in Eq. 1,

$$f_k(x_{mi}) = \sum_{j=1}^{n} {}^{k}a_{ij}\,{}^{m}X_j \qquad (1)$$

where ${}^{k}a_{ij} = {}^{k}a_{ji}$ (symmetric square matrix), $n$ is the number of nucleotides of the nucleic acid and ${}^{m}X_j$ are the coordinates of the macromolecular vector ($X_m$) in a system of basis vectors of $\Re^{n}$. The coordinates of the same vector will be different according to the basis vectors chosen. 32-35 The values of the coordinates thus depend in an essential way on the choice of the basis. In the so-called canonical ('natural') basis, $e_j$ denotes the $n$-tuple having 1 in the $j$th position and 0's elsewhere. In the canonical basis, the coordinates of any vector X coincide with the components of this vector. 32-35 For that reason, those coordinates can be considered as weights of the vertices (DNA-RNA bases) of the graph of the nucleic acid's backbone.
The coefficients ${}^{k}a_{ij}$ are the elements of the $k$th power of the nucleotide adjacency matrix $M(G_m)$ of the macromolecular pseudograph $G_m$ (map's matrix). Equation (1) for $f_k(x_{mi})$ can therefore be written as the single matrix equation

$$[f_k(x_m)] = M^{k}\,[{}^{m}X]$$

where $[f_k(x_m)]$ is the column vector whose $i$th entry is $f_k(x_{mi})$, $[{}^{m}X]$ is a column vector (an $n \times 1$ matrix) of the coordinates of $X_m$ in the canonical basis of $\Re^{n}$, and $M^{k}$ is the $k$th power of the matrix $M(G_m)$.
Table 2 exemplifies the calculation of $f_k(x_m)$ for an RNA fragment with secondary structure.
Table 2. A close-up of the mathematical definition of total (RNA fragment) and local (nucleotide) nucleic acid linear indices of the "macromolecular graph's nucleotide adjacency matrix" of an RNA fragment. Note: $X_m \in \Re^{13}$. In the definition of $X_m$, as macromolecular vector, the symbol of each base is used to indicate the corresponding DNA-RNA base property, for instance, $f_1$. That is, if we write A it means $f_1$(A), the first oscillator strength value of adenine, or some other base property that characterizes each nucleotide in the nucleic acid molecule. So, if we use the canonical basis of $\Re^{13}$, the coordinates of any macromolecular vector $X_m$ coincide with the components of that macromolecular vector.
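As a rough illustration of Eq. 1 in its matrix form, the following Python sketch computes the local (nucleotide) linear indices by repeated multiplication with the adjacency matrix. The five-nucleotide hairpin, its adjacency matrix, the zero diagonal and the function names are assumptions made for this example only; they are not taken from Table 2.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a column vector (list)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def local_linear_indices(adjacency, x_m, k_max):
    """Return [f_0, f_1, ..., f_kmax]; f_k[i] is the k-th local index of nucleotide i.

    f_k = M^k x_m is obtained by multiplying by M repeatedly, so f_0 = x_m.
    """
    series = [list(x_m)]
    for _ in range(k_max):
        series.append(mat_vec(adjacency, series[-1]))
    return series


# Hypothetical 5-nucleotide hairpin G-A-A-A-C: four backbone links plus one
# G1-C5 hydrogen-bonded pair. Diagonal entries are assumed to be zero.
ADJACENCY = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
X_M = [0.20, 0.28, 0.28, 0.28, 0.13]  # f1-scale values for G, A, A, A, C

for k, values in enumerate(local_linear_indices(ADJACENCY, X_M, 2)):
    print(k, [round(v, 2) for v in values])
```

For k = 1 each local index is simply the sum of the property values of the nucleotides adjacent to nucleotide i, which makes the descriptor easy to check by hand on small fragments.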

Total (whole-molecule) linear indices of the "macromolecular graph's nucleotide adjacency matrix"
Total nucleic acid's linear indices are linear functionals on $\Re^{n}$. 15-17,24,25 That is, the $k$th total nucleic acid's linear index is a linear map from $\Re^{n}$ to the scalar field $\Re$ [$f_k(x_m)$: $\Re^{n} \to \Re$]. The mathematical definition of these molecular descriptors is the following:

$$f_k(x_m) = \sum_{i=1}^{n} f_k(x_{mi})$$

where $n$ is the number of nucleotides and $f_k(x_{mi})$ are the nucleic acid's linear indices (linear maps) obtained by Eq. 1. Then, the linear form $f_k(x_m)$ can be written in matrix form as

$$f_k(x_m) = [u]^{t}\,M^{k}\,[{}^{m}X]$$

for every macromolecular vector $X_m \in \Re^{n}$, where $[u]^{t}$ is an $n$-dimensional unitary row vector (all entries equal to 1), $M^{k}$ is the $k$th power of the nucleotide adjacency matrix and $[{}^{m}X]$ is the column vector of the coordinates of $X_m$. As can be seen, the $k$th total linear index is calculated by summing the local (nucleotide) linear indices of all nucleotides in the nucleic acid.
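The summation that defines the total indices can be illustrated with a short Python sketch. The unpaired trinucleotide A-G-U, its adjacency matrix (with an assumed zero diagonal) and the function name `total_linear_indices` are hypothetical choices for this example.

```python
def total_linear_indices(adjacency, x_m, k_max):
    """Return [f_0(x_m), ..., f_kmax(x_m)], each the sum of the k-th local indices."""
    totals = []
    local = list(x_m)  # the local indices for k = 0 are the vector itself
    for _ in range(k_max + 1):
        totals.append(sum(local))  # same as applying the row vector of ones
        # advance to the next order: multiply the local-index vector by M
        local = [sum(a * x for a, x in zip(row, local)) for row in adjacency]
    return totals


# Hypothetical unpaired trinucleotide A-G-U: only the two backbone links exist.
PATH = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
totals = total_linear_indices(PATH, [0.28, 0.20, 0.18], 2)
print([round(t, 2) for t in totals])
```

Because each total index is a plain sum over nucleotides, restricting the sum to a subset of positions would give fragment-level values with the same machinery.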

Local (nucleotide-type) nucleic acid's linear indices of the "macromolecular graph's nucleotide adjacency matrix"
In any case, a complete series of indices performs a specific characterization of the chemical structure. The generalization of the matrices and descriptors to "superior analogues" is necessary for the evaluation of situations where a single descriptor is unable to provide a good structural characterization. 36 The local macromolecular indices can also be used together with total ones as variables for QSAR/QSPR modeling of properties or activities that depend more on a region or a fragment than on the macromolecule as a whole.

Results and discussion
The data set of footprinted and binding nucleotides was extracted from the literature. 37 Figure 1 depicts the secondary structure of the HIV-1 Ψ-RNA packaging region as well as the binding sites of paromomycin.
The predictability and stability of the model (Eq. 8) with respect to data variation were assessed by means of LOO cross-validation. The model shows a cross-validation standard error of only 0.108. Table 3 lists the observed values of log K together with the values predicted by Eq. 8 and those predicted after the LOO cross-validation procedure.
Two of the present authors reported a similar equation (see Eq. 9) using local (nucleotide) quadratic indices. 25 In developing that quantitative model for the description of log K, they detected one nucleotide (A276) as a statistical outlier. This equation is given below with its statistical parameters.
Table 4 shows a comparison with these previously described approaches.

Conclusions
Although there have been many discoveries in the field of bioinformatics in recent years, novel macromolecular descriptors that can explain different biomacromolecular properties by means of a QSAR approach still need to be defined. In this sense, the approach described here represents a novel and very promising method for bioinformatics research. It presents a new set of macromolecular descriptors that are calculated from the macromolecular graph's nucleotide adjacency matrix. We have shown here that the local (nucleotide) nucleic acid linear indices are able to describe the affinity with which paromomycin binds to the HIV-1 Ψ-RNA packaging region. The resulting model is significant from a statistical point of view. A LOO cross-validation experiment revealed that the QSAR model has good predictive ability. The satisfactory comparative results suggest that the nucleic acid linear indices used here can serve as a novel chemoinformatics and bioinformatics tool for further research.

Footprinting Data
The data set of footprinted and binding nucleotides was extracted from the literature. 37 Figure 1 depicts the secondary structure of the HIV-1 Ψ-RNA packaging region as well as the binding sites of paromomycin. A representation of the Ψ-RNA appears along with a summary of binding/enhancement information for paromomycin. The RNA consists of the 'main stem', positions 213-238 and 361-388; SL-1, which contains the dimer initiation site; SL-2, which has the 5' splice donor site; SL-3; and SL-4, which contains the start codon (AUG) for the gag gene.

TOMOCOMD-CANAR Software
TOMOCOMD is an interactive program for molecular design and bioinformatics research. 13 The program is composed of four subprograms, each of which handles drawing structures (drawing mode) and calculating 2D and 3D molecular descriptors (calculation mode). The modules are named CARDD (Computed-Aided 'Rational' Drug Design), CAMPS (Computed-Aided Modeling in Protein Science), CANAR (Computed-Aided Nucleic Acid Research) and CABPD (Computed-Aided Bio-Polymers Docking).
In this paper we outline salient features of only one of these subprograms: CANAR. This subprogram follows a user-friendly philosophy and requires no prior programming knowledge. The calculation of total and local (nucleotide) macromolecular linear indices for any nucleic acid was implemented in the TOMOCOMD-CANAR software. 13 The following list briefly summarizes the main steps for the application of this method in QSAR/QSPR:
1. Draw the macromolecular graph ($G_m$) for each RNA/DNA of the data set, using the software's drawing mode. This is done by selecting the active nucleotide symbol. Here, we consider only covalent interactions (phosphodiester bonds) and hydrogen-bond interactions (between complementary bases).
2. Use appropriate purine and pyrimidine base weights in order to differentiate the residues in each nucleotide. This work uses five properties of the DNA-RNA bases as nucleotide weights (see Table 1). 31 This parametrization is done using the properties of U, T, A, G, and C only, because the bases are the only part that differs among these nucleotides.
3. Compute the nucleic acid linear indices of the "macromolecular graph's nucleotides adjacency matrix". This is done in the software's calculation mode, in which the DNA-RNA base properties and the descriptor family can be selected before the macromolecular indices are calculated. The software generates a table in which the rows and columns correspond to the compounds and the $f_k(x_m)$, respectively.
4. Find a QSPR/QSAR equation by using statistical techniques, such as multilinear regression analysis (MRA), Neural Networks (NN), Linear Discriminant Analysis (LDA), and so on. That is to say, we can find a quantitative relation between a property P and the $f_k(x_m)$ having, for instance, the following appearance,

$$P = a_0 f_0(x_m) + a_1 f_1(x_m) + a_2 f_2(x_m) + \dots + a_k f_k(x_m) + c \qquad (12)$$

where P is the measurement of the property, $f_k(x_m)$ [or $f_{kL}(x_m)$] is the $k$th total [or local] macromolecular linear index, and the $a_k$'s are the coefficients obtained by the statistical analysis.
5. Test the robustness and predictive power of the QSPR/QSAR equation by using internal and external cross-validation techniques.
6. Develop a structural interpretation of the obtained QSAR/QSPR model using macromolecular linear indices as molecular descriptors.
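Step 1 of the list above is carried out interactively in the software. For readers who want to reproduce the graph outside the program, the following Python sketch builds a nucleotide adjacency matrix from consecutive (phosphodiester) links and hydrogen-bonded base pairs. The dot-bracket input format, the zero diagonal and the function name are assumptions made for this illustration; they are not part of TOMOCOMD-CANAR.

```python
def adjacency_from_dot_bracket(structure):
    """Build a symmetric 0/1 adjacency matrix from a dot-bracket string.

    Consecutive positions are linked (backbone); matching '(' and ')' are
    linked (hydrogen-bonded base pair); '.' marks an unpaired nucleotide.
    """
    n = len(structure)
    adjacency = [[0] * n for _ in range(n)]

    def connect(i, j):
        adjacency[i][j] = adjacency[j][i] = 1

    for i in range(n - 1):
        connect(i, i + 1)  # phosphodiester bond between neighbours in the chain

    open_positions = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' in structure")
            connect(open_positions.pop(), j)  # base pair closes the innermost '('
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if open_positions:
        raise ValueError("unbalanced '(' in structure")
    return adjacency


for row in adjacency_from_dot_bracket("((...))"):
    print(row)
```

The resulting matrix can be combined with a vector of base properties to compute the indices in step 3.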

Statistical Analysis
Based on the discussion above, a simple linear model was proposed to predict drug-nucleotide affinity. Multiple Linear Regression (MLR) was used to obtain a quantitative model. This statistical analysis was carried out with the STATISTICA software package. 40 Forward stepwise selection was used as the strategy for variable selection. The tolerance parameter (the proportion of variance that is unique to the respective variable) was set to the default value for minimum acceptable tolerance, which is 0.01.
The quality of the MLR model was determined by examining the statistical parameters of the regression and of the cross-validation procedures, namely the regression coefficient (R), the determination coefficient ($R^2$), the p-level of the Fisher ratio [p(F)], the standard deviation of the regression (s) and the leave-one-out (LOO) press statistics ($q^2$, $s_{cv}$). 41
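The statistics named above can be illustrated with a small, dependency-free Python sketch. It is not the STATISTICA procedure used in this work: the formulas $q^2 = 1 - \mathrm{PRESS}/SS_{tot}$ and $s_{cv} = \sqrt{\mathrm{PRESS}/(n - p - 1)}$ are common conventions assumed here, the data are invented, and all function names are hypothetical.

```python
import math


def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented system (A^T A | A^T y)
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def regression_statistics(X, y):
    """Return R2, s, q2 and s_cv for a linear model with an intercept."""
    n, p = len(y), len(X[0])
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    coefs = fit_ols(X, y)
    ss_res = sum((yi - predict(coefs, r)) ** 2 for r, yi in zip(X, y))
    press = 0.0
    for i in range(n):  # leave one observation out, refit, predict it
        loo_coefs = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(loo_coefs, X[i])) ** 2
    dof = n - p - 1  # assumed degrees of freedom for both s and s_cv
    return {
        "R2": 1 - ss_res / ss_tot,
        "s": math.sqrt(ss_res / dof),
        "q2": 1 - press / ss_tot,
        "s_cv": math.sqrt(press / dof),
    }


# Invented data with one descriptor, for illustration only
X_DEMO = [[0.0], [1.0], [2.0], [3.0], [4.0]]
Y_DEMO = [0.1, 0.9, 2.2, 2.8, 4.1]
print(regression_statistics(X_DEMO, Y_DEMO))
```

Since each LOO residual is at least as large in magnitude as the corresponding fitted residual, $q^2$ never exceeds $R^2$; a small gap between the two is usually read as a sign of a stable model.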