Thermodynamics of Microarray Hybridization

Microarrays make the use of hybridization properties of nucleic acids to monitor Deoxyribonucleic acid (DNA) or Ribonucleic acid (RNA) abundance on a genomic scale in different types of cells. The hybridization process takes place between surface-bound DNA sequences the probes, and the DNA or RNA sequences in solution the targets. Hybridization is the process of combining complementary, single-stranded nucleic acids into a single molecule. Nucleotides will bind to their complement under normal conditions, so two perfectly complementary strands will bind to each other readily. Conversely, due to the different geometries of the nucleotides, a single inconsistency between the two strands will prevent them from binding.


Introduction
Microarrays make the use of hybridization properties of nucleic acids to monitor Deoxyribonucleic acid (DNA) or Ribonucleic acid (RNA) abundance on a genomic scale in different types of cells.The hybridization process takes place between surface-bound DNA sequences -the probes, and the DNA or RNA sequences in solution -the targets.Hybridization is the process of combining complementary, single-stranded nucleic acids into a single molecule.Nucleotides will bind to their complement under normal conditions, so two perfectly complementary strands will bind to each other readily.Conversely, due to the different geometries of the nucleotides, a single inconsistency between the two strands will prevent them from binding.
In oligonucleotide microarrays hundreds of thousands of oligonucleotides are synthesized in situ by means of photochemical reaction and mask technology.Probe design in these microarrays is based on complementarity to the selected gene or an expressed sequence tag (EST) reference sequence.An important component in designing an oligonucleotide array is ensuring that each probe binds to its target with high specificity.
The dynamics of the hybridization process underlying genomic expression is complex as thermodynamic factors influencing molecular interaction are still fields of important research [1] and their effects are not taken into account in the estimation of genetic expression by the algorithms currently in use.

State of the art
Many techniques have been developed to identify trends in the expression levels inferred from DNA microarray data, and recently the attention was devoted to methods to obtain accurate expression levels from raw data on the underlying principles of the thermodynamics and hybridization kinetics.The development of DNA chips for rapidly screening and sequencing unknown DNA segments mainly relies on the ability to predict the thermodynamic stability of the complexes formed by the oligonucleotide probes.
The thermodynamics of nucleic acids have been studied from different points of view.Wu et al. [2] analyze the temperature-independent and temperature-dependent thermodynamic parameters of DNA/DNA and RNA/DNA oligonucleotide duplexes.The differences between DNA polymer and oligonucleotide nearest-neighbour thermodynamic trends, and the salt dependence of nucleic acid denaturation allowed to SantaLucia [3] to show that there is length dependence to salt effects but not to the nearest-neighbour propagation energies.
An early study on DNA microarray hybridization [4] found that it was strongly dependent on the rate constants for DNA adsorption/desorption in the non-probe covered regions of the surface, the two-dimensional diffusion coefficient, and the size of probes and targets and also suggested that sparse probe coverage may provide results equal to or better than those obtained with a surface totally covered with DNA probes.A theoretical analysis of the kinetics of DNA hybridization demonstrated that diffusion was important in determining the time required to reach equilibrium and was proportional to the equilibrium binding constant and to the concentration of binding sites [5].
Newer studies on hybridization kinetics and thermodynamics reveal that perfect match sequences require less time to reach saturation than mismatches.The experimental results of Dai et al. [6] exhibit inverse temporal behaviour, resulting that surface-bound oligos hybridizing primarily with their perfect complement sequence tend to equilibrate more slowly than do those whose binding is dominated by mismatch duplexes.Considering the assumptions, it has been demonstrated [7] that the hybridization time can in fact increase the accuracy of expression ratios, and that this effect may be more dramatic for larger fold changes.Separation between specific and nonspecific binding events can avoid the confusion about what RNA hybridizes the probes.In this case analysis of the perfect match and mismatch intensities in terms of simple single-base related parameters indicates that the intensity of complementary MM introduces a systematic source of variation compared with the intensity of the respective PM probe [8].
The hybridization of nucleic acids was modelled [9] according with the supposition that the process of hybridization goes through an intermediate state in which an initial short contact region has a single-stranded conformation prior to binding.
The hybridization theory gave the possibility of developing models that can be used to obtain improved measures of expression useful for data analysis.Naef and Magnasco [10] propose a simpler model to describe the probe effect that considers only the sequence composition of the probes.They demonstrate that the interactions between nearest neighbours add much predictive power for specific signal probe effects.The stochastic model proposed by Wu and Irizarry [11] can be used to improve the expression measure or in the normalization and summarization of the data.
Thermodynamics of Microarray Hybridization 465

DNA hybridization
DNA is a nucleic acid that contains the genetic instructions monitoring the biological development of all cellular forms of life, and many viruses.DNA is a long polymer of nucleotides and encodes the sequence of the amino-acid residues in proteins using the genetic code, a triplet code of nucleotides.DNA it is organized as two complementary strands, head-to-toe, with the hydrogen bonds between them.Each strand of DNA is a chain of chemical "building blocks", called nucleotides, of which there are four types: adenine (A), cytosine (C), guanine (G) and thymine (T).Between the two strands, each base can only bond with one single predetermined other base: A with T, T with A, C with G, and G with C, being the only possible combination.
Hybridization refers to the annealing of two nucleic acid strands following the base pairing rule.As shown in Figure 1, at high temperatures approximately 90°C to 100°C the complementary strands of DNA separate, denature, yielding single-stranded molecules.Two single strands under appropriate conditions of time and temperature e.g.65°C, will renaturate to form the double stranded molecule.Nucleic acid hybrids can be formed between two strands of DNA, two strands of RNA or one strand of DNA and one of RNA.Nucleic acids hybridization is useful in detecting DNA or RNA sequences that are complementary to any isolated nucleic acid.Finding the location of a gene or gene product by adding specific radioactive or chemically tagged probes for the gene and detecting the location of the radioactivity or chemical on the chromosome or in the cell after hybridization is called in-situ hybridization.
In the same way, in microarray technology, hybridization is used in comparing mRNA abundance in two samples, or in one sample and a control.RNA from the sample and control are extracted and labeled with two different fluorescent labels, e.g. a red dye for the RNA from the sample population and green dye for that from the control population.Both extracts are washed over the microarray and gene sequences from the extracts hybridized to their complementary single-strand DNA molecule previously attached to the microarray.Then, to measure the abundance of the hybridized RNA, the array is excited by a laser.
In the oligonucleotide microarrays the hybridization process occurs in the same way, the only difference here is that the sequences to be laid over the chip are sequences of 25 nucleotides length, perfect complementary to same length sequence of the gene, PMperfect match, and sequences of 25 nucleotides length, designed to correspond to PM, but having the middle base -the 13 th one, changed by its complementary base, MM -mismatch, as in Figure 2. The MM probes give some estimates of the random hybridization and cross hybridization signals.One principle to be followed in the design of oligonucleotide arrays is ensuring that the probes bind to their target with high accuracy.When the two strands are completely complementary they will bind by a specific hybridization, as it can be seen in Figure 3. On the contrary if there are mismatches between the nucleotides of the strands and they bind, a process called non-specific hybridization or cross-hybridization occurs.The hybridization process has been studied from point of view of interaction between base pairs, the interaction with unintended targets and also from its kinetics processes.Because in practice the DNA chips are immersed in the target solution for a relatively short time, the arrival to equilibrium is not guaranteed.Yet full analysis of the reaction kinetics requires knowledge of the equilibrium state.An understanding of the equilibrium state is also necessary to identify the relative importance of kinetic controls of the performance of the DNA microarrays.The effect of the cross-hybridization on probe intensity is predictable in the oligonucleotide microarrays, and models for avoiding this have been developed [14], [15], [16] some aspects of it going to be described in the following section.

Technical factors affecting gene expression 4.1. Thermodynamics parameters
Black and Hartley [18] define enthalpy as the sum of the internal energy of a thermodynamic system plus the energy associated with work done by the system on the atmosphere, which is the product of the pressure times the volume, as in equation ( 1) Because enthalpy is a property, its value can be determined for a simple compressible substance once two independent, intensive thermodynamic properties of the substance are known, and the change in enthalpy is independent of the path followed between two equilibrium states In [18] the entropy, S, was defined using the following equation: where Q  is an amount of heat introduced to the system and T is a constant absolute temperature.Since this definition involves only differences in entropy, the entropy itself is only defined up to an arbitrary additive constant.
The following models to be described use the state function parameters, enthalpy and entropy.State functions define the properties of a thermodynamic state.In a change between two thermodynamic states, the change in value of the state function is given by the symbol  .
The standard enthalpy change, H   , is the difference in the standard enthalpies of formation between the products and the reactants.This state function is associated with changes in bonding between reactants and products.Changes in enthalpy during reactions are measured by calorimetry experiments.
The standard entropy change, S   , is the difference in standard entropies between reactants and products.Entropy is a measure of the degree of order in a chemical system due to bond rotations, other molecular motions, and aggregation.The more random a system (disorder), the greater the entropy is.The larger a structure, the more degrees of freedom it has, and the greater its entropy.

Interaction between pairs
The nucleic acid duplex stability can be endangered by the interaction between the nucleotide bases.Thermodynamics for double helix formation of DNA/DNA, RNA/RNA or DNA/RNA can be estimated with nearest neighbour parameters.Enthalpy change, H   , entropy change, S   , free energy change, G   , and melting temperature, Tm, were obtained on the basis of the nearest-neighbour model.
The nearest-neighbour model for nucleic acids, known as the NN model, assumes that the stability of a given base pair depends on the identity and orientation of neighbouring base pairs [3].Previous studies in NN model parameters were brought forth in [15] and [19].
In the NN model, sequence dependent stability is considered in terms of nearest-neighbour doublets.In duplex DNA there are 10 such unique internal nearest-neighbour doublets.Listed in the 5'-3' direction, these are AT/AT TA/TA AA/TT AC/GT CA/TG TC/GA CT/AG CG/CG GC/GC and GG/CC.Dimmer duplexes are represented with a slash separating strands in antiparallel orientation e.g.AC/TG means 5'-AC-3' Watson-Crick base-paired with 3'-TG-5'.
The total difference in the free energy of the folded and unfolded states of a DNA duplex can be approximated at 37 o , with a nearest-neighbour model: where For a specific temperature one can compute the total free energy using the values from Table 1.As described in [19] the melting temperature Tm is defined as the temperature at which half of the strands are in double helical and half are in the random-coil state.A random-coil state is a polymer conformation where the monomer subunits are oriented randomly while still being bonded to adjacent units.
For self-complementary oligonucleotides, the Tm for individual melting curves was calculated from the fitted parameters using the following equation: where R is the general gas constant, i.e. 1.987cal/K mol, the CT is the total strand concentration, and Tm is given in K.For non-self-complementary molecules, CT in equation ( 5) was replaced by CT/4.present a complicated sequence dependence.

Interaction with unintended targets
As seen in previous sections the major issue in microarray oligonucleotide technology is the selection of probe sequences with high sensitivity and specificity.It has been shown [22] that the use of MM probes for assessment of non-specific binding is unreliable.Since the duplex formation in solution has been studied using the nearest neighbour model [3], [15] the microarray design in terms of probe selection has been achieved by using a model based on the previously mentioned nearest neighbour model [16].The model of Zhang et al. presents some modification to the nearest neighbour model, firstly to assign different weight factors at each nucleotide position on a probe with the scope of reflecting that the binding parts of a probe may contribute differently to the stability of bindings, and secondly to take into account two different modes of binding the probes: gene specific binding, i.e. formation of DNA-RNA duplexes with exact complementary sequences, and non-specific binding, i.e. formation of duplexes with many mismatches between the probe and the attached RNA molecule.They called their model, the positional-dependent-nearest-neighbour model.

Thermodynamics of Microarray Hybridization 471
According with their method, the observed signal Iij for probe i in the probe set for gene j is modelled as: where B is the background intensity, Nj is the number of expressed mRNA molecules contributing to gene specific binding, N * represents the number of RNA molecule contributing to nonspecific binding, E and E * are the binding energies for gene specific and respectively nonspecific binding.These energies are calculated as the weighted sum of stacking energies: where k is the same as the stacking energy used in nearest neighbour model [15].
The positional-dependent-nearest-neighbour model appears to indicate that the two ends of a probe contribute less to binding stability according to their weight factors, see Figure 4. a).It also can be observed that there is a dip in the gene specific binding weight factors of MM probes around the mismatch position, probably due the mismatch which destabilizes the duplex structure.In Figure 4. b) it can be noted that stacking energies in the positionaldependent-nearest-neighbour model can give an explanation for the presence of negative probe pair signals.
This model, together with the nearest neighbour model solves the problem of binding on microarrays, but still there are factors that affect the gene expression measuring.One of them affects the process of competing adsorption and desorption of target RNA to from probe-target duplexes at the chip surface.

Derivation of the Langmuir isotherm
For molecules in contact with a solid surface at a fixed temperature, the Langmuir Isotherm, developed by Irving Langmuir in 1916, describes the partitioning between the gas phase and adsorbed species as a function of applied pressure.
The adsorption process between gas phase molecules, A, vacant surface sites, S, and occupied surface sites, SA, can be represented by the following chemical equation, assuming that there are a fixed number of surface sites present on the surface, as in Figure 5.
When considering adsorption isotherms it is conventional to adopt a definition of surface coverage (  ) which defines the maximum (saturation) surface coverage of a particular adsorbate on a given surface always to be unity, i.e. max  = 1.

Thermodynamic derivation
An equilibrium constant k can be written in terms of the concentrations of "reactants" and "products": where: [SA] is proportional to the surface coverage of adsorbed molecules, or proportional to  ; [S] is proportional to the number of vacant sites, (1 - ); [A] is proportional to the pressure of gas, P.
Thus it is possible to define another equilibrium constant, b:

Thermodynamics of Microarray Hybridization 473
Rearranging the equations (10) and (11) one can obtain the expression for surface coverage:

Kinetic derivation
The equilibrium that may exist between gas adsorbed on a surface and molecules in the gas phase is a dynamic state, i.e. the equilibrium represents a state in which the rate of adsorption of molecules onto the surface is exactly counterbalanced by the rate of desorption of molecules back into the gas phase.It should therefore be possible to derive an isotherm for the adsorption process simply by considering and equating the rates for these two processes.
The rate of adsorption will be proportional to the pressure of the gas and the number of vacant sites for adsorption.If the total number of sites on the surface is N, then the rate of change of the surface coverage due to adsorption is: The rate of change of the coverage due to the adsorbate leaving the surface (desorption) is proportional to the number of adsorbed species: In these equations, ka and kd are the rate constants for adsorption and desorption respectively, and p is the pressure of the adsorbate gas.At equilibrium, the coverage is independent of time and thus the adsorption and desorption rates are equal.The solution to this condition gives us a relation for  , equation ( 12), where ad bkk  .Here b is only a constant if the enthalpy of adsorption is independent of coverage.proportional to the fractions of occupied probes.The fraction of probe sites occupied by probe-target duplexes is then given by the differential equation:

Dynamic absorption model
For the initial condition   00   , equation (15) has the following solution: where bf Kkk  .
Using equation ( 16) Burden et al. estimate the measured fluorescence intensity I, with I0 as the background intensity at zero concentration, to be: At equilibrium, the intensity I(x) at target concentration x follows Langmuir Isotherm (12):

Modelling hybridization by thermodynamics
It is well known that hybridization processes may be seen under the point of view of general thermodynamic conditions [23], meaning that the hybridization probability of a given test segment will be defined by its thermodynamic conditions, i.e. by its hybridization temperature.Regarding this, one can state that hybridization process will respond to the dynamic equation: where P represents the number of oligonucleotides available for hybridization, T the concentration of free RNA target, C the number of bound complexes, kf and kb are the respective forward and backwards rate constants for the reaction.This equation has as a natural solution the following expression in the time domain: Thermodynamics of Microarray Hybridization 475 where K defined as in equation ( 16) is an equilibrium dissociation constant, and  denoting a characteristic time over which the system reaches equilibrium.
Recent studies [24], [25] confirm the hypothesis that the hybridization process for the each of the probe pairs follows a time model according to the one from Figure 7.This model of evolution predicts that the probability of hybridization will be almost zero if not enough time interval is provided for the experiment to take place, and that in the limit, if enough time is allowed saturation will take place.
A practical solution to the different hybridization dynamics can be solved by using multiple regressions to convey PM-MM probe pairs to equivalent thermodynamic conditions by processing diachronic hybridization experiments [26].
The last procedure will be explained in more detail in the following paragraphs.

Exponential regression model
From equation ( 20) one can assume that a model to solve the multiple regression problem implicit in this study will have the following form:   where a and b are parameters to be estimated adaptively using least square fitting and the gradient method.
Vertical least square fitting proceeds by finding the sum of the squares of the vertical deviations R 2 of parameters a and b: where:   is the estimation error incurred for each component.
With this notation equation (22) will became: The condition of R 2 to be at a minimum is that From equations ( 24), ( 25) and ( 26) one will obtain: A solution for equations ( 27) and (28) can be found using the gradient method.In this case the parameters are going to be computed adaptively: where , ik  is defined as in equation ( 23) and β is a parameter used as an adjust step.

Application for experimental data
The experimental part has been complemented with artificially simulated test probes used for algorithmic validation.A diachronic database was also being produced to estimate hybridization time constants for different gene segments.Considering these assumptions data records have been created from experimental data fitted by the above mentioned models.The time dynamics of hybridization for both probe sets and their profiles were evaluated at certain time intervals.
Firstly, the diachronic data distribution for an evolution from 0 to 30 minutes is shown in Figure 8 in both cases, for the PM probe set and the MM probe set, and in the following figures, i.e. Figure 9 and Figure 10, show this time evolution for 60 and 120 minutes is also shown following the model in equation (20).The next step on data analysis was to look at the probe profiles, at certain times.Figure 11 shows the regression parameters obtained for time constants.The profiles of the perfect and mismatch were extracted for two different time values underlining the fact that if enough time is allowed to some probes, the mismatches will also hybridize completely.Considering this and applying the regression algorithm, we observed that this algorithm searches for the matching values of expression levels of probes sets and for estimated values of perfect and mismatch probes.One of the steps of this iterative algorithm can be seen in Figure 12.Once the iterative process was complete, certain probes have reached their target.In the expression level estimation most of the perfect match probes obtained the expected values, while some of the mismatch probes did not reach their target, Figure 13.Similar results were obtained in the case of matching hybridization for time constants.

Conclusions
The thermodynamics of oligonucleotide hybridization processes where PM-MM results do not show the expected behaviour, thus affecting to the reliability of expression estimation, was studied in this chapter and the following conclusions were emphasized:


Modelling the hybridization process through thermodynamical principles reproduces exponential-like behaviour for each P-T segment pair.


The hybridization process should be confined to the time interval where linear growth is granted, this is, at the beginning of the exponential curve shown in Figure 6.


Adaptive fitting may be used to predict and regress expression levels on a specific test probe to common thermodynamic conditions.Time constants may be inferred from the regression parameters adaptively.


The main features of the PM-MM probe sets may be reproduced from probabilistic modelling.


It may be expected that more precise and robust estimations could be produced using this technique with diachronically expressed hybridization experiments.

Figure 2 .
Figure 2. Perfect Match -Mismtach probeset strategy.Sequence of 25-mer length complementary to the selected part of mRNA sequence form a Perfect Match probe, while the Mismatch probe is artificially created by changing middle base with its complementary.In an oligonucleotide array a gene is represented by 11 to 20 probes.(modified from [13])

Figure 3 .
Figure 3. Cross-hybridization on a nucleotide probe.In specific hybridization the sequences are completely complementary, while in non-specific or cross hybridization the sequences contain mismatches.(from [17])


are the weight factors that depend on the position along the probe from the 5' to 3' end, and  

Burden
et al. [14] develop a dynamic adsorption model based on Langmuir isotherm.If x is the concentration of mRNA target and   t  is the fraction of sites occupied by probe-target duplex, then in the forward absorption, target mRNA attaches to probe at a rate and in the backward desorption reaction, target mRNA detaches from probes at a rate   b kt 

Figure 6 .
Figure 6.Hyperbolic response function for the intensity I(x) according to the Langmuir isotherm.

Figure 7 .
Figure 7. Theoretical model for perfect match hybridization.Intensity of perfect match versus hybridization time.(adapted from [24])

Figure 8 .
Figure 8.Time dynamics of hybridization corresponding to perfect and mismatch probes, for a maximum of 30 minutes.

Figure 9 .
Figure 9.Time dynamics of hybridization corresponding to perfect and mismatch probes, for a maximum of 60 minutes.

Figure 10 .
Figure 10.Time dynamics of hybridization corresponding to perfect and mismatch probes, for a maximum of 120 minutes.

Figure 11 .
Figure 11.Profiles corresponding to perfect and mismatch probes for time constants, at 30 and 100 minutes.

Figure 12 .
Figure 12.Top template shows the iterative matching for hidden expression levels.Bottom template shows the iterative matching for perfect and mismatch hybridization.

Figure 13 .
Figure 13.Results for the iterative process of matching.
the standard free-energy changes for 10 possible Watson-Crick nearest neighbours, e.g.Thermodynamics of Microarray Hybridization 469is self complementary and zero if it is not self-complementary.The total difference in the free energy at 37 o , ooGG , ni is the number of occurrences of each nearest neighbour, i, and  

Table 1 .
[3]fied oligonucleotide H   and S   nearest neighbour parameters in 1M NaCl.The tableshows the values of the total enthalpy and entropy for the dimmer duplexes as used in[3].The observed trend in nearest-neighbor stabilities at 37°C is GC/CG = CG/GC > GG/CC > CA/GT = GT/CA = GA/CT = CT/GA > AA/TT > AT/TA > TA/AT, as in Table2.This trend suggests that both sequence and base composition are important determinants of DNA duplex stability.It has long been recognized that DNA stability depends of the percent G-C content.

Table 2 .
Comparison of computed NN free energy parameters at 37 o COn the other hand, the nearest neighbour H   parameters from Table1, do not follow this trend.This suggests that stacking, hydrogen bonding, and other contributions to the H  