Spectral Analysis of Global Behaviour of C. Elegans Chromosomes

Afef Elloumi Oueslati1, Imen Messaoudi1, Zied Lachiri2 and Noureddine Ellouze1 Unite Signal, Image et Reconnaissance de Formes, Departement de Genie Electrique, 1Ecole Nationale d’Ingenieurs de Tunis, BP 37, Campus Universitaire, Le Belvedere, 1002, Tunis, 2Departement de Genie Physique et Instrumentation Institut National des Sciences Appliquees et de Technologie, BP 676, Centre Urbain Cedex, 1080, Tunis, Tunisie


Introduction
Fourier analysis is one of the most useful decompositions into frequency bands for measuring a signal's variations and irregularities. DNA spectral analysis based on the Fourier transform contributes to the systematic search for special DNA patterns that may correspond to biologically important markers. For example, the Fourier harmonic analysis of the occurrences of the base "A" gives the corresponding frequency with an amplitude and a phase, but cannot locate it in time. However, it is also of interest to detect the moments of "silence" of base "A", i.e. the positions where this base does not occur. The Fourier representation is thus limited for signals that contain transitory elements or whose spectral content evolves. For such non-stationary signals, which include DNA sequences, highlighting the frequency behaviour requires letting the frequency content change over time. This is the aim of time-frequency analysis, provided here by the Short-Time Fourier Transform (STFT). Indeed, localization is very important for identifying particular regions in chromosomes and for characterizing the beginning or the end of a protein coding region or of a nucleosome. When the frequencies are depicted by a smoothed STFT, as a 2D or 3D spectrogram, specific regions appear distinctly.
In this paper, we are concerned with the periodicities 3, 6, 9 and 10.5. The periodicity 3, discussed in (Anastassiou, 2001; Berger et al, 2003; Cohanim et al, 2005; Kornberg, 1977; Segal et al, 2006; Susillo et al 2003; Trifonov & Sussman, 1980; Trifonov, 1998; Vaidyanathan & Yoon, 2004), is related to the protein coding regions (called exons) of the gene. The periodicity 10.5 is related to the nucleosome positions in the DNA sequence and to the degree of deformability of the sequence in the DNA helix (Hayes et al, 1990; Trifonov & Sussman, 1980; Widom, 1996; Worcel et al 1981). The periodicities 6 and 9 are specific to the C. elegans organism.

The relevant regions in chromosomes
The specific succession of the bases (A, G, C and T) constitutes the hereditary message. Each DNA fragment is involved in a specific protein synthesis process. Proteins are synthesized from a set of 20 different amino acids, each of which is determined by three bases occurring in consecutive order. A group of three consecutive nucleotides (each with its deoxyribose and phosphate group) is called a codon; the 64 possible combinations specify the 20 amino acids and three stop codons, namely TAA, TAG and TGA. Protein synthesis (Fig. 1) is carried out in two steps: (1) transcription, in which the hereditary information is copied into the messenger RNA, and (2) translation, in which the messenger RNA is used by the ribosome to form the amino acid chain. To obtain numerical data from this succession of symbolic bases of a DNA sequence, we use binary indicator coding techniques.
A model of the DNA structure in nucleosome regions was proposed by Kornberg (Kornberg, 1974, 1977). Chromatin is a dynamic structure, oscillating between the nucleosome and open structures depending on the environmental conditions (Kornberg, 1974, 1977; Oudet et al, 1978). Each nucleosome is formed by two molecules of each of the histones (proteins) H2A, H2B, H3 and H4. Each nucleosome has a diameter of 12.5±1 nm and contains about 200 base pairs of DNA; this number varies according to the origin of the chromatin (Hayes et al 1990; Kornberg, 1977; Oudet et al, 1978; Worcel et al 1981). In contrast, a particle named the 'nucleosome core' is invariant in its DNA content, about 146 base pairs. Electron microscopic evidence presented in (Oudet et al, 1978) suggests that, under appropriate conditions, a nucleosome could open up into two separate half nucleosomes of diameter 9.3±1 nm. The finding of two molecules of each type of histone in the nucleosome has suggested that a nucleosome could be made up of two symmetrical halves (Altenburger, 1976).

Genomic sequence analysis based on the Short-Time Fourier Transform
To locate frequencies more precisely in time, Gabor proposed a local Fourier analysis using windows. The technique consists in segmenting the signal by multiplying it by a sliding window of fixed length (Mallat, 1999). Each segment is analyzed independently with a classical Fourier transform to bring out its frequency behaviour. The set of these transforms forms the Short-Time Fourier Transform and specifies the location of the frequencies in time.
By applying a coding process, numerical signals are obtained that describe the succession of bases: for the binary indicator of base 'A', for example, the signal takes the value 1 at the positions where 'A' occurs and 0 elsewhere. The classical discrete Fourier transform of such a numerical sequence is computed over the whole sequence and therefore carries no position information. To locate the signal frequencies in time, the analysis is applied to parts of the sequence generated by multiplication with a sliding analysis window.
For this purpose, the numerical signal x[n] is divided into frames of length N: the frame of order i is the product of the analysis window w[n] with the N samples of x[n] that start at position iΔn, where i is the window order and Δn is the adopted sliding value. When the coding is based on the binary indicator 'A', x[n] is simply that indicator signal.
The window length must be chosen so that each frame contains enough samples to guarantee an adequate frequency resolution. A Fourier transform is applied to each block x_w[n] to obtain X_w[k], k∈[0:N-1], where k is the frequency index. Many representations can be derived from this expression. Since the sequence corresponds to a chromosome, the first analysis studies the global frequency behaviour. To enhance the frequencies, we use a mean smoothed spectrum, whose principle is to average the spectra obtained over all the frames.
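The expressions referred to in this section can be sketched as follows. This is a reconstruction from the definitions given in the text, not a copy of the original equations: the superscript $(i)$ for the frame order, the symbol $u_A$ for the indicator of base 'A', $s[n]$ for the symbolic sequence, and $N_f$ for the number of frames are notational assumptions.

```latex
\[ u_A[n] = \begin{cases} 1 & \text{if } s[n] = \mathrm{A} \\ 0 & \text{otherwise} \end{cases}
   \qquad n = 0, \dots, N_s - 1 \]

\[ X[k] = \sum_{n=0}^{N_s-1} x[n]\, e^{-j 2\pi k n / N_s} \qquad k = 0, \dots, N_s - 1 \]

\[ x_w^{(i)}[n] = x[n + i\,\Delta n]\; w[n] \qquad n = 0, \dots, N - 1 \]

\[ X_w^{(i)}[k] = \sum_{n=0}^{N-1} x_w^{(i)}[n]\, e^{-j 2\pi k n / N} \qquad k = 0, \dots, N - 1 \]

\[ \bar{X}[k] = \frac{1}{N_f} \sum_{i=0}^{N_f-1} \left| X_w^{(i)}[k] \right| \]
```

With the binary indicator 'A' coding, $x[n]$ is replaced by $u_A[n]$ in the last three expressions.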

Chromosomes generally comprise more than 10 Mbp, so the spectrum obtained needs further smoothing: a second mean, taken over the mean spectra, is applied. The converted DNA sequence x[n] is divided into frames of length M with an overlap Δm. Each of these frames is in turn divided into frames of length N by multiplication with a sliding analysis window w[n].
On each part, a mean smoothed spectrum is generated. Finally, the mean of the spectra over all the parts is calculated; this double mean gives the final spectrum.
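The two-level averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Δm and Δn are interpreted as hop sizes, a Blackman window is assumed, and the toy signal (an indicator with an exact period of 3, mean removed to suppress the zero-frequency peak) stands in for a coded chromosome.

```python
import numpy as np

def mean_spectrum(x, n_len, n_hop, window):
    """Mean magnitude spectrum over sliding windows of length n_len."""
    starts = range(0, len(x) - n_len + 1, n_hop)
    frames = np.array([x[s:s + n_len] * window for s in starts])
    return np.abs(np.fft.fft(frames, axis=1)).mean(axis=0)

def smoothed_mean_spectrum(x, m_len, m_hop, n_len, n_hop):
    """Second-level mean: average the mean spectra of all M-length parts."""
    window = np.blackman(n_len)          # assumed smoothing window
    starts = range(0, len(x) - m_len + 1, m_hop)
    spectra = [mean_spectrum(x[s:s + m_len], n_len, n_hop, window)
               for s in starts]
    return np.mean(spectra, axis=0)

# Toy signal: value 1 every third position (periodicity 3), mean removed.
x = np.zeros(4096)
x[::3] = 1.0
x = x - x.mean()

spec = smoothed_mean_spectrum(x, m_len=1024, m_hop=256, n_len=256, n_hop=128)
# Strongest positive-frequency bin (bin 0 skipped); 256 / 3 is about 85.3.
peak_bin = int(np.argmax(spec[1:129])) + 1
```

The peak lands on the bin nearest to N/3, which is how the periodicity 3 shows up in a spectrum computed with frames of length N.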

The chromosomes coding techniques
This analysis aims to study the global frequency behaviour of the chromosomes. For this purpose, it is important to enhance in particular the signals generated by the protein coding regions and by the nucleosome regions. That is why three types of coding techniques are considered: a linear coding based on binary indicators, a structural coding (Pnuc), and a two-dimensional coding based on the Frequency Chaos Game Representation, which emerged from the field of physics known as 'chaotic dynamical systems'.

Binary indicator's techniques
Linear coding consists in assigning a binary value to each position of the sequence for every indicator, where the indicators belong to {'A', 'T', 'C', 'G', 'TT', 'TA', 'GC', 'AAA', …, 'GGG'}. The associated marker takes the value 1 or 0 at location n, the position of the first character, depending on whether or not the corresponding character group starts at location n.
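As a minimal sketch of this rule, the function below marks with 1 every position where a given character group starts. The function name and the short test sequence are illustrative assumptions, not taken from the chapter.

```python
def binary_indicator(sequence, word):
    """Return a list with 1 at each position n where `word` starts, else 0."""
    seq, word = sequence.upper(), word.upper()
    k = len(word)
    return [1 if seq[n:n + k] == word else 0 for n in range(len(seq))]

seq = "ATTAAAGCTT"                      # illustrative sequence
u_a = binary_indicator(seq, "A")        # single-base indicator
u_tt = binary_indicator(seq, "TT")      # dinucleotide indicator
u_aaa = binary_indicator(seq, "AAA")    # trinucleotide (codon) indicator
```

Each indicator has the same length as the sequence, so every one of them can be fed directly to the windowed Fourier analysis.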
Codon's binary indicator: the association of three bases, called a triplet or codon, plays a fundamental role in the production of amino acids. For this reason, a coding technique based on these base associations is used. We assign a binary indicator to each of the 64 codons (Table 1); each indicator is a binary signal defined along the sequence, where cod denotes the codon and Ns the sequence's length. This marker takes the value 1 or 0 at location n, the position of the first character, depending on whether or not the codon starts at location n.

Pnuc: the structural coding techniques
The second coding technique is Pnuc, which is based on the local bending and flexibility properties of the double helix and is deduced experimentally from nucleosome positioning. Considering the pairing of the two strands (A-T and C-G) along the helix, each base pair defines a plane and a direction in this plane. A description of the double helix thus shows a stack of planes (Fig. 4). If the planes were parallel, passing from one plane to the next would require a translation and a rotation of 34.3° of the direction attached to the plane. In reality the planes are not parallel, and the axis of the double helix is curved. Consider now the interaction between a protein (a histone) and a DNA sequence: this interaction is strongest when the contact area between the two objects is largest. To increase this area, the DNA segment must be wrapped as much as possible around the protein, which brings two properties into play: (1) if the DNA segment is not wrapped around the protein, it is in its equilibrium position and its curvature is static; (2) the strand must be flexible to allow the additional curvature around the protein. Together, these two properties yield the nucleosome, in which the strand is strongly curved.
Each trinucleotide is replaced by its numerical value given by the Pnuc table (Table 2). The signal generated by this coding for a part of a chromosome is shown in Fig. 5; for clarity, the signal is multiplied by 10. In this signal, the periodicity 10 is enhanced, which indicates that it is a characteristic of the helix flexibility.
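The replacement of trinucleotides by table values can be sketched as a simple lookup. The values in `PLACEHOLDER_TABLE` are invented for illustration only and are not the Pnuc values of Table 2; the function name, the step of one base between successive trinucleotides and the default for unlisted words are likewise assumptions.

```python
def pnuc_signal(sequence, table, default=0.0):
    """Map every overlapping trinucleotide to its value in `table`."""
    seq = sequence.upper()
    return [table.get(seq[n:n + 3], default) for n in range(len(seq) - 2)]

# Placeholder values, NOT the real Pnuc table.
PLACEHOLDER_TABLE = {"AAA": 0.0, "AAT": 0.3, "ATG": 0.7, "TGC": 1.0}

signal = pnuc_signal("AAATGC", PLACEHOLDER_TABLE)
```

With the real table in place of the placeholder, the resulting list is the structural signal to which the spectral analysis is applied.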
The CGR (Chaos Game Representation) paradigm is a holistic way of representing DNA that yields a unique scatter picture for each sequence. In 1990, H. Joel Jeffrey used this representation for the first time to study the "non-randomness" of genomic sequences (Jeffrey, 1990). The CGR is an iterative algorithm for drawing fractal images at any desired scale. It maps nucleotide sequences into the [0,1]×[0,1] square, with the four letters A, C, G and T assigned to the corners. The CGR construction algorithm that produces the scatter picture consists of three steps. First, the four letters A, C, G and T are placed at the corners of the unit square. Second, the first point is plotted halfway between the center of the square and the corner corresponding to the first nucleotide of the sequence. Third, each new point is plotted halfway between the previous point and the corner corresponding to the next nucleotide read from the sequence (Almeida et al, 2001; Joseph 2006). The resulting CGR image can be viewed as a distribution of dots. When each side of the unit square is subdivided into $2^n$ equal intervals, $2^n \times 2^n$ cells are obtained, and the number of points counted in each cell is the number of occurrences of a particular pattern of length n.
For illustration, consider a DNA sequence of N nucleotides; the CGR position along this sequence is defined by equation 16. For a random sequence, the result is a square uniformly filled with dots.
The first point $X_0$ is usually placed at the center of the square and thus has the coordinates (0.5, 0.5). Then each next point $X_{n+1}$ is placed halfway along the segment joining the previously plotted point $X_n$ and the vertex corresponding to the letter $s_{n+1}$ of the sequence. Fig. 7 illustrates the construction of the CGR trajectory for the sequence "ATCGG".
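The midpoint rule can be sketched as follows. The corner assignment used here (A and T on the lower corners, C and G on the upper ones) is an assumption, since the vertex assignment did not survive in this copy; the function name is also hypothetical.

```python
# Assumed corner assignment; the chapter's own assignment may differ.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the nucleotide just read."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

trajectory = cgr_points("ATCGG")
```

Under this corner assignment, the first point for "ATCGG" is (0.25, 0.25), halfway between the center and corner A, and each later point moves halfway towards the corner of the next letter.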

Fig. 7. An illustration of CGR trajectory for sequence ″ATCGG″
To derive the CGR plot, the following steps are taken. First, place $X_0$ at the center of the square and the four letters at the corners, as described before (subfigure 1). From the center towards vertex A, mark midpoint 1 (address A) (subfigure 2). From 1 towards T, mark midpoint 2 (address AT) (subfigure 3). From 2 towards C, mark midpoint 3 (address ATC) (subfigure 4). From 3 towards G, mark midpoint 4 (address ATCG) (subfigure 5). From 4 towards G, mark midpoint 5 (address ATCGG) (subfigure 6). By identifying local patterns displayed in the CGR square, it is possible to identify the corresponding features of DNA sequences (Yu et al, 2008). The fractal nature of this kind of DNA representation can be observed in Fig. 8. The clustering of dots in the lower corners indicates a slightly higher concentration of A and T. CGR patterns are known to depict base composition. Indeed, if we divide the CGR space with a grid of order k (i.e. $2^k \times 2^k$ cells) and count the points falling in each cell, we can estimate the frequency of occurrence of the words of length k; the frequency matrix thus extracted is called the FCGR (Frequency Chaos Game Representation) (Almeida et al, 2001; Deschavanne et al, 2000; Jeffrey, 1990).
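Counting CGR points on a grid can be sketched as below. This is an illustration under assumptions: the corner assignment (A and T on the lower corners) and the function name are not taken from the chapter, and the first k-1 points are skipped because they do not yet encode a full word of length k. The test sequence is the example sequence the chapter gives for the FCGR signals.

```python
# Assumed corner assignment; row 0 of the grid is the bottom of the square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def fcgr(sequence, k):
    """Count CGR points in a 2**k x 2**k grid (k-th order FCGR)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5
    for i, base in enumerate(sequence.upper()):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:                      # a full k-letter word has been read
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row][col] += 1
    return grid

g2 = fcgr("TTTAAAAGCTCGCGCTAAAA", 2)
total = sum(sum(row) for row in g2)
```

Each cell then holds the number of occurrences of one word of length k; here the cell in the corner of A counts the dinucleotide AA.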
The FCGR was first investigated by Deschavanne (Deschavanne et al, 1999) and later by Almeida (Almeida et al, 2001). To show the frequencies of the k-tuples, a color scheme normalized to the distribution of the frequencies of occurrence of the associated patterns is used (Joseph & Sasikumar, 2006; Oliver et al, 1993; Tavassoly, 2007a; Tavassoly, 2007b; Makula, 2009; Goldman, 1993; Cénac, 2006; Tino, 1999; reference 44). A grayscale mapping may also be used. In Fig. 9, the dinucleotide and trinucleotide frequency matrices (k = {2, 3}) are obtained for the gene F56F11.4 of C. elegans. Thus, $2^2 \times 2^2 = 16$ cells are needed for motifs of length two and $2^3 \times 2^3 = 64$ cells to count motifs of length three. The darker pixels represent the most frequently used words, whereas the lightest ones represent the least used words. CGRs have been used to display the behaviour of sub-patterns within the same input sequence and to depict oligomer composition. They form the basis for similarity and self-similarity algorithms that work differently from the traditional alignment of nucleotides.
The FCGR, however, cannot follow the evolution of frequencies from the beginning to the end of a given sequence. We therefore propose to generate signals from the FCGR. We generate the n-th order FCGR for the whole sequence, and we replace each word of length n read from the sequence by the frequency of the same sub-pattern in the FCGRn matrix. The given sequence is scanned with a sliding window of length k, which yields a set of k-frames called k-mers. For example, when k = {2, 3, 6}, we obtain the 2-mers, 3-mers and 6-mers of the sequence $S_{DNA}$. $F_k(S)$ is defined as the set of frequencies of the k-substrings that appear in the sequence S.
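The signal generation can be sketched as follows, using the example sequence the chapter gives. Two assumptions are made: the window slides by one base, and raw occurrence counts are used as frequencies (they equal the FCGR cell counts, obtained here directly with a counter instead of the CGR grid). The function name is hypothetical.

```python
from collections import Counter

def fcgr_signal(sequence, k):
    """Replace each k-mer of the sequence by its number of occurrences
    in the whole sequence (the value of its FCGR cell)."""
    seq = sequence.upper()
    words = [seq[n:n + k] for n in range(len(seq) - k + 1)]
    counts = Counter(words)
    return [counts[w] for w in words]

S = "TTTAAAAGCTCGCGCTAAAA"
sig2 = fcgr_signal(S, 2)    # FCGR2-signal
sig3 = fcgr_signal(S, 3)    # FCGR3-signal
```

For this sequence the runs of A produce the highest values in both signals, since AA occurs 6 times and AAA 4 times.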

The Fourier analysis method steps
Short-time analysis is the technique used to locate specific regions in a DNA sequence. For this purpose, the mean of smoothed Discrete Fourier Transforms is computed on a sliding window along the DNA sequence, in order to follow the evolution of the peaks at specific frequency points. The steps of the Fourier analysis algorithm are as follows. The converted DNA sequence x[n] is divided into frames of length M with an overlap Δm.
Each of these frames is in turn divided into frames of length N by multiplication with a sliding analysis window w[n], where i is the window index and Δn the overlap. The weighting w[n] is assumed to be non-zero in the interval [0, N-1]. The frame length N is chosen so that, on the one hand, the parameters to be measured remain constant within a frame and, on the other hand, there are enough samples of x[n] within the frame to guarantee a reliable determination of the frequency parameters. The choice of the windowing function influences the values of the short-term parameters: the shorter the window, the greater its influence (Mallat, 1999). We select the frame lengths N and M as powers of two so that the Fast Fourier Transform algorithm can be applied.
Each weighted block x_w[n] of the frame is transformed into the spectral domain using the Discrete Fourier Transform (DFT) in order to extract the spectral parameters X_w[k], where k is the frequency index (k ∈ [0, N-1]). The DFT is computed for each N-frame in each of the M-frames of the sequence. Using mean values, we then calculate a mean DFT for each M-frame, where i is the index of the N-frame, k is the frequency index and j is the index of the M-frame. With the values obtained, we build the matrix that represents joint time-frequency information, known as 2D or 3D DNA spectrograms. This 2D or 3D representation gives the spectrogram amplitude for a specific periodicity index at a specific nucleotide position in the chromosome.
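The construction of the spectrogram matrix can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' code: the function name is hypothetical, Δm and Δn are treated as hop sizes, a Blackman window is assumed, and the toy signal carries a periodicity of 3 only in its second half.

```python
import numpy as np

def dna_spectrogram(x, m_len, m_hop, n_len, n_hop):
    """Row j holds the mean magnitude DFT of the N-frames of M-frame j."""
    window = np.blackman(n_len)              # assumed analysis window
    rows = []
    for start in range(0, len(x) - m_len + 1, m_hop):
        segment = x[start:start + m_len]
        frames = np.array([segment[s:s + n_len] * window
                           for s in range(0, m_len - n_len + 1, n_hop)])
        rows.append(np.abs(np.fft.fft(frames, axis=1)).mean(axis=0))
    return np.array(rows)                    # shape: (M-frames, n_len)

# Toy signal: silent first half, periodicity 3 in the second half.
x = np.zeros(8192)
x[4096::3] = 1.0

S = dna_spectrogram(x, m_len=1024, m_hop=256, n_len=256, n_hop=128)
```

Column 85, the bin nearest to 256/3, stays at zero for the early rows and rises for the late ones, which is the kind of position-dependent behaviour the 2D and 3D spectrograms are meant to display.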

Results
The method has been applied to the C. elegans genome. The chromosomes were divided into parts of 1 Mbp. The M-frames have a length of 1024 bp with an overlap Δm = 256, and the N-frames within each M-frame have a length of 256 with Δn = 128. Fig. 11 presents some examples of the spectra related to each of the three coding techniques used. In this figure, we show in particular the periodicities 3 and 10, which depend closely on the coding. In order to highlight the various frequencies characteristic of an organism, the tests were carried out with various codings over segments of various sizes and various widths. Table 3 gives the percentage of trinucleotides that contribute to highlighting the characteristic frequencies 1/3, 1/6.5, 1/9 and 1/10. The table shows that C. elegans is rich in periodicities and that these periodicities are enhanced by more than three quarters of these codings. For periodicity 3 the rate exceeds 97%, followed by periodicity 10 with 90% and periodicity 9 with 85%. Periodicity 6.5 is also very marked for this organism: 70% of the codes contribute to enhancing it. It reflects the very frequent occurrence of groups of 6 bases at periodicity 6. The majority of these groups are polyA, generally associated with genes. Fig. 11 presents some spectra obtained with linear coding based on binary indicators. Each indicator contributes to the enhancement of a specific periodicity: the ttt binary indicator enhances periodicity 10, whereas the indicators tta and tgg pick up periodicity 3. The 3D spectrograms give more precise information on how the power spreads around these periodicities.
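The framing parameters above fix both the number of frames and the frequency bins in which each periodicity appears. The short calculation below is a sketch that assumes Δm and Δn act as hop sizes and that a periodicity p falls in the DFT bin nearest to N/p; the helper names are hypothetical.

```python
def n_frames(total, length, hop):
    """Number of complete frames of `length` taken every `hop` samples."""
    return (total - length) // hop + 1

def nearest_bin(period, n_fft):
    """DFT bin closest to the frequency 1/period for an n_fft-point DFT."""
    return round(n_fft / period)

m_frames_per_part = n_frames(1_000_000, 1024, 256)   # M-frames in a 1 Mbp part
n_frames_per_m = n_frames(1024, 256, 128)            # N-frames in one M-frame
bins = {p: nearest_bin(p, 256) for p in (3, 6.5, 9, 10, 10.5)}
```

With N = 256, the periodicities 10 and 10.5 fall in bins 26 and 24, only two bins apart, so peaks around these two periodicities sit close together in the spectra.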
Indeed, the peaks at these frequency locations have different power values (Fig. 12). The 3D spectrogram adds a third element to the 2D representation: in addition to the localization of the periodicities in the segment, we visualize the power associated with each peak. We can distinguish between the peaks found in all the segments for a given periodicity (10 and 3) and the peaks that are present only in certain segments and were eliminated by the averaging. Fig. 12 shows that for the binary indicators 'AA', 'TT', 'AAA' and 'TTT', the peaks around the frequency 1/10.5 are very pronounced. Varying the view angle shows that the peaks are locally spread within the chromosome part. It has been demonstrated in the literature, by both biochemical and signal processing studies, that the periodicity 10.5 related to the nucleosomes varies. Accordingly, these figures show on the one hand that there are peaks around this periodicity and, on the other hand, that the peaks are spread over specific regions of the chromosome.
Fig. 13 shows the spectrograms obtained after Pnuc coding. The analysis breaks the chromosome, made up of 15.2 Mbp, into 15 parts of 1 Mbp. We find the periodicity localized at the ends. Indeed, the periodicity peaks are missing or have very weak power in the stretch from 6 Mbp to 12 Mbp: the periodicity is not localized at the centromere but around the ends of the helix. We find it from position 13 Mbp to the end. In the parts where it exists, it is not continuous but localized in specific intervals.
In Fig. 14, the mean-value technique based on the smoothed Discrete Fourier Transform is applied along parts 6, 9 and 13 of chromosome 1 of C. elegans. The 1D, 2D and 3D plots show that coding with FCGR2 reveals the presence of both the 10.5 and 3 periodicities. Around each of these periodicities, the peaks are spread with values that differ from part to part: each part has its own specificity. In part 9 (subfigure a), periodicities 3 and 10 barely emerge from the frequency behaviour, with peaks of modest values. In part 6 these periodicities behave in the same way; the distinctive feature is the presence of horizontal peaks around location 750 in this part. Part 13, by contrast, is rich in periodicities 10 and 12 and poor in periodicity 3.
For coding with FCGR3 (Fig. 15), the most pronounced peaks correspond to the 10.5 periodicity; just to the left, other peaks appear around the frequency 0.11, which corresponds to the periodicity 9; to the right, a few peaks occur around the frequency 1/12. The periodicity 3 disappears in the majority of the parts and, when it appears, it is present only in a few areas with very low amplitudes. In Fig. 15, the frequency behaviour differs between the three parts represented. The periodicity 10 is more pronounced in part 16 (subfigure b) than in part 9 (subfigure a) and part 10 (subfigure c).
As for the hexamer coding (FCGR6), we find that it enhances the frequency 1/10.5; in rare zones the frequency 1/12 is also observed (Fig. 16). This coding technique clearly enhances the periodicity 10 and its neighbours, in contrast to periodicity 3. The three parts show different aspects of the distribution of the periodicities. In part 9 (subfigure a), the peaks are spread over a "large" frequency band around the periodicity. The band is narrower in part 16 (subfigure b), where it is concentrated on two frequencies, and the power is grouped at a single frequency in part 12 (subfigure c). A peak around the frequency 1/4, near position 2500, corresponds to a satellite (Fig. 16, subfigure a). This frequency derives from repetitions of certain dinucleotides in the area.
The spectrogram reveals the presence of a satellite with multiple frequencies; it appears clearly in the 3D graph as horizontally aligned peaks, colored in red, at the higher frequencies.

Conclusion
In this chapter, we investigate the contribution of each coding technique (linear, two-dimensional and structural) to the enhancement of the peaks related to the periodicities of the C. elegans genome. For this purpose, we use the mean of smoothed Discrete Fourier Transforms applied on a sliding window along the DNA sequence to follow the evolution of the peaks at specific frequency points. We detect periodicities around 3, 6, 9 and 10, and find that periodicities 3 and 10 are related respectively to genes and to the positions of the nucleosomes. First, we evaluate how the frequencies spread through the chromosomes with a 1-D spectrum. Second, we consider the 2-D and 3-D DNA spectrograms to visually detect the specific parts of chromosomes related to protein coding regions, nucleosome positioning regions and other particular regions.
The time-frequency analysis made it possible to follow the evolution of the periodicities. We studied the contribution of a range of binary indicators to enhancing the peak frequency of the exons. We also studied the localization of the areas able to form nucleosomes. With the two-dimensional spectrogram, we visualized that the areas corresponding to periodicity 10 are localized at the ends and not in the center of the helix. The three-dimensional spectrogram showed that the enhanced peaks do not correspond to the periodicity 10 itself; rather, in certain sequences and for some indicators, we clearly see two lines of peaks of variable power around this periodicity. This result can explain the variation between 10 and 10.7 of the nucleosome-associated periodicities reported in the literature. These peaks can also be seen to alternate between two periodicities; this result could be associated with the phenomenon of chromatin compaction.

Fig. 1. The protein synthesis steps
In a DNA sequence, electron microscopy and biochemical studies have established that the bulk of the chromatin DNA is compacted into repeating structural units, named nucleosomes.

Fig. 2. Chromatin and nucleosome structure
In order to study the signals of the protein coding regions and of the nucleosome regions, the symbolic DNA data must be converted into DNA signals.

Fig. 4. A description of the double helix shows the overlapping of the planes
Fig. 6 illustrates the STFT method applied to this resulting signal. First, subfigure a shows a mean spectrum for distinct windows of length $5 \times 10^5$. The spectrum obtained needs smoothing, so in subfigure b a Blackman smoothing window is applied to each signal part before calculating the mean spectrum of equation 7. In subfigures c and d, equation 8 is used with the following parameters: Blackman window, $M = 5 \times 10^5$, $N = 5 \times 10^4$ and 50% overlap for subfigure c, and $N = 5 \times 10^3$ with 10% overlap for subfigure d. The figure shows that averaging and smoothing are very effective in obtaining the cleanest signal (subfigure d).

Fig. 9. The FCGR2 (k=2) and the FCGR3 (k=3) for the gene F56F11.4 of C. elegans
Generating signals from FCGRs is a good way to capture such variability. To this end, a new 1D graphical representation of DNA sequences is introduced, which provides useful insights into local and global characteristics of genomic sequences. This novel DNA coding algorithm consists in computing the k-th order FCGR for the whole sequence and then assigning to each word of length k in the sequence the value of the corresponding frequency. This allows us to follow the evolution of the frequencies along a given sequence. The plot obtained is called the FCGRk-signal. Let us consider the sequence $S_{DNA}$ = 'TTTAAAAGCTCGCGCTAAAA'.

Fig. 10 also shows the slightly high concentration of AA and AAA motifs in FCGR2 and FCGR3, which is expressed by the high-rise blocks in the corresponding signals. A spectral analysis is applied to the signals obtained in order to detect the global frequency behaviour in the spectrum of each C. elegans chromosome.

Fig. 11. Examples of spectra and spectrograms generated with a mean-value technique based on the smoothed Discrete Fourier Transform applied on a sliding window along the DNA sequence parts of the C. elegans genome. Two coding methods are used: the linear coding technique (binary indicator) (subfigure a) and the structural coding technique (Pnuc) (subfigure b)

Fig. 12. 3-D spectrograms for binary indicator coding, divided into 4 subfigures. Each one adds to the 2D spectrograms the power values and the locations of the enhanced periodicities. Subfigures (a) and (b) relate to chromosome 2 of C. elegans, whereas subfigures (c) and (d) concern chromosome 3.

Table 1. Codon associated to each base

Table 2. The Pnuc table

Table 3. The proportion of contribution of the trinucleotides in highlighting the various characteristic frequencies