DNA Sequences with Forbidden Words and the Generalized Cantor Set
1. Introduction
Researchers have been interested in the relationships between fractals and DNA structures for years. Recently, Anitas and Slyamov [1] studied multiscale fractals representing DNA sequences using small-angle scattering analysis. Cattani and Pierro [2] conducted a multifractal analysis of binary images of DNA in order to define a methodological approach to the classification of DNA sequences. Badea and her collaborators [3] characterized the geometry of some medical images of tissues in terms of complexity parameters such as the fractal dimension (FD). Carlo Cattani presented an analysis of DNA based on the indicator matrix, together with an elementary approach to a fractal estimate of DNA sequences, in the book [4] edited by Elloumi and Zomaya. Albrecht-Buehler [5] explicitly identified the GA-sequences as a class of fractal genomic sequences. Ainsworth [6] investigated how the cell’s nucleus holds the molecules that manage human DNA in the right location. In a book edited by Crilly, Earnshaw and Jones, Voss applied standard spectral density measurement techniques to demonstrate the ubiquity of low-frequency noise and long-range fractal correlations.
The study of genome or DNA sequences through fractal analysis is of considerable interest. DNA sequences can be seen as sequences over the alphabet $\Sigma = \{A, C, G, T\}$. Subsequences that do not appear in a DNA sequence are considered forbidden words. Since 2000, B.-L. Hao has developed a visualization method for forbidden words [7] [8] [9] [10] [11], now called Hao’s frame representation. Recently, C.-X. Huang and S.-L. Peng discussed this method in detail and provided many graphics in [12] [13]. These pictures suggest that the forbidden words exhibit certain fractal properties. In this work we generate fractal graphs associated with DNA sequences with forbidden words, as shown in Figure 1.
It is important to explore the fractal generating mechanism associated with the forbidden words in a sequence. H. J. Jeffrey [14] [15] and P. Tiňo [16] [17] associated the forbidden words with iterated function systems (IFS) using the chaos game algorithm. Denote by $\Sigma^*$ the set of all finite sequences over $\Sigma$. The question is then how to find a generating formula, a mapping defined on the sequences $w$ that contain no forbidden subsequences, or a corresponding iteration method. As P. Tiňo pointed out, the IFS is a multifractal, and therefore the generating formula would be relatively complicated.
To detect the structure of symbolic sequences, one has to determine their topological and metric properties and be able to visualize the sequences. This requires a graphical representation that carries these topological and metric properties, so that the corresponding fractal graphs can be revealed directly. Such a representation method is therefore essential.
For an alphabet of cardinality 3, the well-known chaos game representation (CGR) method was first introduced by M. F. Barnsley, who considered points in an equilateral triangle; the substrings of a string are shown graphically (see [18]). For the four-letter DNA alphabet $\Sigma = \{A, C, G, T\}$, the CGR method was later generalized by H. J. Jeffrey so that DNA sequences can be visualized (see [14] [15]). DNA sequences have also been transformed into pseudo-random walks in the 2-dimensional plane or in 3-dimensional space [19] [20] [21]. An iterated function system can likewise be applied to construct a graphical representation of DNA sequences [16] [17]. The points in the unit square $[0,1] \times [0,1]$ can be used to denote the substrings of the DNA sequences. Accordingly, the four vertices of the unit square are labelled $A$, $C$, $G$, $T$.
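The chaos game construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assignment of letters to corners in `CORNERS` and the starting point at the centre of the square are hypothetical choices (conventions differ between papers), and the function name `cgr_points` is introduced here for convenience.

```python
# Assumed corner labelling of the unit square (hypothetical; other
# conventions exist in the literature).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the chaos game points visited while reading seq.

    Each step moves the current point halfway towards the corner
    labelled by the next letter.
    """
    x, y = start
    points = []
    for ch in seq:
        cx, cy = CORNERS[ch]  # raises KeyError for letters outside A, C, G, T
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

After reading a prefix of length k, the current point lies in the subsquare of side length 2^-k determined by the last k letters, which is why strings that never occur leave empty regions in the plot.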
In applications, the frame representation method proposed by Hao et al. is more intuitive and visual [9] [10]. The unit square $[0,1] \times [0,1]$ is divided equally by vertical and horizontal lines into $4^k$ congruent small squares with side length $2^{-k}$ and area $4^{-k}$. For the alphabet $\Sigma$ with cardinality 4, each small square of side length $2^{-k}$ denotes one string in $\Sigma^k$ in a regular pattern (see the 1-, 2- and 3-frame graphs in Figure 2(a)-(c)).
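As a rough sketch of the k-frame idea, the following Python code maps every length-k string to one cell of a 2^k by 2^k grid and counts occurrences in a sequence; cells that stay at zero correspond to missing strings. The quadrant layout in `QUADRANT` is an assumption made for illustration and is not taken from the figures of this paper; the names `frame_cell` and `frame_counts` are likewise hypothetical.

```python
# Assumed quadrant layout (row bit, column bit) for each letter; the layout
# used in a given paper's figures may differ.
QUADRANT = {"G": (0, 0), "C": (0, 1), "A": (1, 0), "T": (1, 1)}


def frame_cell(word):
    """Return the (row, col) cell of the 2^k x 2^k grid assigned to word."""
    row = col = 0
    for ch in word:
        r, c = QUADRANT[ch]
        # Each letter refines the current cell into one of its four quadrants.
        row = 2 * row + r
        col = 2 * col + c
    return row, col


def frame_counts(seq, k):
    """Count every length-k substring of seq in its frame cell."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        r, c = frame_cell(seq[i:i + k])
        grid[r][c] += 1
    return grid
```

For example, `frame_counts("GGT", 2)` marks only the cells of the words GG and GT; all other cells of the 4 by 4 grid remain zero.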
Figure 1. Graphs of some forbidden words.
Figure 2. The frame representation method of B.L. Hao et al. (a) 1-frame graph; (b) 2-frame graph; (c) 3-frame graph.
With the frame representation method of B.-L. Hao, the repetitive topological structure of the subsequences (i.e. the strings in $\Sigma^k$) of a DNA sequence can be easily visualized and efficiently drawn. The avoided or under-represented short strings in the genome sequence form the forbidden words. These forbidden words are the basis of the constructed fractals.
P. Tiňo [16] [17] proved the equivalence of the CGR method and the frame representation method of B.-L. Hao et al. He noted that the cardinality of the alphabet can be generalized to a square integer ($b^2$ for some integer $b$). In this paper we extend the above methods and relax this restriction on the cardinality of the alphabet.
This paper is organized as follows. In Section 2, we first convert the problem into a discussion of a certain type of generalized Cantor set, which corresponds naturally to multifractals. In Section 3, we derive Hao’s frame representation from the one-to-one correspondence between the line segment and the unit square [22]. Several examples of generalized Cantor sets, along with their fractal graphs, are given at the end of the paper.
2. Forbidden Words and the Generalized Cantor Set
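Rewrite the alphabet as $\Sigma = \{0, 1, 2, 3\}$. We first give the following definition.
Definition 2.1 Let $k \ge 1$ and $l \ge 1$ be integers. Denote by $B$ the set consisting of $l$ finite sequences of length $k$ over $\Sigma$:
$$B = \{b^{(1)}, b^{(2)}, \dots, b^{(l)}\} \subset \Sigma^k. \quad (1)$$
Then the infinite sequences over $\Sigma$
$$w = w_1 w_2 w_3 \cdots, \qquad w_{n+1} w_{n+2} \cdots w_{n+k} \notin B \ \text{for all } n \ge 0, \quad (2)$$
are called DNA sequences with no forbidden words $B$, a.k.a. allowed sequences.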
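To make the notion of an allowed sequence concrete, here is a minimal Python sketch that tests finite prefixes. It assumes the alphabet is written with the digits 0, 1, 2, 3 and that the forbidden set B is given as a collection of digit strings; the function names `is_allowed` and `allowed_words` are hypothetical helpers, not notation from the paper.

```python
from itertools import product


def is_allowed(word, forbidden):
    """True if no forbidden word occurs as a contiguous block of word."""
    return not any(b in word for b in forbidden)


def allowed_words(n, forbidden, alphabet="0123"):
    """List all words of length n over alphabet that avoid forbidden."""
    words = ("".join(t) for t in product(alphabet, repeat=n))
    return [w for w in words if is_allowed(w, forbidden)]
```

With the single forbidden word 11, the word 0123 is allowed while 0113 is not, and 15 of the 16 words of length 2 remain. An infinite sequence is allowed exactly when every finite prefix passes this test.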
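It is known that when $x \in [0,1]$ is expanded in ternary representation, the subset of $[0,1]$ consisting of the numbers that admit an expansion without the digit 1 is called the Cantor set. Similarly, with quaternary expansion, we give the following definition.
Definition 2.2 When $x \in [0,1]$ is represented in quaternary expansion
$$x = \sum_{n=1}^{\infty} \frac{x_n}{4^n}, \qquad x_n \in \{0, 1, 2, 3\}, \quad (3)$$
we call
$$C_G = \{\, x \in [0,1] : x_{n+1} x_{n+2} \cdots x_{n+k} \notin B \ \text{for all } n \ge 0 \,\} \quad (4)$$
the generalized Cantor set.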
Evidently, the discussion of DNA sequences (2) that contain no forbidden words $B$ in (1) can be converted into a discussion of the generalized Cantor set $C_G$.
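The passage from symbolic sequences to points of the unit interval can be illustrated with exact rational arithmetic. The sketch below assumes digits are given as a string over 0, 1, 2, 3; the names `quaternary_value` and `cylinder_interval` are introduced here for illustration only.

```python
from fractions import Fraction


def quaternary_value(digits):
    """Value of the finite quaternary expansion 0.d1 d2 ... dn."""
    return sum(Fraction(int(d), 4 ** (i + 1)) for i, d in enumerate(digits))


def cylinder_interval(digits):
    """Interval of all x in [0, 1] whose expansion starts with digits."""
    left = quaternary_value(digits)
    return left, left + Fraction(1, 4 ** len(digits))
```

For instance, the prefix 13 corresponds to the value 7/16 and to the interval from 7/16 to 1/2, so excluding a forbidden word amounts to removing such intervals at every scale.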
Let
,
,
, and
(5)
Then, the condition
in Definition 2.2 can be rewritten as
Theorem 2.1 The generalized Cantor set $C_G$ can be constructed by an iteration method.
Proof. In fact, for the
th step of the quaternary expansion of
, there is
(6)
Let
(7)
Substitute (7) into (6),
(8)
In general, we let
(9)
and as
, we obtain the generalized Cantor set
(2.2).
,
are
intervals in
with length
. From the iteration Equation (7) in the theorem, the iteration acts differently on the l subintervals than on the
intervals. Hence we have [11] .
Corollary 2.3 The generalized Cantor set $C_G$ is multifractal.
Proof. In the construction of the generalized Cantor set $C_G$, the measure on the removed portions is repeatedly redistributed to the neighboring sections. Thus $C_G$ is multifractal.
Clearly, the construction of generalized Cantor sets applies equally to any base-$p$ expansion ($p \ge 2$ an integer).
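The iterative construction can be sketched as a digit-by-digit refinement: at each step every surviving interval is split into `base` equal parts, and the parts whose digit address ends in a forbidden word are discarded. This is a minimal illustration and not the authors' iteration formula; the function name `surviving_intervals` and the restriction to bases up to 10 (single-character digits) are assumptions of the sketch.

```python
from fractions import Fraction


def surviving_intervals(forbidden, depth, base=4):
    """Intervals of length base**-depth whose address avoids forbidden.

    Digits are the characters 0..base-1, so base must be at most 10.
    """
    words = [""]
    for _ in range(depth):
        # Kept words are already free of forbidden words, so after
        # appending one digit only a forbidden suffix can appear.
        words = [w + str(d) for w in words for d in range(base)
                 if not any((w + str(d)).endswith(b) for b in forbidden)]
    scale = Fraction(1, base ** depth)
    lefts = [int(w, base) if w else 0 for w in words]
    return [(n * scale, (n + 1) * scale) for n in lefts]
```

With `base=3` and the single forbidden word 1, two steps give the four intervals of the classical middle-thirds Cantor construction; with `base=4` and the forbidden word 11, 15 of the 16 intervals survive at depth 2.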
3. Hao’s Frame Representation of the Generalized Cantor Set $C_G$
The theoretical foundation of the construction of DNA sequences can be found in [12]. The subintervals in the quaternary expansion of $x \in [0,1]$ can be put in one-to-one correspondence with the subsquares obtained by repeatedly dividing the unit square (and its subsquares) into 4 equal smaller subsquares. Cantor sets are constructed in one dimension in $[0,1]$, while Sierpinski sets are constructed in two dimensions within $[0,1] \times [0,1]$. Using this correspondence between the unit interval and the unit square, we can convert the discussion of generalized Cantor sets into a discussion of generalized Sierpinski sets on the unit square.
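One possible way to realize the correspondence between quaternary subintervals and subsquares is to split each quaternary digit into two binary digits, one for each coordinate. The sketch below assumes that the high bit refines the x coordinate and the low bit refines the y coordinate; this particular assignment is an illustrative assumption, and the function name `interval_to_square` is hypothetical.

```python
from fractions import Fraction


def interval_to_square(digits):
    """Map a quaternary prefix to (x, y, side) of its subsquare.

    (x, y) is the lower-left corner; side is the side length 2**-len(digits).
    """
    x = y = Fraction(0)
    side = Fraction(1)
    for d in digits:
        q = int(d)
        side /= 2
        x += (q // 2) * side  # assumed: high bit selects the x half
        y += (q % 2) * side   # assumed: low bit selects the y half
    return x, y, side
```

Under this assumption the prefix 12 is sent to the subsquare with lower-left corner (1/4, 1/2) and side 1/4, and distinct prefixes of the same length are sent to distinct subsquares.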
Let
. The binary expansion of
is
(10)
The expansion can be related to the quaternary expansion of
as follows:
(11)
Thus the forbidden words
in
can be represented as
(12)
Definition 3.1 Let
and the binary expansion of
is (10). Then call
(13)
the generalized Sierpinski set that corresponds to the generalized Cantor set $C_G$.
Theorem 3.1 The generalized Sierpinski set can be constructed by an iteration method.
Proof. The
th binary expansion of
is
(14)
Let
(15)
Substitute (15) into (14), we have
Generally, let
Taking into account the correspondence between numbers and subsquares, we naturally obtain Hao’s frame representation. The second-order Hao’s frame representation can be derived from the correspondence illustrated in Figure 3.
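A finite-depth picture of a generalized Sierpinski set can be produced by combining the two ideas: enumerate the quaternary words that avoid the forbidden set, then mark the subsquare of each word. The following sketch is an illustration under the assumption that digit q is placed in row `q // 2` and column `q % 2` of each 2 by 2 subdivision; the actual layout in Figure 3 may differ, and `sierpinski_grid` is a hypothetical name.

```python
def sierpinski_grid(forbidden, depth):
    """Return a 2**depth x 2**depth 0/1 grid; 1 marks a surviving subsquare."""
    size = 2 ** depth
    grid = [[0] * size for _ in range(size)]
    words = [""]
    for _ in range(depth):
        # Extend each allowed word by one digit and drop words that now
        # end in a forbidden word.
        words = [w + d for w in words for d in "0123"
                 if not any((w + d).endswith(b) for b in forbidden)]
    for w in words:
        row = col = 0
        for d in w:
            q = int(d)
            row = 2 * row + q // 2  # assumed row bit
            col = 2 * col + q % 2   # assumed column bit
        grid[row][col] = 1
    return grid
```

For example, forbidding the single digit 3 removes one quadrant at every scale, so `sierpinski_grid({"3"}, 1)` is `[[1, 1], [1, 0]]` and the depth-2 grid keeps 9 of its 16 cells, a pattern resembling the Sierpinski triangle.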
The next few examples illustrate the analytic structure of some DNA sequences, along with the fractal graphs of the relevant generalized Cantor sets.
Example 3.2 Let
,
. Then
. Hence the arithmetic expression of the generalized Cantor set is
And the symbolic sequence is
which is shown graphically in Figure 4.
Example 3.3 Let
,
. Then
. Hence the arithmetic expression of the generalized Cantor set is
And the symbolic sequence is
which is shown graphically in Figure 5.
Example 3.4 Let
,
. Then
. Hence the arithmetic expression of the generalized Cantor set is
And the symbolic sequence is
which are shown below
Similarly, we can produce the fractal graphs shown in Figures 6 and 7 for different DNA sequences with various forbidden words.
Figure 3. Hao’s frame representation of
.
Figure 6.
.
4. Conclusion
We have established relations between generalized Cantor sets and DNA sequences with missing words, and we have associated Hao’s frame representation and the generalized Sierpinski set with the generalized Cantor set. The authors are interested in applying this analytical representation method to study the graphical results of space-filling research (cf. [23] [24] [25]).