Diagnosis of skin cancer by correlation and complexity analyses of damaged DNA.

Skin cancer is a common, low-grade cancerous (malignant) growth of the skin. It starts from cells that begin as normal skin cells and transform into those with the potential to reproduce in an out-of-control manner. Cancer develops when DNA, the molecule found in cells that encodes genetic information, becomes damaged and the body cannot repair the damage. A DNA walk of a genome represents how the frequency of each nucleotide of a pairing nucleotide couple changes locally. In this research in order to diagnose the skin cancer, first DNA walk plots of genomes of patients with skin cancer were generated. Then, the data so obtained was checked for complexity by computing the fractal dimension. Furthermore, the Hurst exponent has been employed in order to study the correlation of damaged DNA. By analysing different samples it has been found that the damaged DNA sequences are exhibiting higher degree of complexity and less correlation compared to normal DNA sequences. This investigation confirms that this method can be used for diagnosis of skin cancer. The method discussed in this research is useful not only for diagnosis of skin cancer but can be applied for diagnosis and growth analysis of different types of cancers.


INTRODUCTION
Skin cancer is a type of cancer that arises from skin. It is the most common form of cancer, globally accounting for at least 40% of cases [1]. It is especially common among people with light skin. Skin cancer is due to the development of abnormal cells that have the ability to invade or spread to other parts of the body. There are three main types: basal cell cancer (BCC), squamous cell cancer (SCC) and melanoma. The most common type is non-melanoma skin cancer, which occurs in at least 2-3 million people per year. Of non-melanoma skin cancers, about 80% are basal cell cancers and 20% squamous cell cancers. Basal cell and squamous cell cancers rarely result in death [2].
One of the main reasons that skin cancer develops is because the DNA is damaged. DNA is the master molecule that controls and directs every cell in the body. Damage to DNA is one of the ways that cells lose control of growth and become cancerous. DNA mutations can also be inherited.
During years few methods have been investigated in order to diagnose the skin cancer. These methods are mainly categorized in two types which are skin biopsy, and image analysis [3]. In case of skin biopsy doctor takes a sample of skin from the suspicious area to be looked at under a microscope. Different methods can be used for a skin biopsy. The doctor will choose one based on the size of the affected area, where it is on the patient's body, and other factors.
In case of image analysis of damaged skin using computers, special features, like particular colours, colour variation and texture are analysed to search for the sign of cancer. In fact, image analysis methods are based on mathematics. For instance, Segura et al. performed a systemic analysis of melanocytic and nonmelanocytic skin tumors, using dermatoscopy, RCM, and histopathology to develop a two-step method for melanoma diagnosis based on RCM features for use as an adjuvant to dermatoscopy [4]. In another extensive work Stoecker et al. analyzed asymmetry, as a critical feature in the diagnosis of malignant melanoma, using a new algorithm to find a major axis of asymmetry and calculate the degree of asymmetry of the tumor outline [5]. See also [6][7][8][9].
Thermal image analysis of damaged skin can be stated as a special type of image analysis. In this category limited works have been reported in literatures. Flores-Sahagun et al. proposed a structured methodology for analysis and diagnosis of basal cell carcinoma (BCC) via infrared imaging temperature measurements. They concluded that their conjugated gradients method was efficient to identify lesioned tissue (which was associated to basal cell carcinoma through a clinical exam together with skin biopsies) in all patients studied even with the use of a camera of low optical resolution (160 × 120 pixels) and thermal resolution of 0.1 °C [10]. Poljak-Blazi et al. also employed infrared thermal imaging for evaluation of the tumour development and discrimination of cancer from inflammation and haematoma [11]. See also [12].
Fractals are defined to be scale-invariant (selfsimilar or self-affine) geometric objects. A geometric object is called self-similar if it may be written as a union of rescaled copies of itself, with the rescaling isotropic or uniform in all directions. Regular fractals display exact self-similarity. Random fractals display a weaker, statistical version of self-similarity or, more generally, self-affinity. Although virtually all natural fractals are random, the concept of self-similarity is best first explored through the study of regular fractals.
The class of regular fractals includes many familiar simple objects such as line intervals, solid squares, and solid cubes, and also many irregular objects. The scaling rules are characterized by "scaling exponents" (dimension). "Simple" regular fractals have integer scaling dimensions. Complex self-similar objects have non-integer dimension. Therefore, it is completely incorrect to define fractals as geometric objects having "fractional" (non-integer) dimension. Fractals may be defined as geometric objects whose scaling exponent (dimension) satisfies the Szpilrajn inequality: where ‫ﬡ‬ is the scaling exponent (dimension) of the object and D T is its topological dimension, i.e., Euclidean dimension of units from which the fractal object is built. For example, in case of Brownian motion: the path of a particle, a line of dimension one, traveling for a long time over a plane region, eventually covers the entire plane, an entity of dimension two [13]. A multi-fractal system is a generalization of a fractal system in which a single exponent (the fractal dimension) is not enough to describe its dynamics; instead, a continuous spectrum of exponents (the so-called singularity spectrum) is needed.
There are limited works which have employed fractal dimension under image analysis techniques in order to analyse the skin cancer. Mastrolonardo et al. introduced the new technique of the variogram and of fractal analysis extended to the whole regions of interest of skin in order to obtain parameters able to identify the malignant lesion [14]. In another work Hall calculated fractal dimensions to represent border irregularity for early detection of melanoma [15]. In a similar work Piantanelli et al. investigated the fractal properties of skin pigmented lesion boundaries [16]. Ng and Coldman focused on using fractal concept in measuring the fuzziness of a mole. In order to overcome the problem of separation of a mole from its surrounding skin in application of variation method and the correlation method, they employed two different methods which manipulated the intensities around the border of a mole. The first one calculated the size of the intensity surface area at different scales and the second method used the average absolute intensity difference of pixel pairs to obtain normalized fractional feature vectors [17]. See also [18].
In spite of all of these works, no work has been reported which analyses the complexity and correlation of damaged DNA through analysis of DNA walks. In this paper we use the concept of fractal dimension and the Hurst exponent in order to analyse the DNA sequences. In order to do this task we illustrate DNA walk as a random walk and by introducing the fractal dimension and Hurst exponent we compute these parameters for DNA walks which extracted from DNA sequences of healthy subjects and patients with skin cancer. The complexity and correlation of patients' DNA walk are discussed in details.

RESULTS
In this section we compute the Hurst exponent and fractal dimension for DNA walks in case of healthy subjects and subjects with skin cancer, and then compare the results for diagnosis of skin cancer.
In order to make a clear comparison, the grand average of the Hurst exponent plots for all of 60 healthy subjects versus the grand average for all of 60 subjects with skin cancer is shown in Figure 1.
As it can be seen in this figure, the overall behaviour of the Hurst exponent variations in case of healthy subjects and also subjects with skin cancer is decreasing behaviour as its value tend to H = 0.5. This behaviour stands for the fact that memory of DNA walk is decreasing in the genome. But as it is clear, in case of damaged DNA, variations of the Hurst exponent show steeper behaviour than variations belong to normal DNA walks.
The small upward deflections seen in both curves stand for the small increases in the genome's memory.
It is clear that by decreasing the value of H (getting closer to 0.5) and accordingly the memory of genome, the predictability of DNA walk is decreasing. But in case of skin cancer DNA walk, the memory and predictability of DNA walks is decreasing faster than normal DNA walks which stands for the fact that the damaged DNA is less able to store information and increases its memory.
Another difference between two curves in Figure 1 can be seen in the values of the Hurst exponent, where in average the Hurst exponent in case of damaged cells has smaller values that are closer to H = 0.5 compared to normal DNA. This stands for the fact that there is less correlation in the damaged DNA walk compared to normal DNA walk. The averaged value of the Hurst exponent variations for subjects with skin cancer was computed as 0.579 which is smaller than the computed value for healthy subjects which is 0.715.
In order to compare mean of the Hurst exponent values in case of each sample we compute 95% confidence intervals in case of healthy subjects and subjects with skin cancer and then determine whether the intervals overlap. As it is known when 95% confidence intervals for the means of two independent populations don't overlap, there will indeed be a statistically significant difference between the means (at the 0.05 level of significance). Figure 2 shows the computed confidence intervals.
In order to analyse the complexity of DNA walk in case of normal and damaged DNA walks, the fractal dimension spectra of DNA walks are discussed here. The grand average of the spectra of fractal dimension plots for all of 60 healthy subjects versus the grand average for all of 60 subjects with skin cancer is shown in Figure 3.
As it can be seen in this figure, the overall behaviour of the Fractal dimension variations in case of healthy subjects and subjects with skin cancer is increasing behaviour. This behaviour stands for the fact that complexity of DNA walk is increasing in the genome. But as it is clear, in case of damaged DNA, variations of the fractal dimension show steeper behaviour than variations belong to normal DNA walks.
The small downward deflections seen in both curves stand for the small decreases in complexity of DNA because of small increases in the genome's memory. By increasing the fractal dimension's value, the predictability of DNA walk is decreasing as the DNA is getting more complex. But in case of damaged cells, the complexity of DNA walks is increasing faster than normal DNA walks.
Another difference between two curves can be seen in the values of the fractal dimension, where in case of damaged DNA the fractal dimension has bigger values compared to normal DNA and this stands for the fact that the damaged DNA walk is more complex compared to normal DNA walk. The averaged value of the fractal dimension variations for 60 subjects with skin cancer was computed as 1.417 which is bigger than the computed value for healthy subjects which is 1.278.
In order to compare the mean of fractal dimension values in case of each sample we compute 95% confidence intervals in case healthy subjects and subjects with skin cancer and then determine whether the intervals overlap. Figure 4 shows the computed confidence intervals.
As it is clear in the Figure, confidence intervals in case of healthy subjects (red bar) with the variation 1.2768 ≤ Y ≤ 1.2792 and subjects with skin cancer (green bar) with the variation 1.4155 ≤ Y ≤ 1.4185 don't overlap, which means they are necessarily significantly different. So this result stands for the significant difference between the fractal dimensions values in case of two groups of subjects. All the analyses which have been done in this research showed that by computing the values of the Hurst exponent and fractal dimension we are able to diagnose the damaged DNA's as they show more complexity and less correlation compared to normal DNA's.

DISCUSSION
In this paper we worked on diagnosis of skin cancer by analysing the damaged DNA. By defining the Hurst exponent and fractal dimension we talk about correlation

DNA and random walk
A DNA sequence is a four-letter (A, C, G, T) text where A, C, G and T stand for the bases adenine, cytosine, guanine and thymine respectively. An example of DNA sequence is …GTGATAGGGTCTCACTCTGT… This sequence in letter can be converted to a quaternary number sequence by changing T into 0, A into 1, C into 2, and G into 3, such as …30310133302021202030….
This genomic sequence is what is contained in the whole set of chromosomes in the nucleus of a single cell. It is a remarkable phenomenon that DNA sequence contained in a cell dictates development of a complete, mature organism from one single cell. Scientists have attempted to decipher the structure and meaning of DNA sequences; however, consensus has not been reached and opinions are diverged.
Various mathematical methods have been applied to investigate the nature of DNA sequences. The chaos game representation of DNA sequences has been reported to produce a unique pattern consistently over different parts of the genome of an organism. From the image generated from the chaos game, characteristics of a DNA sequence can be studied, such as finding association between two letters.
A popular method to graphically portray the genetic information stored in DNA sequences is to use the socalled "DNA walk" representation [19]. DNA Walk is a vectorial representation of DNA sequences transformed into a planar trajectory. It consists first in converting the DNA text into a binary sequence by coding at a given nucleotide positions and at other positions, and then in defining the graph of the DNA walk by the cumulative variables.
A prevalent method for DNA analysis is related to random walk or Brownian motion which led to the discovery of long-range correlation in DNA sequences. The motion of Brownian particle consists of steps of movement in a characteristic length in a random direction; thus, it's also called a random walk. Suppose the particle moves on the x-axis by jumping +ξ or -ξ every τ seconds, then its movement can be plotted as time proceeds.

Figure 4: Comparison of confidence interval for means of Fractal dimensions.
Likewise, DNA sequence can be plotted in a form of timeseries, but the x-axis represents an array of DNA sequence instead of time [19]. This way, the profile of letters can be preserved along the sequence.
DNA sequence can be defined by two of six possible combinations which are purines (A+G), pyrimidines (C+T), Imino (A+C), Keto (G+T), Weak (A+T) and Strong (G+C). Combination of pyrimidine tract with purines tract is long known for analysis of DNA [20]. In this research we chose this combination as it helps for better detection of the long dependence property in DNA sequences (look at [19]). Figure 5 shows map of whole genomic DNA sequences following purine-pyrimidine binary rule: change purines (A/G) to -1 and pyrimidines (C/T) to +1. This creates a one dimensional 'DNA walk' along the genome.
In the next section by introducing the Hurst exponent we discuss about the correlation of the random walks and in special case the DNA walks.

The Hurst exponent and type of motion
In order to analyse the behaviour of a DNA walk, the direction of fluctuation (deflection) from one point to the next point and in a bigger view the correlation of walk should be considered. This behaviour can be studied by computing a time varying parameter, called the Hurst exponent. The Hurst exponent is an indicator of the long term memory of the process and thus, it is the measure of the predictability of the DNA walk.
The Hurst exponent can have any value between 0 and 1, where the value that it gains in each moment determines the behaviour of the next deflection in the signal.
The 'DNA walker' moves either up or down at every base pair according to the binary map of the DNA sequence. If there is no long-range correlation, the walk is a realization of a Brownian motion. Otherwise, we observe a 'walker' with long-term memory and thus a Fractional Brownian motion. Those two processes can be characterized by different values of the Hurst exponent (H). H = 0.5 for Brownian motion and 0 < H < 1 · (H ≠ 0.5) for Fractional Brownian motion.
In case of Brownian motion, H = 0.5, the process is considered to be truly random (e.g., Brownian motion). It means that there is absolutely no correlation between any values of the process and it is hard to predict the future of process. The analysis of the Hurst exponent for fractional Brownian motion can be categorized in two ranges. Firstly, if the Hurst exponent has a value between 0 and 0.5, it means that the process is anti-persistent i.e. the trend of the process at the next instant will be opposite to the trend in the previous instant. Secondly, a value of H between 0.5 and 1 means that the process is persistent i.e., the trend of the process at the next instant will be the same as the trend in the previous instant.
In this research we compute the value of the Hurst exponent for damaged DNA walk and compare its values with the normal DNA walk. This comparison helps us in to find out about the correlation and predictability of damaged versus normal DNA walk. There are different methods which have been developed to estimate the value of H. Rescaled Range Analysis (R/S) and DFA are two mostly used methods of the Hurst exponent estimation. By the initial analysis of the computed Hurst exponent of DNA walks we found out that even R/S method shows higher values of the Hurst exponent than DFA, the standard deviations are lower for R/S so that the confidence intervals are narrower and thus in our case R/S method is more precise. Nevertheless, we found out that both methods show similar results which become closer as the DNA sequence becomes longer.
So in this research we employ R/S analysis method for computing the Hurst exponent which is described in the next section using a sample.

Computation of the Hurst exponent using R/S analysis
R/S analysis is described in many literatures as a famous method for calculating the Hurst exponent of time series. Applying this concept to the DNA sequence, the Hurst exponent can be calculated for DNA walk. The same principle which is applied in case of time series also can be applied to DNA sequence. The calculations are explained here through a sample. Considering: where s is a letter on the sequence of l letters long, and Calculated from table 1, the sum of movements for the entire sequence of length l would be Consequently, Based on the last discussion the value of H suggests that there exists good persistence in the DNA walk as it is between 0.5 and 1.
In order to make a comparison, some of published values of H for DNA sequences and the value computed in Equation (13) are brought in Table 2.
As it can be seen in Table 2, there are good correlations between our governed value in Equation (13) and other researchers' computed values in all cases for normal DNA.
Estimation of the Hurst exponent by R/S analysis is more laborious than this naive estimation. In this research R/S value is calculated for l, l/2, l/4,…, and l/2 n , and for each division of l, average R/S is calculated again. Then a linear regression line is obtained from plotting log(R/S) versus logl. Then, the slope of the linear graph is the estimated Hurst exponent. In this research we calculate

Spectra of fractal dimension
In this section we explain the fractal dimension as a measure of complexity of the fractals and accordingly DNA walk.
The concept of fractal dimension is based on the concept of generalized entropy of a probability distribution, introduced by Renyi [23]. In case of a DNA walk with ξ max and ξ min , where the total range of the value is divided into N bin: The probability that the value falls into the i-th bin of size δξ is computed as: where N i equals the number of items the value falls into the i-th bin. On the other hand, in case of a DNA walk: where s i is letter in the i-th bin in the entire sequence of length l.
Starting with the letter of order q of the probability w i , the Renyi entropy is: Note that for q → 1: The generalized fractal dimensions of a given DNA walk with the known probability distribution are defined as: (19) where the parameter q ranges from − ∞ to + ∞ . Note that for a self-similar (simple) series with equal probabilities w i = 1⁄N , equation (19) yields ‫ﬡ‬q = ‫0ﬡ‬ for all values of q. Also, note that for a constant value, all probabilities except one become equal to zero, whereas the remaining probability value equals unity.
For a given DNA walk, the function ‫ﬡ‬q, corresponding to the probability distribution of walk, is called the fractal spectrum. Indeed, a larger value of the fractal dimension for a given DNA walk corresponds to the presence of more pronounced fluctuations (sharper fluctuations, less expected values of the DNA walk) than in the DNA walks for which the value of fractal dimension of the same order is less. Furthermore, DNA walks with a wider range of fractal dimensions can be termed more fractal than DNA walks whose range of fractal dimensions is narrower, so that DNA walks with the zero range are self-similar (simple) fractals. Now, if the unexpectedness of an event is defined as the inverse of the probability of this event, then steeper spectra correspond to the series in which unexpected values are more dominant, whereas flatter spectra represent those series in which less unexpectedness occurs [9].

Data collection
In order to do the analyses, the sequences of interest were collected from 60 patients with skin cancer (30 male and 30 female) and 60 subjects in complete healthy conditions (30 male and 30 female) that all of them are 35 ± 3 years old. It is noteworthy that patients are in early steps of melanoma cancer. Patients didn't receive any treatment (chemotherapy, etc.) before their recruitment. Before doing the experiments each subject was interviewed by a physician to describe the nature of experiments and then informed consent was obtained from them. It is noteworthy that all procedures were approved by the Internal Review Board of the University. Identity of all subjects remains confidential.
In this research sample were taken from subjects' skin. Samples were homogenized in the TissueLyser II from Qiagen for 1 min at 28 Hz, centrifuged for 60 s at  [21] human beta globin purine-pyrimidine representation 0.708 [22] synthetic model sequence 0.655 [22] DNA genetic sequences 0.663 Governed in Equation (13) 3000 × g and incubated at 56°C for 30 min. After centrifugation at 6000 × g for 20 min the clarified samples were transferred to MACHEREY-NAGEL 8-well strips on the Microlab ® STAR and further processed according to the MACHEREY-NAGEL NucleoSpin ® 8 Plant protocol. The MACHEREY-NAGEL NucleoSpin 8 Plant kit was used for the extraction of genomic DNA based on vacuum filtration. The extraction yields high quality DNA suitable for further analyses.

Data analysis
In order to do the analyses a program was written in MATLAB to generate the DNA walk for the sequences. This program maps the whole genomic DNA sequences using purine-pyrimidine binary rule by changing purines (A/G) to −1 and pyrimidines (C/T) to +1. This procedure creates the DNA walk along genome. After this being established the DNA walk is analysed by computing the Hurst exponent and Fractal dimension.