Signal Analysis on Strings for Immune-Type Pattern Recognition

We use wavelet-type discrete transforms for signal analysis on strings of finite length. We apply these transforms for edge and hidden Markov process detection. We also present new approaches for string matching and for measures of the diversity of chaotic strings.


Introduction
The immune system is one of the most effective pattern recognition systems. This system deals with RNA strings on proteins and viruses and is involved in several operations and recognition modes capable of:
• Detecting local particularities.
• Discriminating and taking decisions.
In order to partially model these recognition capabilities, we introduce new discrete transforms, providing an effective background for immune-type computational applications on spaces of strings. All these transforms exploit local information, as does the immune system.
In this review, we discuss the discrete tree transform (DTT), introduced in the Karanikas and Proios [10] model of antigen processing, and then we show (as in Atreas et al. [2]) how DTT achieves edge detection. We explain how we apply DTT structures to strings to detect hidden Markov processes, we introduce a measure for the diversity of strings based on fractal dimension formulae, as in Bisbas and Karanikas [6], and we present a novel string matching method based on analytic number theory.
As Felix Browder, the President of the American Mathematical Society, said in his Retiring Presidential Address [4]: In molecular biology, mathematics has a much greater role to play than people realize, even though mathematics has had, for example, a significant effect on the course of the genome project. There will be an even larger effect when it comes to analysing how the genome actually creates living cells ... The rituals of classical statistics no longer suffice to deal with many problems that people face, especially when they have large masses of data - and large masses of data are the basic ingredient of the modern world.
In this short review we hope to make clear that new mathematical methods applied to strings of biological data could open a new era for bioinformatics.

Definition 1
Let $p = 2, 3, \dots$. The p-adic approximation of a non-negative data collection $T = \{t_1, \dots, t_{p^N}\}$ is given by the numbers $R_{n,k}(T)$, $n = 0, \dots, N$, $k = 1, \dots, p^n$. The collection of these numbers has a p-adic tree structure with $N + 1$ generations, such that: $R_{0,1}(T)$ corresponds to the initial node of the tree; each $R_{n,k}(T)$, $n = 1, \dots, N - 1$, corresponds to the kth node of the nth generation; and $R_{N,k}(T)$ is the kth branch (or leaf) of the last generation.

Definition 2
The walks $a_{n,k}(T)$ are real numbers in whose definition $[x]$ denotes, for any real number $x$, the largest integer less than or equal to $x$. The DTT of T is the collection of all walks $a_{n,k}(T)$.
Obviously, DTT cuts data into successively smaller and smaller p-adic pieces (peptides), mimicking antigen processing. Local singularities are represented by sets of ratios called walks. Walks, as do peptides, represent local singularities and allow the reconstruction of the initial data. Indeed:

Proposition 1
The DTT of $T = \{t_1, \dots, t_{p^N}\}$ satisfies a multiplication formula that recovers each $R_{n,k}(T)$ from the walks. Note that $R_{N,k}(T) = t_k$. Thus, for $n = N$, the formula reconstructs the initial data set (leaves of the tree), while for $n < N$ it reconstructs the branches of the tree. The notion of DTT can be easily extended to finite strings, as shown in the following:

Example
Consider the binary walks ($p = 2$) of the data {c, t, g, c, a, a, a, t}. As do antigenic peptides, the walks show the singularities of the processed antigen and can reconstruct it; for example, to reconstruct the first element of {c, t, g, c, a, a, a, t}, multiply the related walks. DTT has several interesting properties, which we shall see next.
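The tree structure, the walks and the reconstruction property described above can be illustrated with a minimal numerical sketch. The defining formulas are not reproduced in this text, so the sketch rests on two assumptions: that $R_{n,k}(T)$ is the sum of the kth block of $p^{N-n}$ consecutive entries of T (which agrees with $R_{N,k}(T) = t_k$), and that the walk $a_{n,k}(T)$ is the ratio of $R_{n,k}(T)$ to its parent node $R_{n-1,[(k-1)/p]+1}(T)$, taken as 0 when the parent is 0. The function names are hypothetical.

```python
def block_sums(t, p):
    """Return levels with levels[n][k-1] = R_{n,k}(T) (ASSUMED block-sum definition)."""
    size, N = len(t), 0
    while p ** N < size:
        N += 1
    if p ** N != size:
        raise ValueError("len(t) must be a power of p")
    levels = [list(t)]                      # generation N: the leaves t_k
    while len(levels[0]) > 1:
        child = levels[0]
        # each parent node is the sum of its p children
        parent = [sum(child[i:i + p]) for i in range(0, len(child), p)]
        levels.insert(0, parent)
    return levels


def walks(levels, p):
    """Return a with a[n-1][k-1] = a_{n,k}(T) (ASSUMED child/parent ratio)."""
    a = []
    for n in range(1, len(levels)):
        row = []
        for k0, value in enumerate(levels[n]):   # k0 is the 0-based node index
            parent = levels[n - 1][k0 // p]
            row.append(value / parent if parent else 0.0)
        a.append(row)
    return a


def reconstruct(root, a, p, n, k):
    """Recover R_{n,k}(T) by multiplying the walks along the path from the root."""
    value = root
    for j in range(1, n + 1):
        ancestor = (k - 1) // p ** (n - j)       # 0-based ancestor in generation j
        value *= a[j - 1][ancestor]
    return value


# Example with p = 2 and N = 3 (eight non-negative data points).
data = [1, 3, 2, 2, 4, 0, 1, 3]
levels = block_sums(data, 2)
a = walks(levels, 2)
recovered = [reconstruct(levels[0][0], a, 2, 3, k) for k in range(1, 9)]
```

Under these assumptions, `recovered` reproduces `data` up to floating-point rounding, which is the sense in which the walks allow reconstruction of the leaves for $n = N$.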

Edge detection on 2D-plane curves
Edge detection of time series is a computational process consisting of operations aiming to detect extreme changes in the shape of a pattern. Since operations of DTT are capable of erasing short local variabilities and capturing the relevant extreme points, we have presented a method for edge detection of time series [2]. In this section we use DTT for detecting the singularities of 2D-plane curves: $T = \{(x_1, y_1), \dots, (x_{p^N}, y_{p^N})\}$, where p is a prime number and $N = 2, 3, \dots$

Definition 3
The p-adic approximation of T is given by the complex numbers $R_{n,k}(T)$. For any $n = 1, \dots, N - 1$, the norm of the nth p-adic approximation of T is denoted by $\|V_n(T)\|$.

Proposition 2
(a) There exists a unique index $1 < n_0 < N$ such that $\big|\, \|V_{n_0}(T)\| - \|V(T)\| \,\big|$ is minimum.
(b) Let $n_0$ be the index of T as in (a); if $P_{n_0,k}$ are the points in the plane represented by the complex numbers $R_{n_0,k}(T)$, then the set $J_k(T)$, defined in terms of the usual scalar product $\langle \cdot, \cdot \rangle$, determines the position of the relevant extreme points of the $n_0$-approximation of T.
(c) The set $\{t_{\alpha(s)} : \alpha(s) = p^{N-n} s,\ s \in J_k(T)\}$ determines the locations of the main edges of the graph T.

Example
We randomly select a curve of the plane consisting of 121 points (Figure 1). Applying Proposition 2, we get the extreme edges of the curve (Figure 2).

Edge detection of 2D-images
The simplest way to model the binding energy between proteins is in terms of a bilinear form (mechanical/chemical energy form) (see [12,13]).
The energy bilinear form is determined by a real rectangular matrix M. Using the singular value decomposition (SVD) of M, in which the singular vectors L, R can be considered as a mathematical model of 'antibody probes' while the real number (−s) is their binding energy, two-dimensional images are reduced to two one-dimensional 'antigens'. Then we use our DTT algorithm [2] for edge detection of time series.
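The reduction of a two-dimensional image to two one-dimensional signals can be sketched as follows. The text does not say which singular triplet is used, so this sketch assumes the leading one and computes it by plain power iteration in pure Python; the function name and the toy matrix are hypothetical, and the start vector of all ones is a simplification that fails if it happens to be orthogonal to the leading right singular vector.

```python
import math


def leading_singular_triplet(M, iters=200):
    """Return (s, L, R): leading singular value and unit left/right singular vectors.

    Assumption: only the dominant triplet is wanted. Uses alternating
    multiplication by M and its transpose (power iteration).
    """
    rows, cols = len(M), len(M[0])
    R = [1.0] * cols                     # start vector (assumed not orthogonal to the answer)
    L = [0.0] * rows
    s = 0.0
    for _ in range(iters):
        # L <- M R, normalised
        L = [sum(M[i][j] * R[j] for j in range(cols)) for i in range(rows)]
        norm_L = math.sqrt(sum(x * x for x in L))
        if norm_L == 0.0:
            return 0.0, L, R             # M annihilates R: nothing to extract
        L = [x / norm_L for x in L]
        # R <- M^T L; its norm estimates the singular value s
        R = [sum(M[i][j] * L[i] for i in range(rows)) for j in range(cols)]
        s = math.sqrt(sum(x * x for x in R))
        R = [x / s for x in R]
    return s, L, R


# Toy 'image': a rank-one 3 x 2 matrix, so the triplet recovers it exactly.
image = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
s, L, R = leading_singular_triplet(image)
```

The vectors `L` and `R` play the role of the two one-dimensional signals to which an edge detector for time series could then be applied; the binding energy in the text corresponds to `-s`.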

A method to identify hidden Markov processes
A hidden Markov model of a set of data $\{h(1), \dots, h(N)\}$ is a finite set of probabilities (a distribution) $B = \{b_1, \dots, b_{p^{m+1}}\}$, where p is a prime number and $m \geq 1$ is an integer called the Markov memory, such that: 1.
and Mod(m, n) gives the remainder of the division of m by n.
It is clear that the collection $\{d(n, j) : n = 1, \dots, M,\ j = 1, \dots, p^n\}$ has a tree structure with M generations. Obviously, $h(j)$ represents the overall probability of being at the jth branch of the Mth generation with respect to a certain concatenation of the branches of the tree structure. Now, given a part of a hidden Markov model of length N, $H = \{h(1), \dots, h(N)\}$, we shall detect p, m and $B = \{b_1, \dots, b_{p^{m+1}}\}$.
Our algorithm is the following:
(a) Let P be the set of all primes. For each $p_i \in P$ we find $M_i$ such that $p_i^{M_i} < N < p_i^{M_i + 1}$.
(b) For any $M_i$, determine the set:
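Step (a) of the algorithm can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the search is restricted to primes $p_i < N$ (an assumption, so that $M_i \geq 1$), and a prime is skipped when N is an exact power of it, since the strict inequalities then have no solution. Step (b) is not covered.

```python
def prime_exponent_candidates(N):
    """Map each prime p < N to the exponent M with p**M < N < p**(M + 1).

    Assumptions: only primes below N are considered, and primes of which
    N is an exact power are omitted (no M satisfies the strict inequalities).
    """
    # Sieve of Eratosthenes for the primes below N.
    is_prime = [True] * max(N, 2)
    is_prime[0] = is_prime[1] = False
    for q in range(2, int(N ** 0.5) + 1):
        if is_prime[q]:
            for multiple in range(q * q, N, q):
                is_prime[multiple] = False

    candidates = {}
    for p in range(2, N):
        if not is_prime[p]:
            continue
        M, power = 0, 1
        while power * p < N:             # largest M with p**M < N
            power *= p
            M += 1
        if power * p != N:               # enforce the strict upper bound
            candidates[p] = M
    return candidates


candidates = prime_exponent_candidates(100)
```

For a data set of length 100, for instance, the prime 2 is paired with $M = 6$ because $2^6 = 64 < 100 < 128 = 2^7$.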

On measuring the diversity of strings
It is well known that the immune system can effectively recognize a large variety of peptides of viruses.In the case of intrusion of unknown viruses, the immune reaction provides anti-viruses whose peptides differ significantly from what is 'stored' in the 'memory' of the immune system (innate immunity).
In computational analysis a typical measure of diversity is the entropy formula. The entropy of a string written in an alphabet of r letters or digits is given by the formula $-\sum_i p_i \log(p_i) / \log(r)$, where $p_i$ is the probability of appearance of the letter or digit i. This formula is unsatisfactory for measuring the diversity of strings (or collections of strings), because when the digits are almost equidistributed, the entropy is approximately 1. In fact, on a typical RNA we estimated the probabilities: 0.274, 0.192, 0.20 and 0.33 for A, C, G and T, respectively.
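The point about near-equidistributed letters can be checked directly on the probabilities quoted above. The sketch below assumes the standard normalised Shannon entropy, with the leading minus sign and with r taken as the alphabet size; the function name is hypothetical.

```python
import math


def normalized_entropy(probs, r=None):
    """Normalised Shannon entropy: -sum(p_i * log(p_i)) / log(r).

    Assumption: r defaults to the number of letters; zero probabilities
    contribute nothing to the sum.
    """
    r = len(probs) if r is None else r
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(r)


# Estimated probabilities for A, C, G and T quoted in the text.
rna_probs = [0.274, 0.192, 0.20, 0.33]
h = normalized_entropy(rna_probs)
```

For these four values `h` comes out at about 0.98, very close to the value 1 of a perfectly uniform four-letter alphabet, which illustrates why the entropy barely separates such strings.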