Dihedral Reductions of Cyclic DNA Sequences

The data-analytic methodology of dihedral reductions for cyclic orbits of distinct-base codons is described both in terms of Fourier analysis over the dihedral groups and in (algebraically equivalent) terms of canonical projections. Numerical evaluations are presented for discrete and continuous scalar data indexed by cyclic orbits.


Introduction
Group-theoretic arguments have a long history in biology, of which R.A. Fisher's classification of segregation genotypes in the theory of polysomic inheritance is a classical example [1]. In the theory of experimental designs, it was also Fisher who demonstrated the explicit usefulness of cyclic groups in the theory of confounding in factorial experiments [2,3], now widely used in biology and genetics studies. In recent decades, the theory and applications of algebraic methods in statistics and probability have become a well-established area of interest, e.g., [4,5].
In structural biology as well, symmetry arguments have been used to formulate working hypotheses and to suggest explanation and prediction, e.g., [6][7][8][9][10]. An explicit connection between symmetry arguments and data-analytic reasoning in structural biology can be exemplified by the study [11] of the evolutionary importance of purine and pyrimidine content in the human immunodeficiency virus type 1, based on the statistical assessment of the frequency diversity of cyclic sets, defined as the ratio $\max_{S \in O} f_S / \min_{S \in O} f_S$ (1) of extreme frequency counts ($f$) in the cyclic set $O$ evaluated over a given region of the genome. In the context of symmetry studies, the frequency diversity is but one of the many possible data summaries indexed by cyclic sets or orbits. In the present communication, the frequency diversity, as well as cyclic summaries such as the raw sum $\sum_{S \in O} f_S$ (2) of frequency counts along the orbit, will be shown to share similar algebraic and data-analytic structures. More specifically, the present communication aims to show that there is a broader group-theoretic and data-analytic framework within which the methodology described in [11] can be identified and further utilized, eventually leading to richer biological interpretations and explanatory narratives.
The framework of interest (Symmetry Studies) was described originally in [12] and is briefly reviewed in [13]. We will also refer to [14] for notions of Fourier analysis over the finite groups relevant to the present applications. See also [15][16][17][18] for related discussions, and [19,20] for applications in the field of linear optics.
This paper is organized as follows. The basic definitions, assumptions and notation are introduced in the next section. The cyclic reductions are discussed in Section 3. Numerical evaluations are presented in Section 4. Additional background material is presented in Appendices A, B, and C.

Definitions, Assumptions and Notation
Sequences in length of $\ell$ are regarded as mappings in $A^L$, from the set of positions $L = \{1, \ldots, \ell\}$ into the alphabet $A = \{A, G, C, T\}$, where, typically, permutations $\tau$ in subgroups $G$ of the full symmetric group $S_\ell$ act on the left according to $\tau \cdot s = s\tau^{-1}$ (3). The inverse of the group element appearing in the group action is necessary so that its defining property $\eta \cdot (\tau \cdot s) = (\eta\tau) \cdot s$ can be verified. Permutations $\sigma$ in subgroups $H$ of $S_4$ act on the right according to $\sigma s$ (4), and subgroups $G \times H$ of the direct product group $S_\ell \times S_4$ act bilaterally according to $\sigma s \tau^{-1}$ (5). In what follows, we will indicate by $O_S = \{\tau \cdot S : \tau \in C_\ell\}$ the orbit generated by the left action of the cyclic group $C_\ell$ on a given sequence $S \in A^L$. We will refer, generically, to these sets as cyclic orbits or cyclic sets.
Throughout this communication, sequences written in lower case will always indicate the cyclic orbit generated by the corresponding sequence, to be written in upper case. For example, act = {ACT,CTA,TAC}, agct = {AGCT,GCTA,CTAG,TAGC}.
It follows directly from (4) that two sequences S and F are complementary if $F = \sigma S$, where $\sigma = (AT)(GC)$ is the permutation in $S_4$, written in cycle notation, representing the standard DNA complementarity of base pairs. For example, ACT and TGA are complementary sequences, and in that case we also say that their corresponding orbits act and tga are complementary. Specifically, $\sigma\,$act $= \{\sigma$ACT, $\sigma$CTA, $\sigma$TAC$\} = \{$TGA, GAT, ATG$\} =$ tga.

Injective Sequences
The dihedral reductions to be considered in this communication are obtained for DNA sequences in length of three (or codons) composed of distinct bases from {A,G,C,T}, that is, for the injective mappings into A with domain L = {1, 2, 3}. There are 4 × 3 × 2 = 24 such injective codons, which factor into 24/3 = 8 distinct cyclic orbits of length three.
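The count of 24 injective codons and 8 cyclic orbits can be checked by direct enumeration. The following is a minimal sketch; the helper name `cyclic_orbit` is an illustrative choice and not part of the original study.

```python
from itertools import permutations

BASES = "AGCT"

def cyclic_orbit(seq):
    """Return the cyclic orbit of seq as a sorted tuple of its rotations."""
    return tuple(sorted(seq[i:] + seq[:i] for i in range(len(seq))))

# Injective codons: ordered triples of distinct bases.
codons = ["".join(p) for p in permutations(BASES, 3)]

# Two codons generate the same orbit exactly when they are rotations of each other.
orbits = sorted({cyclic_orbit(c) for c in codons})

print(len(codons), len(orbits))
for orbit in orbits:
    print(orbit)
```

Each printed tuple lists the three rotations of one orbit, so that, for instance, ACT, CTA and TAC appear together.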
Although the group actions (3)-(5) are defined for all mappings in $A^L$, the resulting data-analytic applications may need to be adapted when non-injective sequences are included, because the resulting actions may no longer be transitive. In that case, the data analysis is carried out piecewise within the transitive parts [12]. In addition, because the actions on the injective sequences are faithful, any experimental results indexed by the points in the orbit are in one-to-one correspondence with the group elements, and can consequently be indexed by the group elements themselves. It is in the resulting group algebra structure that Fourier transforms can naturally be defined.

Scalar Measurements
Throughout this paper it will be opportune to distinguish the following types of experimental data:
• Data indexed by sequences, $x : A^L \to R$, indicated by $x_S$;
• Data indexed by cyclic orbits, $x : O \to R$, indicated by $x_s$;
• Data indexed by group elements, $x : G \to R$, indicated by $x_\tau, x_\sigma, \ldots$.
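For example, the frequency diversity of Equation (1) for act, in terms of frequency counts $x_S$ over a given region of the genome, is given by $x_{act} = \max\{x_{ACT}, x_{CTA}, x_{TAC}\} / \min\{x_{ACT}, x_{CTA}, x_{TAC}\}$ (7), whereas the raw sum of Equation (2) for the same orbit is $x_{act} = x_{ACT} + x_{CTA} + x_{TAC}$ (8).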

Orbit Invariance
Every symmetry orbit has an intrinsic arbitrariness in the choice of its generating point, so that the resulting orbit is the same regardless of its generator. For example, recalling Equation (6), act = cta = tac. Therefore, one would want the corresponding data summaries $x_{act}$, $x_{cta}$, $x_{tac}$ to be stable, or invariant, under different choices of orbit generators. Equations (7) and (8) are both orbit invariants, since the maximum, the minimum and the sum are taken over the whole orbit. This is a universal requirement that applies to all summaries obtained from data indexed by symmetry orbits.
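The invariance of the diversity (7) and of the raw sum (8) under the choice of generator can be illustrated with a short sketch. The frequency counts below are hypothetical values chosen for illustration only, and the function names are not taken from the original study.

```python
def rotations(seq):
    """All cyclic rotations of seq, i.e. the points of its cyclic orbit."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]

def diversity(counts, generator):
    """Ratio of the largest to the smallest count over the orbit (assumes no zero count)."""
    values = [counts[s] for s in rotations(generator)]
    return max(values) / min(values)

def raw_sum(counts, generator):
    """Sum of the counts over the orbit."""
    return sum(counts[s] for s in rotations(generator))

# Hypothetical frequency counts, for illustration only.
counts = {"ACT": 12, "CTA": 8, "TAC": 6}

for generator in ("ACT", "CTA", "TAC"):
    print(generator, diversity(counts, generator), raw_sum(counts, generator))
```

All three generators give the same pair of summaries, because each of them generates the same set of rotations.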
A class of data summaries with this (orbit) invariance property, as shown in [14], is given precisely by the Fourier transforms $\langle x, \xi \rangle = \sum_{\tau \in G} x_\tau \xi_\tau$, evaluated at the (irreducible) representations $\xi$ of $G$. The invariance property says that, regardless of the different orbit relabelings $\tau x$, their Fourier transforms $\langle \tau x, \xi \rangle$ stay bound to certain well-defined (irreducible representation) subspaces of the original data module or vector space [14]. That is, $\langle \tau x, \xi \rangle = \xi_\tau \langle x, \xi \rangle$ (9), so that the transforms reduce as the corresponding irreducible characters. A class of (faithful) group actions on the cyclic orbits that allows us to identify $x_\tau$ with $x_s$ and evaluate the Fourier transforms (or orbit invariants) will be introduced in the next section.

Dihedral Orbits
The dihedral groups $D_n$, for n = 3, 4, . . ., can be realized as the group $C_n = \{1, r, r^2, \ldots, r^{n-1}\}$ of rotations of a regular n-sided polygon, adjoined with the corresponding reversals $C_n h = \{h, rh, r^2h, \ldots, r^{n-1}h\}$, or h-mirrored rotations, giving $D_n$ a (non-commutative) group structure of order 2n. In addition, when n = 2, $D_2 \cong C_2 \times C_2$.
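The polygon realization can be sketched by writing rotations and reversals as permutations of the vertices 0, ..., n-1. This is an illustration for n ≥ 3 only (the vertex model degenerates for n = 2), and the function names are assumptions made for this example.

```python
def compose(p, q):
    """Permutation 'p after q': apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(p)))

def dihedral_group(n):
    """Return (rotations, reversals, r, h) for the symmetries of a regular n-gon, n >= 3."""
    r = tuple((i + 1) % n for i in range(n))   # rotation by one vertex
    h = tuple((-i) % n for i in range(n))      # reversal fixing vertex 0
    rotations = [tuple(range(n))]              # start from the identity
    for _ in range(n - 1):
        rotations.append(compose(r, rotations[-1]))
    reversals = [compose(rot, h) for rot in rotations]   # h-mirrored rotations
    return rotations, reversals, r, h

rotations, reversals, r, h = dihedral_group(4)
print(len(set(rotations) | set(reversals)))   # order of D_4
print(compose(r, h) == compose(h, r))         # commutativity check for r and h
```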

Invariant Reductions
In Diagrams (10) and (11), D 3 rotations and reversals are shown sideways along the rows of the diagrams and complementary orbits are shown along columns, so that each box is labeled by a cyclic orbit. We shall refer to the orbits in each of the diagrams simply as conjugated orbits.
In addition, each orbit is labeled by the polarity (⊕, ⊖) of the sequence's strand and by the encoding sense (→) or anti-sense (←) direction with which a gene or protein product reads off the sequence. More specifically, following (10), if any point in the act orbit is labeled with a positive polarity and with a reading sense direction then:
• The corresponding point in the tca orbit has positive polarity and the reading is in the anti-sense direction;
• The corresponding point in the tga orbit has negative polarity and the reading is in the sense direction;
• The corresponding point in the agt orbit has negative polarity and the reading is in the anti-sense direction.
Diagram (11) shows the complementary orbits gct and agc, with the same polarity and direction interpretation as in Diagram (10). Figure 1 shows a configuration space for the conjugated cyclic orbits of Diagrams (10) and (11), relative to which Figure 2 shows, respectively on the left and right images, the common direction and common polarity configuration subspaces. In this configuration space (which is not unique), same-direction subspaces span two intersecting tetrahedra, whereas same-polarity subspaces span two parallel faces of the configuration space.

D 2 -Invariant Reductions
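There is a transitive faithful action of $C_2 \times C_2 \cong D_2$ on the cyclic orbits s of (10), given by $(\tau, \sigma) \cdot s = \sigma s \tau^{-1}$, where $\tau$ is either the identity or the reversal $(13)$ of the positions in the sequence, and $\sigma$ is either the identity or the complementarity permutation $(AT)(GC)$. Specifically,

(τ, σ)              act  tca  tga  agt
(1, 1)              act  tca  tga  agt
((13), 1)           tca  act  agt  tga
(1, (AT)(GC))       tga  agt  act  tca
((13), (AT)(GC))    agt  tga  tca  act

As a consequence, any experimental data $x_s$ indexed by the orbits (s) in the diagram can be reduced by the tools of dihedral Fourier analysis over $D_2$. We emphasize that the transitivity and faithfulness of the $D_2$ action on the set of orbits is necessary to identify the orbits with the group elements and then proceed to the determination of the ($D_2$) orbit invariants using the Fourier transforms. Following [14], and writing s for $x_s$, these four one-dimensional transforms are simply act + tca + tga + agt, act + tca − tga − agt, act − tca + tga − agt, and act − tca − tga + agt, and constitute a set of one-dimensional orbit invariants for the data [14]. Similar reductions can then be obtained for (11).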
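The four one-dimensional $D_2$ transforms are signed sums of the four orbit summaries. The sketch below assumes the orbit ordering act, tca, tga, agt (identity, reversal, complement, both); the names "polarity", "direction" and "interaction" are illustrative labels, and the data values are hypothetical.

```python
ORBITS = ("act", "tca", "tga", "agt")

# Characters of C2 x C2, listed in the order of ORBITS.
CHARACTERS = {
    "trivial":     (1,  1,  1,  1),
    "polarity":    (1,  1, -1, -1),   # act, tca (positive) against tga, agt (negative)
    "direction":   (1, -1,  1, -1),   # act, tga (sense) against tca, agt (anti-sense)
    "interaction": (1, -1, -1,  1),
}

def d2_transforms(x):
    """Fourier transforms of the data x (a dict orbit -> scalar) over D2."""
    return {name: sum(c * x[s] for c, s in zip(chi, ORBITS))
            for name, chi in CHARACTERS.items()}

# Hypothetical orbit summaries, for illustration only.
x = {"act": 30, "tca": 22, "tga": 25, "agt": 27}
t = d2_transforms(x)
print(t)

# Relabeling the orbits by the reversal changes each transform by a sign at most.
relabeled = {"act": x["tca"], "tca": x["act"], "tga": x["agt"], "agt": x["tga"]}
print(d2_transforms(relabeled))
```

The second print shows the same magnitudes as the first, which is the sense in which the transforms are orbit invariants.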

Entropy Invariants
The orbit invariants determined above are functions of any scalars $x_s$ obtained over the orbit s, such as its diversity (7), its raw sum (8), or its total molecular weight. When the $x_s$ are positive integers, such as the sums of frequency counts over the orbits s, the entropies (Ent) of the observed distributions (L) of frequency counts given by L(act, tca, tga, agt), L(act + tca, tga + agt), L(act + tga, tca + agt), and L(act + agt, tca + tga) are also orbit invariants [17,21]. In Section 4 these orbit invariants are evaluated for Diagrams (10) and (11) to describe a specific DNA sequence and query it for potential structural variations along the genome.
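A minimal sketch of these entropy invariants follows. It assumes that the relative entropy is the Shannon entropy of the observed frequencies divided by its maximum value log k for k classes; this normalization is an assumption made here for illustration, and the counts are hypothetical.

```python
import math

def relative_entropy(counts):
    """Shannon entropy of the observed frequencies, divided by log(k) (assumed normalization)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

def entropy_invariants(x):
    """Relative entropies of the joint and of the three pooled two-class distributions."""
    a, b, c, d = x["act"], x["tca"], x["tga"], x["agt"]
    return {
        "joint":       relative_entropy([a, b, c, d]),
        "polarity":    relative_entropy([a + b, c + d]),
        "direction":   relative_entropy([a + c, b + d]),
        "interaction": relative_entropy([a + d, b + c]),
    }

# Hypothetical orbit sums, for illustration only.
x = {"act": 30, "tca": 22, "tga": 25, "agt": 27}
print(entropy_invariants(x))
```

Because the entropy does not depend on the order of the classes, relabeling the orbits by a reversal or a complement leaves all four values unchanged.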

D 4 -Invariant Reductions
There are three non-equivalent transitive right actions $\sigma s$ of $D_4$ on the set $O_3$ of all injective cyclic orbits in length of three, jointly reducing Diagrams (10) and (11), each generated by a distinct pair of permutations in $S_4$. The action of $D_4 \cong \langle (ATGC), (AG) \rangle$ on the orbits act, tca, tga, agt, gct, tcg, cga, and agc is given in (16), thus showing that any experimental data $x_s$ indexed by $O_3$ can be reduced by dihedral Fourier transforms over $D_4$ or by the corresponding canonical projection decompositions. These two views, both leading to the identification of the orbit invariants, are outlined next.
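The action of the group generated by (ATGC) and (AG) on the orbits can be computed by substituting bases in a generating codon. The sketch below lists the image of act under the eight elements, written as r^k followed by h; the order of composition and the element labels are conventions assumed for this example.

```python
BASES = "AGCT"

def cycle(letters):
    """Base substitution for a cycle such as 'ATGC' (A->T->G->C->A); '' is the identity."""
    mapping = {b: b for b in BASES}
    for i, b in enumerate(letters):
        mapping[b] = letters[(i + 1) % len(letters)]
    return mapping

def then(f, g):
    """Substitution that applies f first and then g."""
    return {b: g[f[b]] for b in BASES}

def orbit_of(seq):
    """Cyclic orbit of seq as a frozenset of its rotations."""
    return frozenset(seq[i:] + seq[:i] for i in range(len(seq)))

r, h = cycle("ATGC"), cycle("AG")

# The eight elements: r^k and r^k followed by h, for k = 0, 1, 2, 3.
elements = {}
power = cycle("")
for k in range(4):
    elements[f"r{k}"] = power
    elements[f"r{k}h"] = then(power, h)
    power = then(power, r)

images = {name: orbit_of("".join(sub[b] for b in "ACT"))
          for name, sub in elements.items()}
for name, orbit in images.items():
    print(name, sorted(orbit))
```

The eight images are distinct, which is the transitivity on the eight injective orbits; the four rotations send act to act, agt, tcg, cga and the four reversals send it to gct, tga, tca, agc.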

Canonical Projections
The action (16) defines a linear representation (17) of $D_4$ in $R^8$. Similarly, the action generated by $D_4 \cong \langle (AGCT), (AC) \rangle$ yields a linear representation of $D_4$ in $R^8$. The resulting canonical projections $P_\xi$, indexed by the irreducible representations $\xi \in \{1, \alpha, \gamma^+, \gamma^-, \beta\}$ of $D_4$ and evaluated for the representation (17), are shown in Appendix C, along with the components $x'P_\xi x$ of the resulting decomposition $\|x\|^2 = \sum_\xi x'P_\xi x$ of the sum of squares. More generally, $\|x\|^2 = \sum_\xi x'P_\xi \Phi x$, where $x'\Phi x$ is a Euclidean fundamental form [22]. These decompositions are often used for the statistical analysis of continuous data (analysis of variance).
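The canonical projections can be sketched with the standard formula $P_\xi = (n_\xi/|G|) \sum_\tau \chi_\xi(\tau^{-1}) \rho_\tau$, applied to the permutation representation on the eight orbits. The orbit ordering, the element labels (k, j) for r^k followed by h^j, and the data vector below are assumptions made for illustration; the projections of Appendix C may be arranged differently.

```python
from fractions import Fraction

BASES = "AGCT"
LABELS = ["ACT", "AGT", "TCG", "CGA", "GCT", "TGA", "TCA", "AGC"]  # one generator per orbit

def cycle(letters):
    """Base substitution for a cycle such as 'ATGC'; '' gives the identity."""
    m = {b: b for b in BASES}
    for i, b in enumerate(letters):
        m[b] = letters[(i + 1) % len(letters)]
    return m

def then(f, g):
    """Substitution that applies f first and then g."""
    return {b: g[f[b]] for b in BASES}

def orbit_of(seq):
    return frozenset(seq[i:] + seq[:i] for i in range(len(seq)))

ORBITS = [orbit_of(s) for s in LABELS]

def perm_matrix(sub):
    """8x8 permutation matrix of a base substitution acting on the orbits."""
    m = [[0] * 8 for _ in range(8)]
    for j, s in enumerate(LABELS):
        m[ORBITS.index(orbit_of("".join(sub[b] for b in s)))][j] = 1
    return m

# Group elements labeled (k, j) for r^k followed by h^j.
r, h = cycle("ATGC"), cycle("AG")
group, power = [], cycle("")
for k in range(4):
    group += [(k, 0, power), (k, 1, then(power, h))]
    power = then(power, r)

# (dimension, character) of each irreducible representation; all characters are real.
CHARS = {
    "1":      (1, lambda k, j: 1),
    "alpha":  (1, lambda k, j: (-1) ** j),
    "gamma+": (1, lambda k, j: (-1) ** k),
    "gamma-": (1, lambda k, j: (-1) ** (k + j)),
    "beta":   (2, lambda k, j: 0 if j else (2, 0, -2, 0)[k]),
}

def projection(dim, chi):
    p = [[Fraction(0)] * 8 for _ in range(8)]
    for k, j, sub in group:
        m = perm_matrix(sub)
        for a in range(8):
            for b in range(8):
                p[a][b] += Fraction(dim * chi(k, j), 8) * m[a][b]
    return p

P = {name: projection(dim, chi) for name, (dim, chi) in CHARS.items()}

def quad(p, x):
    """Quadratic form x' p x."""
    return sum(x[a] * p[a][b] * x[b] for a in range(8) for b in range(8))

x = [30, 27, 21, 19, 25, 25, 22, 24]   # hypothetical data, in the order of LABELS
components = {name: quad(p, x) for name, p in P.items()}
print(components, sum(components.values()))
```

The components add up to the sum of squares of the data, and the trace of each projection gives the dimension of the corresponding invariant subspace.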

Interpretation of the Components x P ξ x
The particular representation (17) leads to the following interpretation of each of the non-trivial (orbit invariant) components $x'P_\xi x$ of $\|x\|^2$, in terms of the combinations of polarity (⊕, ⊖) and direction (→, ←) corresponding to the components of $x$, where, for simplicity of notation, we let s (the labels) indicate $x_s$ (the data indexed by those labels).
The projection $P_\alpha$ identifies a one-dimensional invariant comparing the overall mean effects between rotations and reversals. The projection $P_{\gamma^+}$ identifies a one-dimensional invariant combining the overall within-rotation sensitivity to polarity given overall direction variation, as assessed by act − agt + tcg − cga, with the corresponding within-reversal variation, assessed by gct − tga + tca − agc. The projection $P_{\gamma^-}$ identifies a one-dimensional invariant contrasting the same two variations. Lastly, the projection $P_\beta$ identifies a two-dimensional invariant assessing direction given polarity effects.

Dihedral Fourier Analysis
Reading from the column under the act orbit in (16), the points in the group algebra $\mathbb{C}D_4$ are given by (21), in which the identity is assigned to act, the remaining rotations to agt, tcg, cga, and the reversals to gct, tga, tca, agc, from which we obtain the corresponding Fourier transforms
$\langle x, 1 \rangle$ = act + agt + tcg + cga + gct + tga + tca + agc,
$\langle x, \alpha \rangle$ = act + agt + tcg + cga − (gct + tga + tca + agc),
$\langle x, \gamma^+ \rangle$ = (act − agt + tcg − cga) + (gct − tga + tca − agc),
$\langle x, \gamma^- \rangle$ = (act − agt + tcg − cga) − (gct − tga + tca − agc),
together with the 2 × 2 matrix transform $\langle x, \beta \rangle$. Parseval's equality $\|x\|^2 = \sum_{\xi \in \hat{D}_n} \frac{n_\xi}{2n} \|\langle x, \xi \rangle\|^2$ establishes the correspondence with the decomposition Equation (20) obtained in terms of the canonical projections. It is opportune to remark here that in the definition of Equation (21) we arbitrarily assigned the identity in $D_4$ to $x_{act}$. Any of the other potential assignments would be precisely a relabeling of the orbit's starting point. The Fourier transforms, however, would remain orbit invariant, in the sense of Equation (9).
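Parseval's equality can be checked numerically for $D_4$. The sketch below assumes the assignment of the orbits act, agt, tcg, cga, gct, tga, tca, agc to the elements 1, r, r², r³, h, rh, r²h, r³h, uses one particular real 2 × 2 realization of β chosen here for illustration, and evaluates hypothetical data.

```python
ORDER = ["act", "agt", "tcg", "cga", "gct", "tga", "tca", "agc"]
ELEMENTS = [(k, j) for j in (0, 1) for k in range(4)]   # (k, j) stands for r^k h^j

R = [[0, -1], [1, 0]]    # beta(r): rotation by a quarter turn
H = [[1, 0], [0, -1]]    # beta(h): a reflection

def matmul(a, b):
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)] for i in range(2)]

def beta(k, j):
    """Matrix of the two-dimensional irreducible representation at r^k h^j."""
    m = [[1, 0], [0, 1]]
    for _ in range(k):
        m = matmul(m, R)
    return matmul(m, H) if j else m

def fourier(x):
    """Fourier transforms of x (a dict orbit -> scalar) at the five irreducibles of D4."""
    vals = [x[s] for s in ORDER]
    pairs = list(zip(ELEMENTS, vals))
    t = {
        "1":      sum(v for _, v in pairs),
        "alpha":  sum((-1) ** j * v for (k, j), v in pairs),
        "gamma+": sum((-1) ** k * v for (k, j), v in pairs),
        "gamma-": sum((-1) ** (k + j) * v for (k, j), v in pairs),
    }
    b = [[0, 0], [0, 0]]
    for (k, j), v in pairs:
        m = beta(k, j)
        for a in range(2):
            for c in range(2):
                b[a][c] += v * m[a][c]
    t["beta"] = b
    return t

def parseval_rhs_times_8(t):
    """8 times the right-hand side of Parseval's equality (dimensions 1, 1, 1, 1, 2)."""
    one_dim = sum(t[name] ** 2 for name in ("1", "alpha", "gamma+", "gamma-"))
    return one_dim + 2 * sum(e * e for row in t["beta"] for e in row)

# Hypothetical orbit summaries, for illustration only.
x = {"act": 30, "agt": 27, "tcg": 21, "cga": 19, "gct": 25, "tga": 25, "tca": 22, "agc": 24}
t = fourier(x)
print(t)
print(8 * sum(v * v for v in x.values()), parseval_rhs_times_8(t))
```

The two numbers in the last line agree, which is Parseval's equality with 2n = 8.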

Numerical Evaluations
In this section we apply the cyclic reductions described in Section 3 to specific complete genomes of the human immunodeficiency virus type 1 and the hepatitis C virus.

Relative Entropy Study of the HIV1 BRUCG Isolate
Following Section 3.2, the data indexed by the cyclic orbits are simply the sums $x_s$ of the frequency counts $x_S$ with which the sequence S occurs in a given region of the genome, that is, $x_s = \sum_{S \in s} x_S$. The frequency counts were evaluated by scanning the genome one base at a time in the 5′-3′ direction. The sequence in FASTA format was downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov). Computations were evaluated using the Symmetry Computing Toolbox (© M. Viana). This particular HIV1 isolate, used here for numerical illustration only, also appears in the study of the HIV1's evolutionary properties [11].
The frequency counts were obtained for the complete genome of human immunodeficiency virus type 1, isolate BRU (LAV-1), sequence ID gi:326417, accession number K02013.1, HIVBRUCG. See [23]. The full 9229 bp-long sequence was partitioned into six equal-length adjacent regions numbered 1-6, where the cyclic summaries $x_s$ were evaluated. The frequency counts for the conjugated cyclic orbits of act corresponding to Diagram (10) are shown in (22) and (23), whereas (24) and (25) show the frequency counts for the conjugated cyclic orbits of gct corresponding to Diagram (11). Figure 3 shows the resulting relative entropy invariants, as defined in Section 3.2, both for the act and for the gct conjugated cyclic orbits.
Figure 3. The top profiles show (in the order of the legends, top to bottom) the relative entropy for the joint distributions of (act, tca, tga, agt); (act + tca, tga + agt); (act + tga, tca + agt), and (act + agt, tca + tga). The bottom profiles display the corresponding results for the conjugated orbits of gct.
For example, reading from Region 3, in Equations (24) and (25) Figure 3. The top part of Figure 3 shows the corresponding profiles for the GCT conjugate cyclic orbits.
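The counting step described in this section can be sketched as a one-base sliding window applied within each region. How the original computation treats windows that straddle region boundaries, or the remainder when the genome length is not a multiple of six, is not stated; the choices below (windows kept within regions, remainder assigned to the last region) are assumptions, and the short sequence is a toy example rather than genome data.

```python
def rotations(seq):
    """All cyclic rotations of seq."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]

def region_orbit_counts(genome, generators, n_regions=6, k=3):
    """Per-region orbit sums x_s, counted with a one-base sliding window of width k."""
    size = len(genome) // n_regions
    table = []
    for i in range(n_regions):
        start = i * size
        # Assumption: the last region absorbs any remainder.
        end = len(genome) if i == n_regions - 1 else start + size
        region = genome[start:end]
        counts = {g.lower(): 0 for g in generators}
        for pos in range(len(region) - k + 1):
            word = region[pos:pos + k]
            for g in generators:
                if word in rotations(g):
                    counts[g.lower()] += 1
        table.append(counts)
    return table

# Toy sequence, for illustration only.
toy = "ACTACTACTGAT"
for region, counts in enumerate(region_orbit_counts(toy, ("ACT", "TGA"), n_regions=2), 1):
    print(region, counts)
```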

Statistical Assessment
The statistical assessment of the entropy can be obtained by numerically evaluating its sampling distribution based on 10,000 randomly generated observations from the posterior (Beta) distribution conjugate to the binomial likelihood for the data, relative to the uniform prior probability distribution. Based on the resulting sampling distribution, a numerical evaluation of a posterior 95% credibility interval (CI) for the relative entropy can be obtained.
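The simulation described above can be sketched as follows, under stated assumptions: a uniform prior gives the posterior Beta(x1 + 1, x2 + 1) for binomial counts (x1, x2), the relative entropy of a two-class distribution is its entropy in bits, and the interval is taken as equal-tailed. The function names and the fixed seed are illustrative choices.

```python
import math
import random

def relative_entropy2(p):
    """Entropy, in bits, of the two-class distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def credibility_interval(x1, x2, draws=10_000, level=0.95, seed=0):
    """Equal-tailed posterior interval for the relative entropy, by simulation."""
    rng = random.Random(seed)
    samples = sorted(relative_entropy2(rng.betavariate(x1 + 1, x2 + 1))
                     for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper

# Uniform reference of total binomial size 208.
print(credibility_interval(104, 104))
```

For the uniform reference (104, 104) the simulated interval should fall close to the interval (0.983009, 0.999997) quoted in the text, up to simulation variability.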
For example, reading from Region 5, in (24) and (25), we have the relative entropies, with their credibility intervals, for polarity, for direction, and for the polarity-direction interaction or residual term. The credibility intervals thus suggest that the drop in polarity uncertainty in Region 5 is statistically distinct from the other two uncertainties and from the relative entropy of a uniform distribution of the same total binomial size, namely (104, 104), whose interval is (0.983009, 0.999997). Figure 4 shows the posterior 95% credibility bands for the relative entropy of binomial distributions of total sample size $n = x_{gct} + x_{tcg} + x_{cga} + x_{agc} = 208$, from which the above credibility intervals can be identified. The range of the nomogram is half the total binomial sample because of the entropy (orbit) invariance property.

Orbit Diversity Decomposition for the HIV1 Samples
In this section we apply the canonical decomposition introduced in Section 3.4 and evaluated in Appendix C to reduce the diversity data, defined in Equation (1), indexed by the joint set of conjugated orbits (the conjugated orbits of act adjoined to the conjugated orbits of gct), using the $D_4$ action defined in Section 3.3. The orbit diversity for the joint set of conjugated orbits is shown in (26).
The error term (due to sampling variability) is included in the canonical decomposition for the sample by tensoring the decomposition induced by the representation of interest, shown in Appendix C, with the standard canonical decomposition $I_n = A + Q$ [12] (Chapter 4), where $A$ is the n × n projection matrix with all entries equal to 1/n, $I_n$ is the n × n identity matrix and $Q = I_n - A$. The canonical decomposition for the sample is then $I_{gn} = I_g \otimes I_n = (P_1 + P_\alpha + P_{\gamma^+} + P_{\gamma^-} + P_\beta) \otimes (A + Q)$, from which we obtain the (multivariate) analysis of variance shown in (27), where $x'(P_\xi \otimes A)x$ are the sample mean effects, $x'(P_\xi \otimes Q)x$ the sampling error terms, $x'(I_g \otimes Q)x$ the total sampling error, and $g$ is the group order. More specifically, it is assumed that $x \sim N(\mu \otimes e, \Sigma \otimes I_n)$, where $e = (1, \ldots, 1)'$ has n components, $\mu$ is the vector of dihedral means, and $\Sigma$ is the dihedral covariance structure. It then follows, for all $P = P_\xi$, that $(P \otimes A)x \sim N((P\mu) \otimes (Ae), (P\Sigma P') \otimes (AA'))$. Because $A' = A$, $A^2 = A$ and $Ae = e$, we have $(P \otimes A)x \sim N((P\mu) \otimes e, (P\Sigma P') \otimes A)$. Similarly, because $Qe = 0$, $(P \otimes Q)x \sim N(0, (P\Sigma P') \otimes Q)$.
The degrees of freedom in each case are obtained by the traces of the corresponding projections, which are also equal to the dimension of the projecting (invariant) subspaces. Under suitable parametric assumptions the magnitude of the ratios can be assessed by (typically non-central) F-distributions with $n_\xi^2$ and $g(n - 1)$ degrees of freedom. The corresponding underlying parametric hypotheses $\mu'P_\xi\mu = 0$ are those introduced earlier in Section 3.4.
Under large-sample parametric assumptions and independent dihedral covariance structure it follows that, with the exception of the contrast associated with γ − , all F-ratios are significantly high (statistically distinct from zero).

Orbit Diversity Decomposition for the HCV Samples
This section replicates the methods described in Section 4.3 for a sample of 10 Brazilian hepatitis C sequences. The orbit diversity for the joint set of conjugated orbits (act, tca, tga, agt, gct, tcg, cga, agc) for each sequence in the sample is shown in (28). Their accession numbers are referenced in Appendix B.1. The corresponding analysis of variance decomposition is shown in (29). It follows, from a comparison of (27) and (29), that the two viruses have significantly distinct joint cyclic diversity profiles. Additional numerical studies are referenced in Appendix B.

Summary
In this communication we constructed a dihedral $D_2$ reduction of conjugate injective cyclic orbits in length of three, a dihedral $D_4$ reduction of their combined set, and a dihedral $D_3$ reduction of the set of conjugate injective cyclic orbits in length of four. In each case, the experimental scalar data can be any summary obtained over the cyclic orbits, such as the sum or an extreme value of the frequency counts over the cyclic orbit, the entropy of a frequency distribution over the orbit, its amino acid content, or, as in [11], the orbit's frequency diversity. In the case of matrix data, the data-analytic methods of group rings, instead of group algebras, would be the appropriate methodology [14].

A. HIV1 and HCV Sequences
The following are the accession numbers for the HIV1 and HCV sequences considered in the present study:
The relative entropy evaluations illustrated above in Section 4.1 were replicated for a sample of 10 Brazilian HIV1 sequences, referenced in Appendix A. The raw frequency counts and the corresponding relative entropy profiles for each of the 10 sequences are linked in [24].

B.2. Relative Entropy Study of 10 Brazilian HCV Sequences
Similarly to the study for the HIV1, a sample of 10 Brazilian hepatitis C sequences was evaluated for their relative entropy. The sequences are referenced in Appendix A. The raw frequency counts and the corresponding relative entropy profiles along each genome are linked in [25]. The relative entropy invariant profiles clearly highlight the structural differences between the two types of viruses.

B.3. Relative Entropy Study of Random Reference Sequences
It is statistically useful to compare the cyclic reductions obtained for HIV1's isolate described above with those from random DNA sequences of comparable lengths. The results, based on 20 random sequences, shown in [26], clearly indicate that the observed variations in relative entropy (invariants) for the conjugated gct orbits, both for HIV1 and HCV sequences, are well below what one would expect to observe for random sequences of comparable lengths.