Discrete Circular Distributions with Applications to Shared Orthologs of Paired Circular Genomes

For structural comparisons of paired prokaryotic genomes, an important topic in synthetic and evolutionary biology, the locations of shared orthologous genes (henceforth orthologs) are observed as binned data. This and other data, e.g., wind directions recorded at monitoring sites and intensive care unit arrival times on the 24-hour clock, are counted in binned circular arcs, thus modeling them by discrete circular distributions (DCDs) is required. We propose a novel method to construct a DCD from a base continuous circular distribution (CCD). The probability mass function is defined to take the normalized values of the probability density function at some pre-fixed equidistant points on the circle. Five families of constructed DCDs which have normalizing constants in closed form are presented. Simulation studies show that DCDs outperform the corresponding CCDs in modeling grouped (discrete) circular data, and minimum chi-square estimation outperforms maximum likelihood estimation for parameters. We apply the constructed DCDs, invariant wrapped Poisson and wrapped discrete skew Laplace to compare the structures of paired bacterial genomes. Specifically, discrete four-parameter wrapped Cauchy (nonnegative trigonometric sums) distribution models multi-modal shared orthologs in Clostridium (Sulfolobus) better than the others considered, in terms of AIC and Freedman’s goodness-of-fit test. The result that different DCDs fit the shared orthologs is consistent with the fact they belong to two kingdoms. Nevertheless, these prokaryotes have a common favored site around 70° on the unit circle; this finding is important for building synthetic prokaryotic genomes in synthetic biology. These DCDs can also be applied to other binned circular data.


Introduction
Most prokaryotic genomes (1158 out of 1194, NCBI, August 2010) are made up of single circular chromosome (henceforth called circular genomes). Here, our emphasis is on the structure of circular genomes, which plays an important role in synthetic and evolutionary biology. For example, the most and least favored regions in which shared orthologous genes (henceforth orthologs) between bacterial genomes are located are of interest and important; orthologs are genes directly evolved from an ancestoral gene [Tatusov, Koonin and Lipman (1997)], and can be traced through different species across evolution. Further, genome organization may influence gene expression, which is vital for organisms [Carrera, Rodrigo and Jaramillo (2009)].
While studying structural comparisons between paired circular genomes [Shieh, Zheng, Johnson et al. (2011)], we found that the locations of shared orthologs are often observed as binned data. Since shared orthologs and other data, e.g., wind directions recorded at monitoring sites and intensive care unit arrival times on the 24-hour clock, are counted in binned circular arcs, modeling them by discrete circular distributions (DCDs) is required, which was the motivation for this study. Circular (angular) data can be represented as points on the circumference of a unit circle, e.g., wind directions at a monitoring site, and others [Fisher (1993); Jammalamadaka and SenGupta (2001); Johnson and Wehrly (1977); Mardia and Jupp (2000)]. Circular data are modeled by distributions on the circle, namely circular distributions. Continuous circular distributions (CCDs) can be generated using several methods such as projection and wrapping. The wrapping approach is an effective method for generating a probability density function (pdf) on the circle from a pdf on the line. For example, a wrapped normal distribution is constructed by wrapping a normal distribution onto the unit circle. Similarly, any distribution on integers can be wrapped around the circumference of a unit circle to construct the probability mass function (pmf) on the circle. For instance, the wrapped Poisson distribution is constructed by wrapping Poisson distribution onto the unit circle [Mardia and Jupp (2000)]. Moreover, wrapped discrete skew Laplace (WDSL) and wrapped geometric distributions have been studied [Jayakumar and Jacob (2012); Jacob and Jayakumar (2013)], respectively. Recently, Mastrantonio and colleagues introduced the invariant wrapped Poisson (IWP) distribution and investigated the invariance properties [Mastrantonio, Jona Lasinio, Maruotti et al. (2015)]. The wrapping method can immediately constuct the corresponding circular distribution from a discrete distribution on R 1 ; however, the normalizing constants are not analytic in general.
In this article, we propose a novel method to construct DCDs for discrete circular data. The constructed DCD is defined on the n (≥2) pre-fixed equidistant points 2π j/n (j=0, 1, 2, …, n−1) of the circumference of a unit circle from the pdf f(θ) of the corresponding CCD on [0, 2π). The pmf of the DCD from a base pdf f(θ) is defined by with the normalizing constant C n ¼ P nÀ1 j¼0 f 2pj n . We note that in general, the number of divisions is determined by domain knowledge. For example, n=24 is commonly used for hourly data such as gun crime data [Mastrantonio, Jona Lasinio, Maruotti et al. (2015)] and intensive care unit arrival times on the 24-hour clock, while n=8 and 16 are used for wind directions at a monitoring site. As n ! 1, the explicit form of p n (j) is shown in Appendix A.
In the following section, we derive the normalizing constant of Eq. (1), then we present the characteristic function and trigonometric moments of DCDs. Next, families of discrete distributions constructed from von Mises, cardioid, nonnegative trigonometric sums [Fernández-Durán (2004)], wrapped Cauchy and four-parameter wrapped Cauchy [Kato and Jones (2015)] distributions are illustrated, and their trigonometric moments are presented. Furthermore, we study how well DCDs and the corresponding CCDs model the grouped circular data, and compares two estimation methods via Monte Carlo simulations. Finally, we apply the constructed DCDs, IWP and WDSL to model marginals of the shared orthologs between a pair of circular genomes; these prokaryotes belong to different kingdoms. The fitting results of our DCDs, IWP and WDSL are also compared. We close with some discussion in Section 6.

Methods
In this section, we derive the expressions of the normalizing constant and trigonometric moments of the pmf in Eq. (1); the former is required by a DCD while the latter are basic properties. For this purpose, the following proposition is useful.
Hereafter, we suppose that f(θ) is represented by a Fourier series where f p , α p and β p represent the characteristic function, pth cosine moment and pth sine moment of a circular random variable Θ with pdf f(θ) respectively, i.e., f p =E[e ipΘ ], α p =E[cos(pΘ)] and β p =E[sin(pΘ)].

The formula for the normalizing constants
From Proposition 1 and the Fourier series expansion for f(θ), we find the equation, for k=0, 1, 2, …, n−1, For k = n − 1 in the above equation, it holds that This equation leads to the identity of the normalizing constant C n in (1) as

Trigonometric moments
Suppose that a random variable Θ n follows a DCD with the pmf (1). To obtain its characteristic function f n,q =E[e iqΘn ]=α n,q +iβ n,q , it suffices to calculate the cases q=0, 1, 2, …, n−1, because obviously, f n,q+kn =f n,q for k¼1; 2; 3; . . . , and α n,−q =α n,q and β n,−q = −β n,q . From Proposition 1, we have This characteristic function leads to the expressions for the qth cosine and sine moments as a n;q ¼

Results
In this section, we show specifically how five families of DCDs are constructed from the corresponding CCDs. We use discrete von Mises, cardioid, nonnegative trigonometric sums [Fernández-Durán (2004)], wrapped Cauchy and four-parameter wrapped Cauchy [Kato and Jones (2015)] distributions to illustrate the construction. Similarly, other discrete circular distributions can be constructed.

Discrete von mises distribution
The pdf of a von Mises distribution with mean direction μ and concentration κ, VM(μ, κ), is for κ ≥ 0 and 0 ≤ μ < 2π. The pth cosine and sine moments are expressed by α p = I p (κ)cos(pμ) /I 0 (κ) and β p = I p (κ) sin(pμ)/I 0 (κ) respectively, where I p (·) is the modified Bessel function of the first kind and order p defined by The pmf of the corresponding discrete von Mises distribution is expressed by ; ¼ e j cosð2pj=nÀlÞ nfI 0 ðjÞ þ 2 P 1 p¼1 I np ðjÞ cosðnplÞg ; j ¼ 0; 1; 2; . . . ; nÀ1; which involves Bessel functions in the normalizing constant. This distribution is denoted by DVM n (μ, κ) or DVM n for short. For calculating this pmf numerically, the first expression is useful for small n, while the second one is useful for large n since I np (κ)<e 0.25z 2 (0.5z) np /Γ(np+1) and it converges to 0 quickly as p ! 1. From (2), the trigonometric moments are expressed as ðmod qÞ I 0 ðjÞ þ 2 P 1 p¼1 I pn ðjÞ cosðpnlÞ ðmod qÞ:

Discrete nonnegative trigonometric sums distribution
A family of nonnegative trigonometric sums distributions with order M, denoted as NTS M , is defined as with real numbers r m and c m such that P M m¼0 ðr 2 m þc 2 m Þ ¼1. To estimate the parameters, some constraint, like c 0 = 0, is imposed in order to make the parameters of the model identifiable [Fernández-Durán (2007)].
Since the pth cosine and sine moments of NTS M are α p =a p and β p =b p for p ¼ 1; 2; 3; . . . ; M and α p =β p =0 for p>M, the pmf of the corresponding discrete nonnegative trigonometric sums distribution with order M and division n is immediately obtained as . . . ; n À 1 and denoted by DNTS M ;n ða m ; b m ; m ¼ 1; 2; 3; . . . ; M Þ or DNTS M,n for short. DNTS M,n has 2M-parameters a m and b m ðm ¼ 1; 2; 3; . . . ; M Þ, and hereafter we suppose that the number of the equidistant points on the circle (n) is greater than or equal to 2M, or M ≤ n/2, for model identifiability.
From Proposition 1, the cumulative sum of p NTS (j) is expressed by . . . ; n À 1: The trigonometric moments are M < q < n À M ; a nÀq À ib nÀq ; n À M q n À 1: Note that the trigonometric moments of DNTS M,n are the same as those of NTS M . This means that the discretization from NTS M to DNTS M,n preserves the mean direction and the mean resultant length. When n=2M, the moments are expressible by This family includes discrete cardioid distribution (DC n ) as a special case.

Since
the pmf of the corresponding discrete wrapped Cauchy distribution, denoted by DWC(μ, ρ) or DWC n for short, is expressed by . . . ; n À 1 and its trigonometric moments are 3.5 Discrete four-parameter wrapped cauchy distribution The pdf of a four-parameter wrapped Cauchy (FWC) distribution is In this model, μ and γ play the roles of location and concentration respectively, and λ and ρ controls skewness and kurtosis of the distribution [Kato and Jones (2015)]. As a special case, this distribution reduces to the wrapped Cauchy distribution when λ=0 and γ=ρ. The pth cosine and sine moments are α p =γρ p−1 cos{p(μ+λ)−λ} and β p =γρ p−1 sin {p(μ+λ)−λ}, respectively.
Since X 1 p¼1 cq npÀ1 e ifnpðlþÞÀg ¼ cq nÀ1 fe inðlþÞÀ À q n e i g 1 þ q 2n À 2q n cosfnðl þ Þg ; taking its real part, we have Thus, the pmf of the corresponding four-parameter discrete wrapped Cauchy distribution is expressed by . . . ; n À 1 and is denoted by DFWC n (μ, ρ, λ, γ) or DFWC n for short. It can be shown that plugging λ=0 and γ=ρ into p FWC (j) results in p WC (j) of the discrete wrapped Cauchy distribution. The trigonometric moments are omitted to save space.

Estimation and simulation
In this section, we utilize two methods to estimate the parameters of the constructed DCDs. Next, we compare these methods under different sample sizes and number of divisions via a simulation study.

Estimation
Maximum likelihood based inference for grouped circular data is mentioned in Pewsey et al. [Pewsey, Neuhäuser and Ruxton (2013), Section 6.3.5]. As a special feature of our discretization method, however, there is a relationship between the log-likelihood functions for the base CCD and the corresponding DCD as follows.
Let f j , j=0, 1, 2,…, n−1, be the frequency observed at the value 2πj/n on a circumference and N¼ P nÀ1 j¼0 f j be the sample size of the dataset. Then, the log-likelihood function of DCD in (1) with parameter vector ψ for the dataset is given by f j log f ð2pj=n; wÞ À N log C n ðwÞ; which is equal to the log-likelihood function of the base CCD subtracting the penalty term N log C n (ψ) for the DCD. From this expression, the maximum likelihood estimation (MLE) of DCD does not correspond to that of its base continuous circular distribution in general. An alternative approach to estimate parameters of DCD is the minimum chi-square estimation (MCSE), where the estimates are given by minimizing Pearson's chi-square test statistic CSðwÞ ¼ X nÀ1 j¼0 ff j À Np n ðjÞg 2 Np n ðjÞ :

Simulation
We conducted a simulation study to compare how well the original CCDs and the corresponding DCDs model grouped data, in which various sample sizes N were generated with 5,000 Monte Carlo repetitions.
For a continuous distribution whose distribution function is not analytic, it is difficult to generate random numbers directly; see Appendix A for further explanation. Thus, we first generated random numbers from DCD with large divisions m=1,152 (=9×2 7 ) using the method in Appendix A, then we grouped them into equidistant division n=9, 18 and 36 of [0, 2π), each number in [2πj/n, 2π (j+1)/n) was rounded to (2j+1)π/n. The number of divisions 36 and 18 were chosen, because the number of orthologs in every 10°and 20°arcs were of interest in the Application section, while n=9 was studied as a small number of divisions.
From these rounded numbers, we estimated the parameters of the base CCD and the corresponding DCD by MLE and MCSE in the Estimation section and compared the fitting performance.
As measures of fitting performance, we use the chi-square test statistic v 2 ¼ X nÀ1 j¼0 ðf j À np j Þ 2 np j and Freedman's test statistic [Freedman (1981)], which is a modified version of Watson's U 2 statistic for discrete data, j ¼ 0; 1; 2; . . . ; nÀ1: Because the exact distributions of χ 2 and U *2 are intractable, the exact p-values are difficult to calculate. Thus, we calculated the approximated p-values by the bootstrap method. Specifically, we generated bootstrap samples χ 2(b) of χ 2 and U *2(b) of U *2 , b=1, 2, 3, …, B and B=10,000, under the true parameters. Next, we calculated bootstrapped p-values p B v 2 := #{χ 2(b) > x, b=1, 2, 3, …, B}/B and p B U Ã2 := # {U *2(b) > u, b=1, 2, 3, …, B}/B, where #A is the number of elements in the set A, and x and u are the realizations of the chi-square and Freedman's test statistics obtained from the estimated distribution, respectively.
We conducted the simulation for discrete cardioid, wrapped Cauchy and four-parameter wrapped Cauchy distributions. The results are very similar, so we only show the result of discrete four-parameter wrapped Cauchy distributions and omit the others. Tab. 1 shows the bias and mean squared error (MSE) of the estimates by MLE for FWC n (π, 0.6, π/2, 0.3) and those by MLE and MCSE for DFWC n (π, 0.6, π/2, 0.3) with their measures of fitting performance p B v 2 and p B U Ã2 . The original CCD fits the data as well as the corresponding DCD, only when the sample size is small (N=50) and the number of divisions is large (n=36). While in other cases, the DCD performs better than its corresponding CCD in terms of p B v 2 and p B U Ã2 . This result is natural since large n leads to a continuous sample while small n leads to a discrete sample.
Comparing the estimations, we see that when the number of divisions is large (n=36 for N ranging from 50 to 200), MLE is better than MCSE in the sense of p B U Ã2 , small bias and MSE of the estimated parameters. However, when the number of divisions and sample size are not large (n=9 and 18; N=50 and 100, respectively), MCSE outperforms MLE in most cases for estimating parameters and fitting to discrete circular data.
Our algorithm was written in Mathematica 8.0 and we optimized the functions L(ψ) and CS (ψ) in Section 4.1 by the NMaximize and NMinimize commands of the Nelder-Mead method. The running time (CPU: Core i7-5820K, 3.30 GHz) of the simulation was very fast. For N=50, 100 and 200 simulated data with the number of divisions n=9, 18 and 36, the running time was bounded by data grouped by 36 divisions. Of these, applying MLE and MCSE to DFWC 36 took 2.0 and 0.1 seconds, respectively, while applying MLE to FWC took about 2.8 seconds.

Application
Most prokaryotic (bacteria and archaea) genomes are made up of single circular chromosomes, called circular genomes. Orthologs are genes in different species that evolved directly from a common ancestral gene by speciation, and they can be used to  [Shieh, Zheng, Johnson et al. (2011)]. Comparisons of structures, e.g., the most and least favoured spots, between paired genomes are important and useful in the area of synthetic and evolutionary biology. The data were downloaded from NCBI (ftp://ftp.ncbi.nlm.gov/refseq/release/bacteria), and were preprocessed as stated in Shieh et al. [Shieh, Zheng, Johnson et al. (2011)] The processed data are available at http://www.stat.sinica.edu.tw/gshieh/DCDs/data.txt.
When comparing the genomes of paired bacteria, e.g., Clostridium and Sulfolobus, whose genomes were plotted at http://www.stat.sinica.edu.tw/gshieh/DCDs/genomes.pdf, in the different kingdoms, their shared orthologs are discrete when depicted in binned circumferences on unit circles. Fig. 1 shows rose diagrams of the shared orthologs between Clostridium and Sulfolobus. The distributions of the shared orthologs in both genomes look different, because they have evolved to belong to different kingdoms. Henceforth, we call the binned shared orthologs in Clostridium and Sulfolobus the Clostridium and Sulfolobus data, respectively, which are sometimes presented in degrees on the circumference of a unit circle, for clarity of depiction. The total number of shared orthologs N is 192 for both data sets. We use 18 pre-fixed equidistant (20°) bins and center the number of shared orthologs in each bin between 20j and 20(j+1) degrees for j ¼ 0; 1; 2; . . . ; 17, namely at 10 ; 30 ; 50 ; . . . ; 330 . For the Clostridium data, the sample mean direction is h ¼ 0:52 (in radian) and the sample mean resultant length is R ¼ 0:19 with p-value 0.001 of the Rayleigh test for uniformity based on 2N R 2 ¼ 13:41 which means that the Clostridium data is not uniform on the circle; for the Sulfolobus data, they are h ¼ 0:18 and R ¼ 0:14 with 2N R 2 ¼ 7:03 and p-value 0.03, respectively. Fig. 1 shows that the distribution of the Clostridium data looks asymmetric and has the mode around 350°, followed by 330°and 70°, while the Sulfolobus data looks symmetric and is multimodal with modes around 70°, 250°and 270°, respectively.
When fitting distributions to the Clostridium and Sulfolobus data, we apply the MCSE to estimate the parameters of DVM n , DC n , DNTS 2,n , DWC n and DFWC n . In addition, we also fit uniform distribution and two recently studied DCDs (IWP and WDSL), and compare their performances to those of our constrcuted DCDs. IWP assumes the following pmf.
. To apply this distribution, we need to approximate the infinite summation by truncation, and the truncation point depends on the parameter λ. For stable calculation, we restrict the parameter space of λ to 0<λ<60 and fix the truncation point at 100.
The WDSL distribution (Jayakumar and Jacob, 2012) has the following pmf.
Tabs. 2 and 3 show the MCSE for the parameters of each distribution, model selection criterion AIC and bootstrapped p-value of the Freedman's goodness-of-fit statistic p B U Ã2 for the Clostridium and Sulfolobus data, respectively. In the sense of AIC and p B U Ã2 , DFWC 18 gives the best fit for the Clostridium data with the smallest AIC and p-value 0.629. For the Sulfolobus data DNTS 2,18 gives the best fit with the smallest AIC and the largest p-value 0.882 (the significance level equal to 0.05). Figs. 2 and 3 show the linear histograms of the Clostridium data and Sulfolobus data, respectively, superposed by well-fitted distributions as line plots. As shown in Fig. 3, DNTS 2,18 models the Sulfolobus data better than the remaining ones.
Furthermore, in terms of running time (CPU: Core i7-5820K, 3.30 GHz), our constructed DCDs took less than 0.5 seconds, while IWP and generalized WDSL took 900 and five seconds, respectively. This is because IWP and generalized WDSL have discrete location  We summarize the modeling of the Clostridium and Sulfolobus data by DCDs as follows.
1. For the Clostridium data, asymmetric DCDs yield good fits and, among these, the DFWC 18 yields the best fit in terms of AIC and Freedman's goodness-of-fit test. 2. For the Sulfolobus data, DNTS 2,18 gives the best fit in terms of AIC and Freedman's goodness-of-fit test. Fig. 3 also shows that DNTS 2,18 fits the data better than the second best DFWC 18 , in terms of the line plots fitting to the linear histogram of the data. Note that DFWC 18 can not exhibit bimodality. 3. The above results suggest that these shared orthologs are distributed differently, which is consistent with the fact that these prokaryotes belong to different kingdoms. Nevertheless, the rose diagrams ( Fig. 1) show that these shared orthologs have a common favored region (near 70°) out of two to three favored regions. This finding is important for building synthetic prokaryotic genomes in synthetic biology.

Conclusions
We have investigated the construction of DCDs by generating the pmfs from circular pdfs. The normalizing constant is simply represented by the cosine moments of the base CCD, and the constructed discrete distributions are tractable. Simulation studies show that DCDs outperform the corresponding CCDs in modeling grouped (discrete) circular data, and MCSE is better than MLE. The constructed DCDs, IWP and WDSL were applied to compare the structures of shared orthologs in a pair of prokaryotes, an important topic in synthetic and evolutionary biology. Specifically, DFWC (DNTS) distribution is shown to model multi-modal shared orthologous genes in bacteria Clostridium (archaea Sulfolobus) well. We conclude that of the distributions considered, DFWC n fits the asymmetric Clostridium data the best in terms of AIC and Freedman's goodness-of-fit test, and DNTS 2,n fits the symmetric and multi-modal Sulfolobus data better than the remaining DCDs considered. Although these shared orthologs followed different DCDs, they do have a common favored region around 70°out of two to three top-favored regions. This finding is important for building synthetic prokaryotic genomes in synthetic biology.
The constructed DCDs are versatile, namely these families of distributions can fit both unimodal and multi-modal and symmetric and asymmetric discrete circular data. Moreover, the computation of our algorithm, consisting of estimation of the parameters and goodness of fit test, is very fast. The presented discretization method can be applied to any univariate CCDs, but it is limited to univariate circular distributions. Nevertheles, our method is the basis for bivariate DCDs which is of interest and have applications in many scientific areas. Therefore, we leave construction of discrete bivariate circular models for future study.