Probabilistic model based on circular statistics for quantifying coverage depth dynamics originating from DNA replication

Background With the development of DNA sequencing technology, static omics profiling in microbial communities, such as taxonomic and functional gene composition determination, has become possible. Additionally, the recently proposed in situ growth rate estimation method allows the applicable range of current comparative metagenomics to be extended to dynamic profiling. However, with this method, the applicable target range is presently limited. Furthermore, the characteristics of coverage depth during replication have not been sufficiently investigated. Results We developed a probabilistic model that mimics coverage depth dynamics. This statistical model explains the bias that occurs in the coverage depth due to DNA replication and errors that arise from coverage depth observation. Although our method requires a complete genome sequence, it involves a stable to low coverage depth (>0.01×). We also evaluated the estimation using real whole-genome sequence datasets and reproduced the growth dynamics observed in previous studies. By utilizing a circular distribution in the model, our method facilitates the quantification of unmeasured coverage depth features, including peakedness, skewness, and degree of density, around the replication origin. When we applied the model to time-series culture samples, the skewness parameter, which indicates the asymmetry, was stable over time; however, the peakedness and degree of density parameters, which indicate the concentration level at the replication origin, changed dynamically. Furthermore, we demonstrated the activity measurement of multiple replication origins in a single chromosome. Conclusions We devised a novel framework for quantifying coverage depth dynamics. Our study is expected to serve as a basis for replication activity estimation from a broader perspective using the statistical model.

139 Circular distributions and statistics 140 The distributions and statistics used in this study are shown in Table 1 144 Ruxton, 2013). In addition to major distributions, we introduced a linear cardioid distribution and 145 exponential linear cardioid distribution to evaluate the coverage depth trend. These functions are 146 symmetric around the location parameter, and the integral around a unit circle is 1; i.e.,

147
. For each distribution, the probabilistic PTR can be ∫ π π ( | , ) = ∫ 2π 0 ( | , ) = 1 148 analytically defined as the ratio between the maximum and minimum value of the probability 149 density function (see the Statistical model to estimate replication rate section for details). 150 Some of the models (von Mises, cardioid, wrapped Cauchy, and Jones-Pewsey) were 151 symmetrically or asymmetrically extended with or without inverse transformation, as previously 152 described ( (1 + )sin ( -) 160 not change the normalization constant. To make the distribution asymmetric with mode-161 invariance, we used the mode-invariance asymmetric transformation extension (MIAE) or 162 inverse-transformed mode-invariance asymmetric transformation extension (InvMIAE). These 163 transformations satisfy the requirement that asymmetricity be analyzed in replication. As 164 replication begins at the origin irrespective of the rapidity of bacterial growth, the highest 165 coverage depth position is preserved regardless of the asymmetry level. The skewness parameter 166 must not affect the pPTR when the pPTR and skewness are measured independently. In these 167 transformations, the symmetricity around the mode changes via an additional skewness 168 parameter, while the location parameter, concentration parameter, and pPTR remain unchanged 169 even if the skewness parameter changes. MIAE transforms the angular variable into ( ) = -

170
, where is the skewness parameter. This transformation also requires a normalization  (1) 216 As is considered to be continuous rather than discrete for the purpose of these distributions, we 217 confirmed that the parameters could be estimated appropriately (Supplementary Text S1). 218 Equation (1) coincides with a term that changes with the probability parameter in the log-219 likelihood equation of the multinomial distribution shown in equation (2) although the sum of the 220 probability over all nucleotides is not 1 because the distribution is not discrete: Following the model with the likelihood represented by equation (1), the location 223 parameter corresponds to the position of the replication origin as long as the concerned 224 chromosome does not have multiple replication origins. Contrastingly, the concentration 225 parameter is associated with growth as it determines the shape of the distribution. Therefore, we 226 allowed the location parameter to be shared among all of the samples and the concentration 227 parameter for each sample to be independent. For the concentration parameters, we set the half-228 Student's t-distribution as a prior distribution (Gelman, 2006). We set 2.5 as the shape 229 parameter; 0 as the location parameter; and 0.2 (von Mises, Jones-Pewsey), 0.1 (cardioid), 0.17 230 (wrapped Cauchy), 0.105 (linear cardioid), and 0.1103(exponential linear cardioid) as the scale 231 parameters. These were selected such that the value of the cumulative probability density 232 function became nearly 0.8 when the PTR was 2.0. This characteristic suggests that most of the 233 PTRs are distributed between 1.0 and 2.0 in an environment. The distribution of the coverage 234 depth PTR in a previous study rationalizes this suggestion . For the degree of 235 density of the Jones-Pewsey distribution, the peakedness of the symmetric extended distribution, 236 and the skewness parameter of the asymmetry extended distribution, we set the Gaussian 237 distribution with a location parameter of 0 and a scale parameter of 1.0 as the prior distribution to 238 avoid overfitting. From the model, we introduced an estimate that expresses the degree of 239 growth. It is known that many microbes in prokaryotes replicate their chromosomal DNA on 240 both sides from the origin such that the apparent amount of DNA increases near the origin. This 241 behavior introduces a latent bias that makes DNA near the origin more likely to be observed 242 during replication. This bias is simply expressed by the concentration parameter in a circular 243 distribution. However, for consistency with the previous study , we defined a 244 probabilistic PTR (pPTR), which is the ratio of the maximum of probability density function to 245 the minimum, i.e., , as a growth dynamics index. This score represents the . Then, the ratio is likely to assume a similar value, which defines 270 the equality of the mixture. As the activity index for multiple replication origins, we defined a 271 weighted pPTR (wPTR) and mean-weighted pPTR (mwPTR). The wPTR of the mth replication 272 origin is computed via a weighted concentration parameter using a mixture ratio, where mwPTR 273 is the average of these. For example, the wPTR of the von Mises-based model is given by exp(2

274
, and mwPTR is given by . These scores are based on the model that 301 For sampling, we set the number of chains to 1 and the number of iterations to 500; the first 300 302 iterations were considered a warm-up and discarded. From the samples, we calculated the EAP 303 as a representative model parameter. The convergence of the sampling was checked using Rhat 304 statistics. If the statistics were equal to or less than 1.1, the result was accepted (Gelman et al., 305 2013). We used composite Simpson's algorithm with 20 subintervals to calculate the integrals in 306 the models. We computed the inverse transformation of the functions using Newton's method 307 when the skewness or peakedness parameter was less than or greater than 0.8. These 308 thresholds were evaluated by performing manual simulations such that the transformations did 309 not oscillate. We checked the convergence of the function as to whether the error was less than 310 the machine precision of a float-type variable defined using Stan. The Illinois method was used 311 in other cases (Dowell & Jarratt, 1971). Then, we checked the convergence of the function, as to 312 whether the error was less than 1.0×10 -13 . In both methods, we defined a maximum iteration 313 count that terminated the calculation at 30 iterations for Newton's method and 100 iterations for 314 the Illinois method. We set the mathematics transformation to reasonably estimate the 315 parameters. If a location parameter is estimated directly from 0 to , and the true location 2π 316 parameter is located near the edge of the range, the parameter estimation is likely to fail as it 317 does not detect cyclicity. We re-parameterized the location parameter as for the  (Li et al., 2009). Next, a moving median filter with a 100 nt window 331 size and a 100 nt stride length was applied. The moving median filter runs through the coverage 332 depths, replacing each coverage with the median of neighboring locations. We applied the 333 moving median filter for the following two reasons: to reduce the noise and outliers and to 334 reduce the data size, which affects the computational time of the model fitting. If the coverage 335 depth seemed to have noise regions that increased the coverage due to highly conserved regions 336 such as ribosomal genes, an additional filtering for outliers was performed; specifically, the top 337 1% of the coverage depth was removed and replaced with blank coverage. This decision and the 338 threshold, which were determined based on a previous study , were 339 independently evaluated using a frequency histogram of the coverage depth containing a noise 340 region, which increases the coverage in multiple datasets (Supplementary Text S3). Certain 341 regions that remained blank following filtering were filled with 0. As the WGS reads of H. 342 volcanii were separated into multiple FASTQ files, we concatenated them into a single file based 343 on growth conditions prior to alignment. To evaluate the error in the sequence edge parts, we 344 copied 263 nt in the head portion of the genome sequence to the tail portion. We also constructed 345 a graph genome sequence, circularized it, and mapped WGS reads to the graph genome sequence 346 using the variation graph toolkit (vg) with the default parameter set (Garrison et al., 2018). We 347 omitted the GC-content correction procedure to simplify the pipeline as the method using the 348 full-length genome sequence seemed to be robust (Brown et  The accuracy of the growth estimates was evaluated via comparisons with experimental growth 356 rates. Unless otherwise noted, the experimental growth rates were calculated from the colony 357 formation unit (CFU)/ml, optical density (OD), or relative abundance using = 358 following the approach taken in a previous study ; 359 e.g., the experimental growth for Supplementary Fig. 3c  . We added a time delay for the correlation coefficient between the experimental and 366 growth estimates following the approach of the previous study . The time 367 delay, which provided three or more combinations of the growth estimates and experimental 368 growth rates and yielded the highest correlation coefficient, was accepted. 369 370 371 Effect of normalization on the model 372 We evaluated the effect of normalization on the parameter estimation by checking the difference 373 between an estimated distribution and the true one. For the evaluation datasets, we generated 374 continuous angles from the von Mises distribution. These angles were binned into discrete angles 375 following the defined discrete length. We selected the location parameter from , , 0, and -ππ 2 π 2 376 ; the concentration parameter from 0.1, 0.4, and 0.7; the discrete length to be segmented from 5, 377 10, 30, 120, 600, and 1,000; and the average coverage depth to be observed from 0.5×, 1.0×, 378 2.0×, 4.0×, 8.0×, and 16.0×. Next, we fitted the unnormalized and normalized von Mises 379 distribution-based models to the simulated data. Finally, we evaluated the error of the estimated 380 distribution using the Kullback-Leibler divergence (Kullback & Leibler, 1951 (1 + 2 cos( -)) = 2π ∑ = 1 cos ( ) = 0 391 cardioid distribution, it was formulated as . The  (Seemann, 2014). Next, we mapped these predicted sequences to the genome sequence of 402 Lactobacillus gasseri using Bowtie2 and annotated regions as overlapping if two or more hits 403 were obtained. The parameter set of Bowtie2 was "-a --very-sensitive." Thereafter, we extracted 404 the partial coverage depth from 1.0-1.4 Mnt regions in which annotated features did not exist. To 405 obtain the odds ratio, 36 metagenomic sequences obtained from the previous study (Korem et al., 406 2015) were mapped, and the coverage depth was calculated via the method described above. 407 Next, the coverage depth for each sample was sorted, and the number of annotated bases in 5% 408 of the upper, lower, and total sequence was counted. We modeled the number n to follow a 409 binomial distribution with the total number of bases N of the target sequence and the appearance 410 probability as parameters: . The odds ratio was calculated as from ~Binomial( , ) = 1 -411 the appearance probability p. To estimate these parameters, we employed an MCMC algorithm 412 with NUTS using PyStan. Four chains were utilized, and 20,000 iterations were performed, 413 where the first 1,000 of these were discarded as warm-ups. The posterior distribution of the EAP 414 was used as the representative. 415 416 417 Evaluation of robustness in terms of coverage depth using culture dataset 418 To investigate the robustness of our proposed method, we compared the growth estimates 419 calculated using a sufficient amount of reads and those using rarefied reads. To evaluate 420 coverage depth dynamics from a single origin, we first used L. gasseri WGS samples with an 421 average of more than 20× coverage (n = 20). After confirming that many variations occur at 422 lower than 5.0× coverage, we selected Escherichia coli, Enterococcus faecalis, or L. gasseri 423 WGSs reads with more than 5× coverage from the dataset by Korem and colleagues. To evaluate 424 multiple origins of replication data, the WGS of S. solfataricus was used. We randomly sampled 425 reads from the FASTQ files using seqtk (Li, 2013)  . Based on the taxonomic profile, the combinations with 451 more than 0.1× coverage depth were selected as candidates (n = 16,413 combinations from 220 452 WGS samples) in the first screening. Next, we aligned the WGS reads of the datasets to the 453 database using Bowtie2 and calculated the coverage depth using SAMtools. After applying the 454 moving median filter and outlier elimination, we calculated the average coverage depth. If the 455 combination had more than 20.0× average coverage and passed the first screening, we concluded 456 that the combination was eligible to serve as an evaluation target (n = 676). For the targets, we 457 extracted paired-end aligned reads using SAMtools with the "view -f 2" option for PTRC. This 458 procedure was required because the GEM-mapper, which is used in PTRC, did not allow 459 singletons to be aligned. After that, we rarefied the reads using SAMtools with the "view -s" 460 option and converted the alignment result into a FASTQ file for the input. For our method, we 461 counted the coverage depth from the alignment result using SAMtools. After applying each 462 method, we calculated the error rate following equation (5). We used PTRC as a benchmark 463 method, as it provided the most stable estimation with low coverage in the culture WGS dataset. 464 505 506 Evaluation of robustness in terms of sample size 507 To assess the effect of the sample size on the estimates, partial sample sets were generated from 508 the full sample set, and the results were compared. We used the short-read sequences of E. coli, 509 E. faecalis, and L. gasseri published previously . The partial set was 510 configured to include 1, 4, 8, 12, 16, and 20 samples. Each set was distributed in a manner that 511 avoided duplication of the same sample. Except for the sample set with only one sample, the sets 512 were constructed to contain each sample at least 10 times. For each sample set, preprocessing 513 and inference were conducted according to the above procedure, and the error rate was 514 calculated in comparison with the results obtained when all of the samples were used 515 simultaneously.  (Parks et al., 2015) and largest genome size was used. Species without 563 complete genome sequences or those with multiple chromosome sequences were excluded. 564 Mobile genetic elements were excluded from the database by checking the sequence label using 565 seqkit; we filtered out the sequences labeled "plasmid," "Plasmid," "phage," "chromid," 566 "pMLa," and "Linear." 567 568 569 Growth estimate evaluation on metagenomic dataset 570 The growth dynamics were estimated at species-level resolution. We filtered low-quality reads in 571 WGS via Trimmomatic and then removed human-derived reads by aligning them to the GRCh38 572 reference human genomic sequence using Bowtie2. We used "SLIDINGWINDOW:4:15 MIN 573 LEN:36" as a parameter in Trimmomatic. After quality control, we aligned qualified 574 metagenomic reads with the complete genome sequence database using Bowtie2 with the "--575 very-fast" option. After extracting the alignment results of the target reference sequence, we 576 counted the coverage depth. As the metagenomic sequences were not clean compared to the 577 culture datasets, we performed additional filtering as described in the coverage depth calculation 578 method. After preprocessing was completed, we fitted the model to the coverage depth of each 579 sequence via the optimizing mode. For the estimations, we selected samples with greater than or 580 equal to 0.0001% relative abundance of the taxon and greater than 0.01× average coverage 581 depth. We used Kraken2 and Bracken with the complete genome sequence database to estimate 582 the relative abundance of the species. After filtering out the ultra-low coverage depth samples, 583 we excluded the samples that might not achieve random sampling from the chromosomal DNA 584 sequence. This is because the estimates of these samples would have an error. To detect the 585 invalid samples, we focused on the difference of actual zero coverage fraction f and a theoretical 586 score based on the Lander-Waterman theory (Supplementary Text S4). This theoretical score 587 can be obtained as , where a denotes the average coverage depth. For samples with ≈ ( -) 588 average coverage less than 5.0×, we excluded samples with log-scale fractions greater than 0.56 589 times the theoretical score. Assuming there to be uneliminated noise coverage depth, samples 590 with estimated PTRs greater than or equal to 3.0 were excluded. Welch's t-test for independent 591 groups was used to examine the differences between the growth estimates, and Hedges' g was 592 used to evaluate the effect size for the two groups. 593 594 595 Software 596 We implemented the statistical model using Stan (Carpenter et al., 2017). Wrapped by Python 597 scripts, this model is available for use in the command-line environment. This package also 598 contains a moving median filter, a visualizer, a statistics profiler based on directional statistics, 599 an information criterion calculator with estimated results, an asymmetric test calculator using 798 Here, we introduced a generative statistical model of coverage depth based on circular statistics 799 and evaluated the estimated growth dynamics, replication trend, and differences in wPTR among 800 multiple origins. In directional statistics, the simplest approach to expressing angular bias may be 801 the use of the MRL. Although the MRL of the coverage depth was correlated with the 802 experimental growth rate in the culture datasets, it is not as robust as estimates obtained via 803 statistical models. This statistic can be easily calculated even with poor computational resources, 804 but is not suitable for metagenomic datasets. Our proposed method was as accurate as the 805 previous methods when compared with the experimental growth rates, and furthermore, it was 806 robust against random mutations in the reference sequence and decreases in the coverage depth. 807 Conversely, it was sensitive to a decrease in coverage depth due to mutations concentrated in a 808 specific direction as well as to an increase in coverage depth due to conserved regions. In future 809 research, it is expected that the rapid increase or decrease in coverage depth will be modeled to 810 more accurately estimate the dynamics of the coverage depth. The simplest approach is not to 811 use coverage depth in regions that are expected to be ineligible, as has been done in previous 812 studies. However, this filtering approach alone does not provide a reasonable estimation for the 813 proposed model as it also uses the absence of observation for parameter estimation. If a valid 814 region can be assumed, ineligible regions could be excluded by normalizing the likelihood [ , ] 815 function to satisfy . In applying the proposed method to the coverage depth obtained 816 from the metagenomic sequence, the average coverage depth and random sampling properties 817 must be examined, as was done here. Although we did not utilize it in this study, one of the 818 advantages of a GLM is its ability to incorporate covariate effects into the model. If one wants to 819 evaluate the relationship between the covariates x and pPTR, it is suitable to use a link function 820 for the concentration parameter. For example, when the von Mises distribution model is used, let 821 be the coefficient of the covariates; then satisfies the requirement, i.e., = exp( 0 + 1 1 + ⋯)

822
. For a wrapped Cauchy distribution, the inverse logit function is appropriate to satisfy > 0

.
864 from the beginning of the culture. However, the degree of density remained low for an hour in 865 our study. The trend observed in the anaerobic cultured L. gasseri was the opposite of what was 866 seen in the others; however, it is worth noting that L. gasseri required 90-120 minutes of 867 adjustment to have a sufficient correlation between the experimental growth rate and estimated 868 growth dynamics. This suggests that both the activation of DNA replication and cell division are 869 required to decrease the degree of density. Accordingly, we inferred that the degree of density 870 and peakedness may indicate the activity of multiple replication forks. In contrast, the skewness 871 parameters in E. coli and L. gasseri did not change dynamically during the experimental 872 duration. Additionally, we confirmed the presence of a strong correlation between the skewness 873 of the Watson and Crick strands, implying that the amount of DNA remains symmetric between 874 the Watson and Crick strands as well as between the leading and lagging strands. 875 In addition to the application to microbes with a single replication origin, we extended 876 the model's application to microbes with multiple origins in a single chromosome. One 877 interesting finding was the difference in wPTR among multiple origins. From the relationship 878 between the intermediate position of the origins and the split position of the chromosome 879 sequence based on wPTR, the efficiency, in terms of the activity of the origins, was 880 quantitatively confirmed. If only a single replication origin was active in a chromosome, 881 considerable time could be required for whole-genome replication, which would be a 882 disadvantage for survival. By properly activating the origins at a distance, replication may be 883 efficiently completed. However, our results indicated that not all replication origins exhibit 884 similar activity. There are various characteristics that cause the activity to differ, such as 885 (a)synchronous initiation , replication fork speed (Elshenawy et al., 2015), 886 and so on. Therefore, the mechanism underlying the differences observed for each replication 887 origin must be clarified, and the characteristics of neighboring genes must be investigated. 888 The current study was affected by certain limitations. First, the proposed method requires 889 circular genome sequences for accurate estimation. As several methods involving contig or 890 scaffold-level sequences have already been proposed for estimating the quasi-growth of bacteria 891 Emiola & Oh, 2018;Gao & Li, 2018), it is recommended that these methods 892 be properly used depending on the accuracy requirements. It is difficult to detect trends in the 893 amount of DNA other than the coverage depth bias or to estimate the bias in chromosomes with 894 multiple replication origins using these methods. We consider our method to be appropriate for 895 data analyses related to detailed replication profiles. 896 Second, the taxonomic resolution is limited to the species level in our method, on account 897 of the first limitation. When the growth estimates of a reference strain were calculated using 898 metagenome samples containing different but closely related strains, their growth dynamics were 899 found to be different, indicating that the pPTR distributions may be mixed. This difficulty 900 regarding the taxonomic resolution has yet to be solved via growth rate estimation, which may 901 give rise to major challenges in environments such as soil, wherein many closely related species 902 are contained because of empty niches and/or microstructures (Dumbrell et (Langenheder & Szekely, 2011). This 908 resolution problem may be solved by constructing pan-genome sequences from metagenomic 909 reads and allocating coverage depth appropriately. Third, the evaluation scope of the extended 910 model is limited. Although we evaluated and eliminated the possibility of overfitting in our 911 dataset, we cannot deny the possibility that the dynamics of the peakedness and stability of the 912 skewness around the origin are specific to the three strains we used. External validations are 913 expected to confirm the variability of the peakedness or stability of the skewness over the growth 914 phase. Finally, in our method as well as all currently proposed methods for estimating bacterial 915 growth, the estimate itself is only a proxy of the growth rate. Theoretically, (p)PTR for a taxon t 916 in a sample s is represented by , where C t is the replication period and is