Dataset of normalized probability distributions of virtual bond lengths, bond angles, and dihedral angles for the coarse-grained single-stranded DNA structures

Coarse-grained (CG) models of single-stranded DNA (ssDNA) can drastically reduce the compute time needed to simulate ssDNA dynamics. The CG potentials matched to such a model, together with their potential constants, can be derived by coarse-graining experimentally measured ssDNA structures. A useful and widespread treatment is to use three pseudo-atoms, P, S, and B, to represent the phosphate, sugar, and base atomic groups, respectively, in each nucleotide of the ssDNA structures. These three pseudo-atoms give rise to nine structural parameters that characterize unstructured ssDNA conformations: three (virtual) bond lengths (P-S, S-B, and S-P) between two neighbouring beads, four bond angles (P-S-P, S-P-S, P-S-B, and B-S-P) between two adjacent bonds, and two dihedral angles (P-S-P-S and S-P-S-P) formed by three successive bonds. This paper presents the normalized probability distributions of these bond lengths, bond angles, and dihedral angles for CG ssDNAs.


Specifications
Subject: Biophysics
Specific subject area: The coarse-grained model for simulating single-stranded DNA dynamics
Type of data: Table and figure
How the data were acquired: The normalized probabilities of the structural parameters are obtained statistically by coarse-graining experimentally determined ssDNA structures [1-3]. UCSF Chimera (version 1.11.2) [4] is used to delete unnecessary atoms from the all-atom structures. X3DNA (version 2.4) [5] is used to label the unpaired nucleotides.
Data format: Raw and analysed
Description of data collection: The normalized probabilities are calculated as follows: (1) for each nucleotide, we calculate the centers of mass of the phosphate (P), sugar (S), and thymine base (B(T)); (2) for each ssDNA chain, the structural parameters are then calculated, including the virtual bond lengths, the bond angles, and the dihedral angles P_i-S_i-P_{i+1}-S_{i+1} and S_{i-1}-P_i-S_i-P_{i+1}, where the subscript is the nucleotide index; and (3) the normalized probabilities of the structural parameters are obtained statistically from the values calculated in step (2).
Data source location: All selected 3D ssDNA structures were downloaded from the website of the Protein Data Bank [6].

Value of the Data
• These data provide insight into the conformational changes of unstructured ssDNAs.
• These data can be used to support the development of new CG ssDNA models or the modification of existing models.
• Researchers interested in ssDNA dynamics and simulation models will benefit from the data presented in this paper.

Data Description
The PDB identifiers (PDBids) and the basic information (the number of chains N_c, the total number of nucleotides N_n, and the ssDNA type) of the selected ssDNAs are summarized in Table 1. More detailed information, such as the sequences, is deposited in the Mendeley Data database (Table 1: sequences of the selected ssDNA structures). Fig. 1 shows the normalized probability distributions of the structural parameters of the CG ssDNAs: the virtual bond lengths (Fig. 1 (a1)-(a3)), the bond angles (Fig. 1 (b1)-(b4)), and the dihedral angles (Fig. 1 (c1) and (c2)). The structural parameters that involve only the ssDNA backbone, such as the bond length P-S, the bond angle P-S-P, and the dihedral angle P-S-P-S, are calculated from all selected ssDNA structures. Because we mainly focus on polythymine, poly(T), the base-involved structural parameters, such as the bond length S-B and the bond angle P-S-B, are obtained from the thymine-containing structures only. In addition, the base-involved dihedral angles are not calculated, because their effect on the conformations of unstructured polythymine is weak [2]. The data are also deposited in the Mendeley Data database (Tables 2-4).
Notes to Table 1:
a N_c denotes the number of chains in a structural file.
b N_n denotes the total number of nucleotides in a structural file.
c The type "mixed" denotes an ssDNA composed of different nucleotide types.

Experimental Design, Materials and Methods
The statistical analysis of the structural information for the ssDNAs involves three main steps: (1) selecting appropriate all-atom ssDNA structures; (2) coarse-graining the selected structures; and (3) calculating the CG structural parameters. The details are described as follows.
Selection of the all-atom ssDNA structures. The experimentally measured all-atom ssDNA structures are downloaded from the Protein Data Bank [6]. The structures are selected according to the following criteria: (1) the ssDNA is bound to proteins, which avoids the formation of helical structures [7]; and (2) there are at least 8 consecutive unpaired nucleotides. A total of 72 ssDNA structure files in PDB format (see the PDBids in Table 1) are used. We then use the visualization tool UCSF Chimera [4] to delete unnecessary molecules (such as proteins and waters), ions, and the hydrogen atoms of the ssDNAs, and use the software package X3DNA [5] to label the unpaired nucleotides.
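The second selection criterion can be illustrated with a short sketch. It assumes that the pairing state of each nucleotide in a chain is already available as a list of booleans (for example, parsed from the X3DNA output; that parsing is not shown here). The function name `unpaired_runs` and its arguments are hypothetical and are not part of X3DNA or of the original workflow.

```python
def unpaired_runs(paired_flags, min_len=8):
    """Return (start, end) index pairs, end exclusive, for every run of
    consecutive unpaired nucleotides that is at least min_len long.

    paired_flags[i] is True if nucleotide i is base-paired (assumed input).
    """
    runs = []
    start = None  # index where the current unpaired run began, if any
    for i, paired in enumerate(paired_flags):
        if not paired:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(paired_flags) - start >= min_len:
        runs.append((start, len(paired_flags)))
    return runs


# Example: two paired nucleotides, eight unpaired, one paired, three unpaired.
flags = [True] * 2 + [False] * 8 + [True] + [False] * 3
segments = unpaired_runs(flags)
```

Under these assumptions, a chain would pass the criterion when `unpaired_runs` returns at least one segment.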
Fig. 1 (caption, partially recovered): [...] and B_i(T)-S_i-P_{i+1}, respectively. (c1) and (c2) show the normalized probability distributions of the dihedral angles S_{i-1}-P_i-S_i-P_{i+1} and P_i-S_i-P_{i+1}-S_{i+1}, respectively. Here the subscript represents the nucleotide index.
CG structures of the ssDNAs. We calculate the centers of mass of the phosphate (P), sugar (S), and base (B) atomic groups in each unpaired nucleotide. In particular, for nucleotide i, the phosphate group includes the phosphorus atom and the oxygen atoms directly bonded to it (one of these, named O3' in PDB files, is listed under nucleotide i-1), the sugar group includes the sugar ring and the atom named C5', and the base group includes all other atoms in the nucleotide except O3', which belongs to the phosphate of nucleotide i+1. In the CG structures, the atomic groups are represented by three types of pseudo-atoms, P, S, and B, located at the corresponding centers of mass. The pseudo-atoms are assumed to be connected by virtual bonds.
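The grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the atom-name sets follow standard PDB nucleotide naming (OP1, OP2, O5' for the phosphate oxygens; C1'-C4', O4' for the sugar ring), hydrogens are assumed to have been removed already, and the atomic masses are standard values. The names `group_of` and `center_of_mass` are hypothetical.

```python
# Standard atomic masses in g/mol (assumed; hydrogens are already deleted).
MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "P": 30.974}

# Atom names listed under nucleotide i, following standard PDB naming (assumed).
PHOSPHATE = {"P", "OP1", "OP2", "O5'"}
SUGAR = {"C1'", "C2'", "C3'", "C4'", "O4'", "C5'"}


def group_of(atom_name):
    """Assign an atom of nucleotide i to a CG group.

    O3' is listed under nucleotide i but belongs to the phosphate of
    nucleotide i+1, so it is labelled "P_next". Everything that is not
    phosphate or sugar is treated as base.
    """
    if atom_name == "O3'":
        return "P_next"
    if atom_name in PHOSPHATE:
        return "P"
    if atom_name in SUGAR:
        return "S"
    return "B"


def center_of_mass(atoms):
    """Mass-weighted mean position of atoms given as (element, (x, y, z))."""
    total = sum(MASS[element] for element, _ in atoms)
    return tuple(
        sum(MASS[element] * xyz[k] for element, xyz in atoms) / total
        for k in range(3)
    )


# Example: two carbon atoms 2 Å apart have their center of mass at the midpoint.
com = center_of_mass([("C", (0.0, 0.0, 0.0)), ("C", (2.0, 0.0, 0.0))])
```

To build the P bead of nucleotide i in this scheme, the atoms labelled "P" in nucleotide i would be combined with the atom labelled "P_next" in nucleotide i-1 before calling `center_of_mass`.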
Calculation of the structural parameters. For all CG ssDNA structures, we calculate the virtual bond lengths between two neighbouring pseudo-atoms (P-S, S-P, and S-B(T)), the bond angles between two adjacent virtual bonds (P-S-P, S-P-S, P-S-B(T), and B(T)-S-P), and the backbone dihedral angles formed by three successive bonds (P-S-P-S and S-P-S-P). From these values, the normalized probability distributions of the corresponding structural parameters are then obtained statistically (see Fig. 1).
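The geometric quantities named above can be computed from bead coordinates as in the following sketch. It makes several assumptions that the text does not state: angles are reported in degrees, the dihedral sign follows the common atan2-based convention, and "normalized" means that the bin probabilities sum to one. All function names are hypothetical.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bond_length(a, b):
    """Distance between two pseudo-atoms."""
    return norm(sub(a, b))


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the vertex."""
    u, v = sub(a, b), sub(c, b)
    cos_angle = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in degrees, in the range (-180, 180]."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    y = norm(b2) * dot(b1, cross(b2, b3))
    x = dot(cross(b1, b2), cross(b2, b3))
    return math.degrees(math.atan2(y, x))


def normalized_histogram(values, lo, hi, n_bins):
    """Bin values on [lo, hi] and return probabilities that sum to one."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Values equal to hi fall into the last bin.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Example: four points forming a right-angle bend and a 90-degree twist.
p1, p2, p3, p4 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
angle = bond_angle(p1, p2, p3)
torsion = dihedral(p1, p2, p3, p4)
probs = normalized_histogram([0.5, 1.5, 1.6, 3.9], 0.0, 4.0, 4)
```

If a probability density is wanted instead of per-bin probabilities, each value returned by `normalized_histogram` could be divided by the bin width.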

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Normalized probabilities of structural parameters for CG ssDNAs (Original data) (Mendeley Data).