Identification of GCC-box and TCC-box motifs in the promoters of differentially expressed genes in rice (Oryza sativa L.): Experimental and computational approaches

The transcription factor selectively binds with the cis-regulatory elements of the promoter and regulates the differential expression of genes. In this study, we aimed to identify and validate the presence of GCC-box and TCC-box motifs in the promoters of upregulated differentially expressed genes (UR-DEGs) and downregulated differentially expressed genes (DR-DEGs) under anoxia using molecular beacon probe (MBP) based real-time PCR. The GCC-box motif was detected in UR-DEGs (DnaJ and 60S ribosomal protein L7 genes), whereas, the TCC-box was detected in DR-DEGs (DnaK and CPuORF11 genes). In addition, the mechanism of interaction of AP2/EREBP family transcription factor (LOC_Os03g22170) with GCC-box promoter motif present in DnaJ gene (LOC_Os06g09560) and 60S ribosomal protein L7 gene (LOC_Os08g42920); and TCC-box promoter motif of DnaK gene (LOC_Os02g48110) and CPuORF11 gene (LOC_Os02g01240) were explored using molecular dynamics (MD) simulations analysis including binding free energy calculations, principal component analyses, and free energy landscapes. The binding free energy analysis revealed that AP2/EREBP model residues such as Arg68, Arg72, Arg83, Lys87, and Arg90 were commonly involved in the formation of hydrogen bonds with GCC and TCC-box promoter motifs, suggesting that these residues are critical for strong interaction. The movement of the entire protein bound to DNA was restricted, confirming the stability of the complex. This study provides comprehensive binding information and a more detailed view of the dynamic interaction between proteins and DNA.


Introduction
Standing crops face various stresses during their life cycle, which result in a drastic reduction in yield [1]. Although some crops withstand environmental stresses by developing new features, others are unable to develop adaptive mechanisms and consequently die. Importantly, rice has a lower tolerance and higher susceptibility to abiotic stresses than other crops [2,3]. In PLOS  interactions occurring between AP2/EREBP family transcription factor(LOC_Os03g22170) and GCC and TCC-box DNA motifs using molecular and essential dynamics based binding mechanics analysis.

Selection of DEGs and MBP design
Microarray data of DEGs in anoxic rice coleoptiles [6] and a dataset of Kumar et al [14] were used to shortlist UR-DEGs and DR-DEGs for analysis in this study. The UR-DEGs and DR-DEGs were ranked based on their expression score �2 fold (�2X) and �-2 fold � -2X), respectively. The promoter sequences -499 to +100 bp of the selected UR-DEGs and DR-DEGs were retrieved from the Eukaryotic Promoter Database as described previously [14]. The retrieved promoter regions were analyzed using the MEME (Multiple Em for Motif Elicitation) web server (http://meme-suite.org/tools/meme). Furthermore, the consensus promoter motif of UR-DEGs and DR-DEGs were used to design MBPs using Beacon Designer 7 (BD7, PREMIER Biosoft, USA). Custom made MBPs and primers were procured from Gene Link, (New York, USA). The methodology used for rice genomic DNA isolation and the validation of the consensus promoter motif is described in our previous work. It is well established that the AP2/EREBP transcription factor (TF) DNA-binding domain (DBD) binds to GCC-box [12,13,15,26]. The AP2/EREBP TF model from rice was generated using SWISS-MODEL web server [27] and the structure quality was assessed using PROCHECK [28] based on the Ramachandran plot. A three dimensional (3D) structural model of the DNA motif was generated using 3D-DART (3DNA-Driven DNA Analysis and Rebuilding Tool) [29]. Five 3D DNA models of GCC-(CGCCGCCGCCG) and TCC-box motifs (CTCCTCCTCCTCCTC) were generated with a bend angle of 0-40˚. 3D-DART enables the generation of DNA models based on customized local and global conformations, such as the bend angle range and bend angle orientation range.

High ambiguity driven protein-DNA docking
For the protein-DNA interaction study, DNA models of gene promoter motifs (GCC-and TCC-box) were docked onto the specific site of the AP2/EREBP TF using the HADDOCK (High Ambiguity Driven protein-protein Docking) web server (version 2.2) [15,26,30]. Residues 68, 69, 71, 73, 75, 77, 82, 83, 90, 92, 94, 95, 108, 109, and 110 were considered as active site residues for the protein, and 1-50 base pair (bp) nucleotides from both DNA stands were selected as active residues for the DNA motif. Passive residues were spontaneously defined around active residues. In reference to active and passive residues, Ambiguous Interaction Restraints (AIR) was generated. Here, illustration and visualization of the final docked complex were completed using UCSF Chimera [31].

Molecular dynamics simulations for the protein and docked complexes
To study the dynamics and recognition mechanism between AP2/EREBP TF and DNA motifs, the generated complexes were subjected to MD simulations using the GROMACS 5.0 software package [32,33]. OPLS-AA/L all-atom force field and AMBER99SB-ILDN force field were applied to AP2/EREBP TF and protein-DNA complexes simulations, respectively [34]. Furthermore, systems were solvated in a minimal cubic water box using the Simple Point Charge (SPC) water model [35]. Solvated systems carry a charge; therefore, ions were added to neutralize the entire system by substituting water molecules with ions. The systems were energy minimized (50000 cycles of steepest descent) to remove steric clashes and inappropriate geometry. The minimized systems were equilibrated (the solvent and ions around the protein needed to be equilibrated) into NVT (constant Number of particles, Volume, and Temperature) and NPT (constant Number of particles, Pressure, and Temperature) phases for 1000 ps [25,[36][37][38]. The well-equilibrated systems were then subjected to a production run at 300 K and 100000-pascal pressure for 50,000 ps. The analyses of the 50 ns MD trajectories were carried out using GROMACS built-in tools. The various interactions involved in the pre-and post-MD of protein-DNA simulated complexes were deduced using Nucplot [39]. The stability of the complex was calculated by measuring the RMSD (root mean square deviation) of the protein backbone atoms' positions with respect to the start or reference structure using the following equation: where M=S i m i and r i (t) is the position of atom i at time t after least square fitting the structure to the reference structure. The RMSF (root mean square fluctuations) was calculated using the following equation: where T is the time over which one wants to average and r i ref is the reference position of particle i.

Binding free energy and free energy decomposition analysis
The package g_mmpbsa calculates the binding energy of bimolecular associations such as protein-protein, protein-ligand, and protein-DNA associations using the Molecular Mechanics Poisson-Boltzmann Surface Area (MM-PBSA) protocol [40]. It provides the different components of energy terms such as polar solvation, non-polar solvation, and electrostatic energy. The MmPbSaDecomp.py python script was used to determine the residue-wise contribution to the total binding energy, which provides information about important residues contributing to the molecular association.

Principal component analysis (PCA) and free energy analysis
Principal component analysis (PCA) is widely used to gain insights into the adequate structural and dynamics of the protein and complex trajectories [41]. PCA is a multivariate statistical analysis used to extract covariant motions on a number of different lengths and time scales from a protein structure. The covariance matrix of the atomic fluctuations was calculated using the gmx-covar module of gromacs software and calculated using the following equation: In which, C implies 3n x 3n symmetric matrix, n is a number of residues and M is a diagonal matrix [42]. Diagonalization of this matrix yields a set of eigenvectors and eigenvalues that describe collective modes of fluctuations of the protein. The eigenvectors corresponding to the largest eigenvalues are called "principal components", as they represent the largest-amplitude collective motions. The eigenvectors were analyzed using the gmx-anaeig gromacs built-in command. The gmx-sham tool was used to generate the input for free energy landscapes using the axes of a principal component analysis.

GCC-box and TCC-box detection and validation
Under anoxia UR-DEGs with expression by equal or higher than two-fold (�2X) and expression by equal or lower than -2 fold (� -2X) for DR-DEGs were selected from the microarray results [6] and the aforementioned datasets [14]. The selected UR-DEGs and DR-DEGs were analyzed using MEME (v 4.5.0) to identify consensus promoter motifs (GCC-box and TCCbox). We identified the presence of GCC-box and TCC-box motifs in the promoter region of UR-DEGs (DnaJ and 60S ribosomal protein L7) and DR-DEGs (DnaK and CPuORF11), respectively. The GCC-box motif was acknowledged in the DnaJ (EP01201) and 60S ribosomal protein L7 (EP02799) genes with the lowest p-value of 6.28e -07 indicates the most significant match score of the given motifs ( Fig 1A). Similarly, TCC-box motifs were identified in DnaK (EP03077) and CPuORF11(EP01079) genes with the lowest p-value 7.37e -10 ( Fig 1B).
Gene expression studies revealed the upregulation of genes encoding transcription factors under hypoxic response in Arabidopsis [5]. However, the regulation of gene expression occurs through the core promoter motif sequence [43]. Promoter motifs contain specific nucleotide sequences that are responsible for gene regulation and function under different biotic and abiotic conditions. Hence, the identification and validation of these regulatory elements are essential. Expression analysis of the 60S ribosomal protein L7 has been used as an internal control for gene expression studies in Coffea arabica under different experimental conditions [44]. DnaJ, which contains a J domain of 70 amino acid consensus sequence, is a co-chaperone of Hsp70 (DnaK) and facilitates Hsp70's ATPase activity, substrate delivery, and specific cellular localization [45]. In Arabidopsis and rice, J proteins have been implicated in the protection against environmental stresses [46]. DnaK family proteins also include heat shock proteins that are involved in protecting plants against abiotic stresses [47]. CPuORF11, which has an ORF found in the 5' UTR of a mature mRNA, mediate translational regulation in response to sucrose concentration, amino acid production, starvation and polyamine concentration. However, it's mechanism of action is not clearly raised in Arabidopsis and Rice [48][49][50]. A sequence of GCC-box and TCC-box repeats was used to design a molecular beacon probe. Forward and reverse primers were designed (Table 1) [15,26]. In the present study, two UR-DEGs (DnaJ and 60S ribosomal protein L7) and two DR-DEGs (DnaK and CPuORF11) were validated by experimental and computational studies. The presence of GCC-and TCC-boxes in selected genes was verified by real-time PCR assays. We have taken promoter region belongs to TSS (transcription start site) of the selected gene considering promoter position from -499 to +100 i.e., 600 nt and the same region has been used for motif detection by MBP. In DnaJ gene promoter is in upstream position i.e., -62 to -52 and in 60S ribosomal protein L7 gene promoter, GCC box position is in downstream i.e., from + 30 to + 40 (Table 2). Similarly, in DnaK gene promoter, TCC box position is in upstream -18 to -4 and in CPuORF11 genes promoter TCC box position is in upstream -58 to -44 ( Table 2). Amplification of GCC-box sequences was confirmed by MBP, with average Ct values of 34.21 and 31.65 for DnaJ and 60S ribosomal protein L7, respectively ( Table 2). Similarly, TCC-box containing genes were amplified by MBP, with average Ct values of 27.79 and 28.5 for DnaK and CPuORF11, respectively (Table 2).
In rice, the Submergence1 (Sub1) locus encodes three ethylene-responsive factor (ERF), transcriptional regulators. It has been described that a large member of the ERF family interacts specifically with AGCCGCC through their conserved domain [51]. Direct interaction of GCC-boxes and non-GCC-boxes with Tomato transcription factor Pti4 (an ERF) revealed the involvement of ERFs in gene regulation and expression [52]. The binding of maltose binding protein (AtERF) to the GCC sequence (AGCCGCC) in Arabidopsis was hampered when both G residues within the GCC-box were replaced by T (ATCCTCC) [8,53]. Several reports based on the gene ontology classification and differential expression of DnaJ, 60S ribosomal protein L7, DnaK and CPuORF11 genes in diverse species suggest that these genes are involved in cellular, biological, and molecular functions in the plant. In our previous work, MBP based real-time PCR analysis indicated that UR-DEGs and DR-DEGs under anoxic conditions that contained a GCCbox and TCC-box in their promoter region bound AP2/EREBP TF in rice [15]. Hence, validation of the in silico findings of GCC-box and TCC-box promoter motifs in the UR-DEGs (DnaJ and 60S ribosomal protein L7) and DR-DEGs (DnaK and CPuORF11) in O. sativa is essential.

Protein and DNA motif modeling
BLASTP was performed for AP2/EREBP TF sequences (LOC_Os03g22170) against the PDB database. Blast hits showed a 71% sequence identity with an E value of 2e -21 to the recently solved crystal structure of AtERF96 containing a GCC-box (resolution: 1.76Å) from Arabidopsis thaliana (PDB ID: 5wx9; chain A), which was selected as a template for the construction of the AP2/EREBP TF model (Fig 2A). Analysis of the stereochemical quality of individual residues in the protein was carried out using Ramachandran Plot. In the generated model, the percentage of residues in the most favored regions and additional allowed regions was 89.7% and 6.9%, respectively. According  Structural insights into DNA motif and their interaction with AP2 protein in rice to the plot, 3.4% of the residues were located in the disallowed region. Analysis of the secondary structure of the AP2/EREBP TF revealed that it consists of one β-sheet, three β-strands, one α-helix, five β-turns, and one gamma (γ) turn (S1A Fig).

Analysis of molecular dynamics (MD) simulations of the AP2/EREBP TF
Structural refinement was carried out using molecular dynamics (MD) simulations with solvents and ions. Superimposition of pre-and post-MD simulated AP2/EREBP TF revealed a backbone RMSD deviation of 1.17Å (Fig 2A). The AP2/EREBP TF attains equilibrium after 10 ns and sustains the stability until the end of the simulation time period with an average RMSD of 0.59 nm (Fig 2B). RMSF showed a peak for individual residue, and two regions of the protein showed the highest fluctuation; 83-90 and 110-120 residues, whereas the rest of the structure remained stable with an average RMSF value of approximately 0.17 nm (Fig 2C). The radius of gyration of the protein backbone atoms was 1.26 nm, which contributed to the compactness of the protein. The representative structure was extracted from the stable time frame and used for the protein-DNA docking analysis. The simulated structure was analyzed using Ramachandran Plot, which revealed that residues found in the additional allowed regions had increased to 15.5% whereas, residues found in the disallowed region reduced to 1.7%, suggesting that the MD simulations increased the stability of the protein structure [54]. No difference in the secondary structure elements was observed in the pre-and post MD simulated AP2/EREBP TF structures (S1B Fig).

Protein-DNA interaction and stability analysis
To predict which amino acids interact with DNA, the representative structure of the AP2/ EREBP TF was docked with a GCC-box and TCC-box using HADDOCK.  Table 3). The IHSAPDTM-BS complex was stabilized by the formation of five hydrogen bonds (H-bonds) (Arg68, Arg73, Lys77, Lys87, and Thr95) and six hydrophobic interactions (Table 4 and Fig 3A). Similarly, four bonds (Arg64, Arg73, Lys77, and Arg83) and an extensive network of seven hydrophobic interactions reinforced the IRPAPDTM-BS complex stability (Table 4 and Fig 3B). It was evident from the HADDOCK results that DNA bends at 40˚in both IHSAPDTM-BS and IRPAPDTM-BS complexes (GCCbox) had a strong affinity for the AP2/EREBP TF. The highest HADDOCK score for IDNAPDTM-BS and IOFAPBTM-BS (TCC-box) complexes were found to be -144.2 ± 2.8 and -147.5 ± 7.3, respectively ( Table 3). The number of  hydrogen bonds and hydrophobic interactions in IOFAPBTM-BS and IDNAPDTM-BS complexes were nine (Arg64, Arg68, Gly69, Arg71, Arg72, Arg83, Arg90, Lys117, and Lys119) and three, and eight (Glu62, Arg63, Arg64, Arg68, Arg71, Arg73, Thr106, and Lys119) and five, respectively (Table 4 and Fig 3C and 3D). The cluster size and Z-score for the selected clusters were 33 and -1.6 for IDNAPDTM-BS, and 39 and -2.0 for IOFAPBTM-BS, respectively. DNA bends at 40˚and 20˚in IDNAPDTM-BS and IOFAPBTM-BS complexes had strong binding affinities. The HADDOCK results were selected for further MD simulations. Therefore, the conformation adopted by DNA play a very significant role in specific interaction between AP2/EREBP TF and DNA [55].

Conformational and interaction analysis of the docked complexes after MD simulations
To examine the dynamics and to gain specific interaction information, the protein-DNA complexes were subjected to 50 ns MD simulations. IHSAPDTM-BS and IRPAPDTM-BS attained a final conformation with a backbone RMSD of approximately 0.53 nm and 0.37 nm, respectively ( Fig 4A). In addition, IDNAPDTM-BS and IOFAPBTM-BS showed an average deviation from the initial structure of 0.36 nm and 0.43 nm, respectively (Fig 4A). RMSD value for the backbone atoms less than 1.0nm suggested stability of the complex structures [56]. Furthermore, the structural deviations of the DNA-bound complexes were analysed at regular time intervals across the simulation trajectory (S1 Table). The RMSF value of key residues stabilizing the IHSAPDTM-BS (Arg68, Arg73, Lys87, Arg90, and Thr95) and IRPAPDTM-BS (Arg71, Arg72, Arg73, Trp75, Arg83, Lys87, and Thr95) complexes varied from 0.08 to 0.25 nm, respectively (Fig 4B). The RMSF value for the interacting residues in IOFAPBTM-BS (Glu62, Arg63, Arg64, Arg72, Arg83, Lys87, and Arg90) and IDNAPDTM-BS (Glu62, Arg63, Arg64, Thr65, Arg68, Arg71, Thr106, and Lys117) ranged from 0.07 to 0.34 nm, respectively (Fig 4B). Moreover, the radius of gyration and the hydrogen bond analysis for all four complexes indicated the compactness and stability of the complexes (Fig 4C and 4D). MD analysis results indicated that all four complexes underwent minor conformational changes during the simulation time period. The representative docked complexes were extracted from the stable time frame for the identification of key interacting residues.
A comparative interaction analysis was carried out for all protein-DNA complexes. The total number of hydrogen bonds remained unchanged in pre-and post-MD simulated IHSAPDTM-BS and increased from four to seven in IRPAPDTM-BS complexes (Table 4). However, in the IOFAPBTM-BS, the number of hydrogen bonds decreased from nine to seven but remained constant for the IDNAPDTM-BS complex (Table 4). In subsequent MD simulations, the number of hydrophobic interactions reduced drastically in all complexes (IHSAPDTM-BS, IRPAPDTM-BS, and IOFAPBTM-BS) except IDNAPDTM-BS (Table 4). Most of the interacting residues in the pre-simulated complex were conserved in the post-simulated structures, suggesting that they play a crucial role in the formation of AP2/EREBP TF -DNA complex. Structural insights into DNA motif and their interaction with AP2 protein in rice

Conformation analysis of the complexes
To study the conformational variation during MD simulations, we extracted snapshots of each complex at 10 ns intervals (0ns, 10 ns, 20 ns, 30 ns, 40 ns, and 50 ns) and analyzed these for the IHSAPDTM-BS, IRPAPDTM-BS, IDNAPDTM-BS, and IOFAPBTM-BS complexes (S3 and S4 Figs). The analysis revealed that the amino acid residues involved in the formation of hydrogen bonds (H-bond) with the DNA remained stable and consistent after 10 ns (S2 Table). Thus, the overall MD simulation trajectory analysis along with the comparative interaction analysis at regular time intervals, indicated that there was a fairly stable interaction between the AP2/EREBP TF and DNA motif through H-bonding and hydrophobic interactions [57].

Binding free energy analysis
Calculation of protein-DNA binding free energy is a very vast field of research and computational techniques. MM-PBSA method uses the last 5 ns (45-50 ns) of MD simulation trajectories to calculate the binding free energy components, including van der Waal energy, electrostatic energy, polar and non-polar energies and their contribution towards protein-DNA complex stability. The total binding free energy for the IHSAPDTM-BS, IRPAPDTM-BS, IDNAPDTM-BS, and IOFAPBTM-BS complexes were computed to be   Table 5). The effect of each residue to the binding energy was computed and showed that the contribution of most of the common interacting residues (Arg68, Arg72, Arg83, Lys87, and Arg90) was observed to be very similar in DNA-bound complexes, suggesting a significant role for these residues in complex stabilization (Fig 5A-5D). Highest contributions were made by Structural insights into DNA motif and their interaction with AP2 protein in rice electrostatic energy, followed by polar energy. The high binding energy profile was in agreement with the interaction profile of each DNA-bound complex.

Analysis of conformational fluctuation in AP2/EREBP TF and DNAbound complexes
The development of multivariate methods, such as PCA, promises to enrich the analysis of MD data and to reveal quantitative insights into the relationships between structure, dynamics, and function. Covariance provides information about the cooperativity of motion and can be positive or negative, however, the trace is the sum of the leading diagonal, therefore, and the trace is the sum of the individual variances [58]. The trace value for the AP2/EREBP TF, IHSAPDTM-BS, IRPAPDTM-BS, IDNAPDTM-BS, and IOFAPBTM-BS was 7.6 nm 2 , 8.2 nm 2 , 4.5 nm 2 , 6.3 nm 2 , and 6.2 nm 2 , respectively; the small trace values corresponded to Structural insights into DNA motif and their interaction with AP2 protein in rice positive covariance and confirmed the decrease in flexibility in the collective motion of the protein, thus revealing a higher stability (Fig 6). The covariance matrix was used to generate the eigenvector and its corresponding eigenvalues for the AP2/EREBP TF and DNA-bound complexes (S5 Fig). The Gibbs free energy (ΔG) value ranged from 12.6 to 14.7 kJ/mol for DNA-bound complexes. The overall results indicated the stability of the AP2/EREBP TF and its DNA-bound complexes (Fig 7).

Conclusion
We successfully designed MBP and specific primers for UR-DEGs (DnaJ and 60S ribosomal protein L7) and DR-DEGs (DnaK and CPuORF11) and validated the presence of GCC-box and TCC-box promoter motifs. The molecular dynamics study of the protein-DNA complexes revealed a high binding affinity of the AP2/EREBP TF for GCC-and TCC-box motifs in selected genes. The GCC-box amino acid residues Arg68, Arg71, Arg72, Arg73, Trp75, Arg83, Lys87, Arg90 and Thr95, and the TCC-box amino acid residues Glu62, Arg63, Arg64, Thr65, Arg68, Arg71, Arg72, Arg83, Lys87, Arg90, Thr106, and Lys117 directly interacted with DNA. Consequently, these residues play an important role in the stabilization of the complex and the regulation of the differential expression of these genes in rice. Therefore, our results shed light on the underlying mechanism of GCC-box and TCC-box recognition by proteins.