Brief Bioinformatics characterization of Cotton bZIP transcription factors family from Gossypium hirsutum, Gossypium arboreum and Gossypium raimondii

Cotton is an important commodity in the world economy. In this study we have carried out genome-wide identication and bioinformatics characterization of basic leucine zipper domain proteins (bZIPs) from cultivated cotton species G. hirsutum along with two sub-genome species of allotetraploid cotton, G. arboreum and G. raimondii. A total of 228 bZIP genes of G. hirsutum, 91 bZIP genes of G. arboreum and 86 bZIP genes of G. raimondii were identied from CottonGen database. Cotton bZIP genes were annotated in standard pattern according to their match with Arabidopsis bZIPs. Multiple genes with similar bZIP designations were observed in cotton, linked to the gene duplication. Cotton bZIPs are distributed across all 13 chromosomes with varied density. Phylogenetic characterization of all three cotton species bZIPs classied them into 12 subfamilies, namely A B, C, D, E, F, G, H, I, J, K and S and further into eight subgroups according to their predicted functional similarities, viz., A1, A2, A3, C1, C2, S1, S2 and S3. Subfamily A and S are having maximum number of bZIP genes, subfamily B, H, J and K are single member families. Cotton bZIP protein functions were predicted from identied motifs and orthologs from varied species. BRLZ domain analysis of G. raimondii bZIPs revealed the presence of conserved basic region motif N-X7-R/K in almost all subfamily members, variants are GrbZIP62 with N-X7-I motif and GrbZIP76 with K-X7-R motif. Leucine heptad repeats motif, are also present in variant numbers from two to nine with leucine or other hydrophobic amino acid at designated position among 12 subfamily members. STRING protein interaction network analysis of G. raimondii bZIPs observed strong interaction between A-D, B-K and C-S subfamily members. at one platform will certainly help the researchers in the selection of specic cotton bZIP genes according to the close alignment with Arabidopsis orthologs or sub-genome homolog for functional characterization. proteins is involved in ower development, seed development; salicylic acid mediated signaling pathway and pathogen response. MEME motif search identied DOG1- seed dormancy control motif, tetratricopeptide protein-protein interaction motif in these family members. GossypiumbZIP20, 21, 22, 26, 45, 46, 47, 50, 57 and 65 are representing GossypiumbZIPs subfamily D.


Introduction
Cotton is a signi cant cash crop. Approximately 32.93 million hectare was under cotton cultivation around the globe in 2019, with the highest production of 5.77 million metric tons in India followed by US and China (http://ministryoftextiles.gov.in; https://www.statista.com). The textile industry is a signi cant contributor to the nation's economy and employment, the global textile market share was USD 961.5 billion in 2019 (https://www.grandviewresearch.com).
Functional genomics is a key approach for identifying genes for important traits like yield, biotic-abiotic stress tolerance and ber quality. Cultivated G. arboreum, A subgenome species, and its counterpart, non-spinnable ber producer G. raimondii, D subgenome species are two progenitor species of cultivated allotetraploid cotton, G.hirsutum (Li et al. 2014). Estimated genome size of G. hirsutum is ~ 2305.2Mb . Estimated genome size of G. arboreum is ~ 1746Mb which is around two fold larger than G. raimondii, ~ 880Mb (Hendrix et al. 2005). Genetic information related to TFs of the two important progenitor species will play major role in the improvement of cultivated allotetraploid cotton. Wang et al. 2012 identi ed 2,706 transcription factors in G. raimondii which includes 208 bHLH and 219 MYB class genes which were preferentially expressed in ber (Wang et al. 2012). Kushanov et al. 2016 developed CAPs and DCAPs markers for the PHYA1, PHYB and HY5 genes which are associated with the ber quality and owering time traits (Kushanov et al. 2016). Over-expression of G. hirsutum bZIP transcription factor, ABF2 (bZIP36) resulted in improved drought and salt tolerance in cotton and Arabidopsis (Liang et al. 2015).
TFs are the key regulators in plant development and stress adaptation. On the basis of conserved DNA binding domains, TFs are classi ed into different families among which bZIP-TFs play a major role in seed germination, ower development, and stress response. bZIP proteins speci cally bind to DNA with an ACGT core, A-box (TACGTA), C-box (GACGTC) and G-box (CACGTG) motifs and binding speci city is regulated by anking nucleotides (Jakoby et al. 2002). The bZIP domain carries two structural components located on a contiguous alpha-helix, the rst one is a ~ 16 amino acid residues containing nuclear localization signal, followed by a DNA binding domain invariant N-X7-R/K and the second one is present exactly nine amino acids towards the C-terminus which is a leucines (L) heptad repeat or other bulky hydrophobic amino acids creating an amphipathic helix, L-X6-L-X6-L (Jakoby et al. 2002). In Arabidopsis bZIPs, variation observed in leucine repeats, subfamily 'D' bZIPs contains three repeats whereas; more than eight repeats are present in subfamily C and S (Wolfgang et al. 2018).
So far, plant bZIP proteins are classi ed into 9 to 13 subfamilies in different plant species based on structural and functional characteristics. Marc Jakoby et al. 2002 rst classi ed Arabidopsis thaliana bZIP TFs into 10 subfamilies A to I and S. These groups were named with letters considering their family members, for example A for ABF/AREB/ABI5, B for big protein size, C for CPRF2-like, G for GBF, H for HY5, and S for small protein size (Jakoby et al. 2002 (Nijhawan et al. 2008).
If we look at the cotton bZIP protein characterization, Zhang et al. 2018 identi ed 159 bZIP genes from G. arboreum and classi ed into 13 groups (Zhang et al.2018). Azeem et al. 2020 identi ed 87 bZIP genes of G. arboreum and 85 bZIP genes of G. raimondii from NCBI and classi ed into 11 subfamilies. Recently Wang et al. 2020 identi ed 207 bZIP genes of allotetraploid cotton G.hirsutum and classi ed into 13 subfamilies, A to I, S, M, K and J, having major contribution from subfamily 'A' and 'S' bZIPs (Wang et al. 2020). The basic differences between our bioinformatics characterization of cotton bZIPs and from reported studies are, we have characterized cultivated cotton species and two sub genome species together at one platform whereas two independent studies by Azeem et al. 2020 andWang et al. 2020 characterized two sub genome species and G. hirsutum bZIP TFs separately. Importantly we have annotated bZIP TF according to standard pattern for clarity and to understand particular gene duplication within all three species.
We have listed the duplicated bZIP genes of individual species. We have classi ed cotton bZIP family in detail including subfamily and subgroups according to their predicted function. Phylogenetic analysis of bZIPs of G. hirsutum along with two sub genome species will be useful in the process of selection of close orthologs for further characterization.
Emphasizing their functional importance, bZIP protein family in plants play crucial roles in developmental and stress responses. Abscisic acid responsive element binding factors (ABFs) from bZIP subfamily 'A' are activated by ABA signaling and are involved in downstream gene regulation in conjunction with stomatal closure mediated adaptation in drought stress (Sirichandra et al. 2010;Yoshida et al.2015). Wolfgang and Christoph described bZIP family 'C' genes (bZIP9, bZIP10, bZIP25, bZIP63) and S1 genes (bZIP1, bZIP2, bZIP11, bZIP44, bZIP53) interaction network in response to nutritional starvation and metabolic adaptation under nutritional stress (Wolfgang et al. 2018). TGA or subfamily 'D', bZIP family proteins binds to the TGACG consensus sequences, to form homo and hetero-dimer. TGA family proteins are predominantly involved in systemic acquired resistance through SA mediated interaction with NPR1 (Fan et al. 2002;Fu et al. 2013). TGA family member PERIANTHIA (PAN/bZIP46) plays a role in ower development, interact with ROXY-type GRX (CC type glutaredoxin, GRX-ROXY1) gene to regulate petal development (Gatz 2013;Gutsche et al. 2017). Recently, Ullah et al. 2019 studied soybean TGA family genes and differentiated legumes-speci c TGAs structures, involvement in legumes-speci c biological processes like legumes-rhizobia symbiotic nodulation and predicted soybean bZIPs role in response to nitrogen.
The bZIP TF family from cotton yet to be fully understands and explored. Thus, this paper attempts phylogenetic analysis, structural characterization and functional role prediction of bZIPs from cultivated cotton species G. hirsutum along with two subgenome species G. arboreum and G. raimondii.

Material And Methods
A total of 228 bZIP genes of G. hirsutum, 91 bZIP genes of G. arboreum and 86 bZIP genes of G. raimondii were identi ed from CottonGen database using bZIP domain as a query as well as Arabidopsis bZIP protein sequence as a query or through keyword search to recent genome assembly of Gossypium hirsutum (AD1) 'TM-1' genome UTX_v2.1 (published article in may 2020), Gossypium raimondii JGI proteins and Gossypium arboreum BGI proteins (https://www.cottongen.org/). Arabidopsis-bZIP protein sequences were downloaded from TAIR (https://www.arabidopsis.org/). Cotton bZIP genes were annotated on the basis of E-value, bitscore value and percent identity resulted from Arabidopsis-bZIP match. All the data regarding gene ID, genomic position, exon number, CDS length, protein length, protein sequences and Arabidopsis match ID with E-value are recorded in supplementary le 1/, online resource1. Phylogenetic analysis of 227 bZIP genes of G. hirsutum, 56 bZIP genes of G. arboreum and 57 bZIP genes of G. raimondii along with 73 bZIP genes of Arabidopsis was performed. Duplicated genes from G. arboreum and G. raimondii were excluded from phylogenetic analysis and selected rst representative gene. For example GabZIP14 is having two entries GabZIP14-1 and GabZIP14-2 so GabZIP14-1 is selected for analysis and designated as GabZIP14. Phylogenetic tree was constructed by maximum likelihood phylogeny method, Dayoff (PAM) protein substitution model and performing 100 bootstrap, using MEGA X (https://www.megasoftware.net/). MEME (http://meme-suite.org) motif search analysis was done for all identi ed bZIP genes, 228 bZIP genes of G. hirsutum, 91 bZIP genes of G. arboreum and 86 bZIP genes of G. raimondii, selecting 10 motifs identi cation option and any number of repetition in case of distribution. Identi ed motifs were evaluated using SMART (http://smart.emblheidelberg.de/), Pfam (https://pfam.xfam.org/search/sequence), Interpro (https://www.ebi.ac.uk/interpro/) and TOMTOM. MAST le of all searches were submitted along with supplementary documents.
To understand the interaction network of G. raimondii bZIP proteins, STRING (https://string-db.org/) analysis was performed. Network was interpreted for strong protein interactions, sharing functions among themselves with highest edge con dence (0.900), which can be correlated by thickness of the edges.

Results
G. hirsutum, G. arboreum and G. raimondii bZIP protein classi cation Phylogenetic analysis of 227 bZIP genes of G. hirsutum, 56 bZIP genes of G. arboreum and 57 bZIP genes of G. raimondii along with 73 bZIP genes of Arabidopsis was performed. The phylogenetic analysis indicates that G. hirsutum, G. arboreum and G. raimondii bZIPs are closely related to Arabidopsis bZIPs and are conserved among all three genomes. Gossypium bZIP gene duplication happened before allotetraploidation, as along with G. hirsutum, bZIP gene duplication observed in sub-genome species, G. aroboreum and G. raimondii. So it is possible that during sub-genome speciation from common ancestor gene duplication event might have happened in cotton. Figure 1 shows phylogenetic tree construction. Bootstrap percentage values are mentioned on each branch. Analyzed G. hirsutum, G. arboreum and G. raimondii bZIP genes are classi ed into 12 subfamilies A B, C, D, E, F, G, H, I, J, K and S and 8 subgroups A1, A2, A3, C1, C2, S1, S2 and S3. Subgroup classi cation was done according to clade separation within subfamily and predicated functional similarities. Further this classi cation is supported by Arabidopsis-bZIPs alignment, G. raimondii BRLZ domain alignment, exon-intron numbers, protein length and conserved motif identi cation ( Fig. 1 and Table 1, 2, 3). For the ease of understanding, in the following elaborated subfamily wise description bZIP proteins from G. hirsutum, G. arboreum and G. raimondii collectively named as Gossypium bZIPs.
Subfamily B contain endoplasmic reticulum (ER) stress response transcription factor bZIP17, involved in salt and osmotic stress resistance in Arabidopsis (Liu et al.2008) and observed interacted with subfamily K member bZIP60 in STRING analysis, which is also involved in ER stress response.
Subfamily C, subgroup C1 contains GossypiumbZIP9, subgroup C2 contains GossypiumbZIP10, 25 and 63, reported to form hetero dimer with S1 subfamily bZIPs, are known to involved in anther development, positive regulation of seed maturation and response to starvation Wolfgang et al. 2018).
Subfamily D of bZIP proteins is involved in ower development, seed development; salicylic acid mediated signaling pathway and pathogen response. MEME motif search identi ed DOG1-seed dormancy control motif, tetratricopeptide protein-protein interaction motif in these family members. GossypiumbZIP20,21,22,26,45,46,47,50,57  GossypiumbZIP34, 61 and 76. In Arabidopsis thaliana researchers discovered that a conserved proline residue in the third heptad region of leucine zipper of AtbZIP34 and AtbZIP61 interferes with the formation of homo-dimer whereas change of proline by an alanine in the above mentioned region can form homo dimer and bind to the G-box element (Shen et al.2007). A conserved proline residue which interferes homodimer formation is also found in the third heptad region of leucine zipper of GossypiumbZIP34 and 61 genes except in GrbZIP61-1 which is carrying serine instead of proline. GrbZIP61-1 can be further explored to con rm homodimer formation and G-box binding activity restoration due to the presence of serine instead of proline, which may be useful to understand GrbZIP61-1 functional involvement in stress response and in systems related to circadian clock, light and temperature, refer supplementary le 1/ Online resource 1 for alignment (Ezer et al. 2017;Wolfgang et al. 2018). It is indicated that bZIP34 and bZIP61 is involved in the hetero-dimer formation with I and S subfamily members of bZIPs which are function in vascular development (Shen et al.2007).
Subfamily F which is well known to be involved in adaptation to zinc de ciency in Arabidopsis, wheat and barley contains GossypiumbZIP19, 23 and 24 (Assuncao et al. 2010;Inaba et al. 2015;Nazri et al. 2017;Evens et al. 2017).
Subfamily G plays role in binding to G box motif of genes regulated by hormones and light contains GossypiumbZIP16, 41 and bZIP55.
Subfamily H contains HY like transcription factor GossypiumbZIP56, HY5 which is involved in PhyB signaling pathway and also associated with ber quality in cotton. HY5 gene is a positive regulator of photo morphogenesis (Wolfgang et al. 2018). Kushnov et al.2016 has done comparative sequence analysis of three close relative of ber quality genes, PHYA1, PHYB, HY5 of G. hirsutum and G. barbadense, developed dCAPS markers which can be utilized in markerassisted selection breeding for introgression of these genes into the either of the two alloteraploid cotton species.

Conserved motif analysis of G. hirsutum, G. arboreum and G. raimondii bZIP proteins
Signature bZIP domain is con rmed in all identi ed bZIP proteins of G. hirsutum, G. arboreum and G. raimondii. MEME motif analysis deciphering the presence Abscisic acid insensitive 5-Like protein 5-related motif in G. hirsutum subfamily 'A' and 'J" members. Sterile alpha motif (SAM)/Pointed domain which is involved in protein-protein interaction is identi ed in subfamily 'A' bZIPs of all three species. G. raimondii subfamily 'A' bZIPs are carrying MYND-Zinc binding domain which is involved in protein -protein interaction in the transcriptional regulation context. G. arboreum Subfamily 'A' bZIPs, GabZIP18-3 and GabZIP11-3 genes are carrying GluR7, glutamate receptor domain from GluR proteins known to be function in light signal transduction and calcium homeostasis.
The presence of two DOG1-seed dormancy control motif and TGA2 like motif is identi ed in all three gossypium species 'D' subfamily bZIP proteins. Exclusively G. raimondii subfamily 'D' bZIPs are carrying the RPA interacting motif, which is conserved in eukaryotic DNA repair, replication proteins and presence of tetratricopeptide repeats motif which are involved in protein-protein interactions. Further presence of tetratricopeptide repeats and other protein protein interaction domains in subfamily D should characterize, as TGA transcription factor family members, are involved in biotic stress tolerance through interaction with NPR1 and similarly STRING analysis also identi ed D subfamily protein interaction with A subfamily proteins. A typical example of identi ed motifs in G. raimondii bZIP proteins has shown in Fig. 2 and some motifs are listed in Table 3. BRLZ domain structural characterization of G. raimondii bZIP proteins N-X7-R/K basic region The conserved basic region N-X7-R/K variant is present in almost all analyzed G. raimondii bZIP proteins, variation is observed in subfamily E and J. Subfamily E-bZIP76 is having K-X7-R motif, subfamily J-bZIP62 is having N-X7-I motif.

Leucine heptad repeats variation in G. raimondii bZIPs
Variation in presence of number of leucine heptad repeats motif is observed in G. raimondii bZIP proteins. Figure 3 deciphering the detail structure of leucine heptad repeats of G. raimondii bZIP proteins, for GrbZIP69 and 76 refer supplementary le 1. First two leucine heptad repeats are conserved among all G. raimondii bZIP proteins with exception of GrbZIP17, GrbZIP61, GrbZIP76 and S subfamily proteins GrbZIP 2, 4, 11, and 44 which are carrying either methionine, isoleucine, asparagine or valine at L0, L1 or L2 position (L0 to L9 denotes leucine position in leucine heptad repeats L-X6-L-X6-L motif).
Subfamily-wise leucine heptad repeats are described as follows Subfamily A bZIP proteins are carrying two to four heptad repeats of leucine or presence of other hydrophobic amino acids at leucine position in last heptad.
Subfamily B, C, E, F and S bZIP proteins are carrying seven to nine heptad sequences with leucine repeats or other different hydrophobic or polar amino acids at leucine position; alanine observed predominantly at leucine position in subfamily C and valine is predominantly present in subfamily S proteins at leucine position.
Subfamily D, G, I and J bZIP proteins are carrying three to six leucine heptad sequences or other hydrophobic amino acids are present at leucine position with predominance of glycine in D subfamily and methionine in I subfamily members.
Pure leucine heptad repeats are present in subfamily H member GrbZIP56 and subfamily K member GrbZIP60, with presence of 4 and 2 heptad repeats of leucine respectively. STRING protein interaction analysis of G. raimondii bZIPs G. raimondii bZIP proteins interaction analysis is performed using STRING, protein-protein association network software, Fig. 4 deciphering the observed network (https://string-db.org/). Very strong interaction observed among subfamily A bZIPs, GrbZIP35 (ABF1/AREB1), GrbZIP39 (ABI5), GrbZIP65, GrbZIP66 (AREB3) and GrbZIP67 (DPBF2) genes, which are known for abscisic acid inducible stress regulation, seed germination and seedling development in plants (Lindemose et al. 2013;Skubacz et al. 2016). Observed interaction between bZIP subfamily A and D proteins of G.raimondii are predicted to be involved in signal transduction pathways, biotic and abiotic stress tolerance. Another strong interaction observed among GrbZIP17 and GrbZIP60, two endoplasmic reticulum or unfolded protein response modulators known to get activated under environmental stresses (Humbert et al.2012;Howell, 2013).
Interaction of GrbZIP35/ABF1 and GrbZIP55/GBF3 is also observed, these genes are involved in abiotic stress tolerance.
NLS sequence analysis of G. raimondii bZIP proteins NLS sequences from bZIP proteins of G. raimondii and Arabidopsis thaliana are mentioned in Table 4, some of the NLS sequences are sharing similarity among both species as well as in subfamily members. Subfamily C, G and S bZIP proteins are sharing similar NLS sequences.
A typical example of bZIP characterization A 3292bp bZIP17 gene, along with 5'UTR and 3'UTR was isolated from cotton and over-expressed in rice. Transgenic T 3 rice plants were characterized under salt and water stress. Higher transcript level of cotton bZIP17 was observed after 6 hrs of salt stress and non-signi cant change in the expression was observed under water stress, data not shown (Poster presentation in 5th "Plant Genetics and Genomics: Germplasm to Genome Engineering" conference, 2019; http://glostem.com/select/upload_ les/718_1_Plant%20Genetics%20and%20Genomics%202019%20Proceedings.pdf).

Discussion
We have identi ed, structurally characterized and classi ed in detail the important plant transcription factor family of bZIP proteins, in three genomes of cotton. This information can be used in further crop improvement efforts.
Phylogenetic analysis of G. hirsutum, G. arboreum and G. raimondii with Arabidopsis bZIP proteins shows close relation among functionally similar subfamilies according to their occurrence in same clades, thus suggesting bZIP family proteins are conserved in crucial biological functions in two dicot species. Even plant bZIP proteins are functionally similar in monocot and dicot species, G. arboreum and G. raimondii bZIPs are phylogenetically analyzed with Oryza bZIPs, data not shown. Among Gossypium bZIPs, subfamily A and S are representing maximum bZIP genes, subfamily S members are smaller in length, with one or two exons and subfamily B protein is the longest one.
Subfamily B, H, J and K of Gossypium bZIPs are single member families. Gossypium bZIPs Subfamily A, B, C, D, J, K and S members gene orthologs in different crop species are reported to function crucially in biotic-abiotic stress tolerance and starvation adaptation.
During discerning process of Gossypium bZIPs multiple bZIP genes with similar standard designation are observed, possibly due to gene duplication event. bZIP gene duplication may have happen during speciation of dicot species, speciation of Gossypium sub-genome species from common ancestor and during allotetraploidization. Whole-genome duplication in G. arboreum and G .raimondii before speciation is reported in previous reports. The ndings from this study indicate that Gossypium bZIP proteins are important in stress adaptation, developmental processes and need further evaluation for trait development like ber quality, yield and stress tolerance in cotton, which is beyond the scope of this study and should be undertaken by future researchers based on present bioinformatics ndings.

Conclusion
This study analyzed, annotated and phylogenetically classi ed bZIP proteins from cultivated cotton species G. hirsutum along with two sub-genome species G. arboreum and G. raimondii. Cotton bZIPs are classi ed into twelve subfamilies and eight subgroups. bZIP gene duplications are observed in all three cotton species. We have identi ed conserved functional motifs among different subfamilies of cotton bZIP proteins and correlated for the prediction of function along with reported function. Explored BRLZ domain structural analysis of G. raimondii bZIPs will be useful in further basic characterization of bZIP proteins of cultivated cotton species G. hirsutum. STRING protein interaction analysis of G. raimondii bZIPs resulted in prediction of interactions among A-D, B-K and C-S subfamily members.

Page 9/19
Phylogenetic analysis of this study will certainly help in the selection of speci c cotton bZIP genes according to the close alignment with Arabidopsis orthologs or sub-genome homolog for functional characterization. The authors declare that they have no con ict of interest.

Availability of data and material
All the data analyzed in this study is available in the manuscript and supplementary les. All identi ed bZIP gene ID, exon number, CDS length, protein length and protein sequences of G. hirsutum, G. arboreum and G. raimondii. G. hirsutum, G. arboreum and G. raimondii bZIP34, bZIP61, bZIP69 and bZIP76 protein alignment.   Leucine heptad repeats variation in G. raimondii bZIPs: basic region motif N-X7-R/K is highlighted in magenta colour, leucine position in luecine heptad motif is highlighted in gray, hydrophobic amino acids other than leucine are highlighted in blue and polar amino acids are highlighted in green