Mining GATA Transcription Factor Encoding Genes in The Cocoa Tree (Theobroma cacao L.) Suggests Their Potential Roles in Embryo Development and Biotic Stress Response

GATA transcription factors (TFs) are widely recognized as significant regulators, characterized by a DNA-binding domain that consists of a type IV zinc finger motif. This TF family has been widely investigated in numerous higher plant species. The purpose of the present work was to comprehensively analyze the GATA TF family in the cocoa plant (Theobroma cacao L.) by using various bioinformatics tools. As a result, a total of 24 members of the GATA TF family were identified and annotated in the cocoa genome assembly. According to phylogenetic analysis, these TcGATA proteins were classified into four distinct groups: groups I (10 members), II (seven members), III (five members), and IV (two members). Next, our investigation indicated that the TcGATA proteins in different groups exhibited high variation in their physicochemical features due to their different protein lengths, gene structures, and conserved motif distributions, whereas the TcGATA proteins in the same clade might share common conserved motifs. Additionally, the duplication of the TcGATA genes in the cocoa plant was investigated. Of particular interest, the relative expression levels of the TcGATA genes were examined using available transcriptome databases. The results showed differential expression patterns of the TcGATA genes across developmental stages of zygotic and somatic embryogenesis, indicating that these TcGATA genes function divergently during the development of zygotic and somatic embryos. Moreover, TcGATA genes were differentially expressed under Phytophthora megakarya treatment across time points and cocoa varieties. To sum up, our findings could provide a basis for a deeper understanding of the GATAs in the cocoa plant.


INTRODUCTION
Theobroma cacao L. (2n = 2x = 20), which originated in Amazonian lowland rainforests, was domesticated over 1,500 years ago (Motamayor et al. 2002) and is now grown in more than 50 countries (Diaz-Valderrama et al. 2020; Jaimez et al. 2022). The chocolate tree is considered an economically important species due to its use in the cosmetics and confectionery industries (Tan et al. 2021), chocolate production (Figueira & Scotton 2020), and its medicinal benefits (Pucciarelli 2013). Chocolate tree plantations are often grown in agroforestry ecosystems alongside other fruit and commercial crops, thus providing lasting economic and environmental benefits (Guiltinan et al. 2008). Furthermore, the cocoa tree supplies essential livelihoods for 40-50 million people worldwide (Wickramasuriya & Dunwell 2018). It is therefore important to investigate the growth and developmental processes of this chocolate-producing tree at the molecular level. Although the cocoa tree has been viewed as an experimental organism with some limitations in the research field (Figueira & Scotton 2020), its genome is an excellent resource that permits accelerated progress in plantation and breeding and allows a better understanding of the biochemistry of this tree crop (Motamayor et al. 2013).
Plants are exposed to a variety of environmental stresses that reduce and limit crop productivity. To adapt and survive under these adverse conditions, plants are equipped with specific genes that confer tolerance to such stresses and also regulate their developmental processes. The link between these stress-response mechanisms and gene regulation is predominantly facilitated by transcription factors (TFs). Briefly, TFs are proteins that bind to specific DNA sequences, thereby controlling the transfer of genetic information from DNA to mRNA and playing a crucial role in turning genes on or off in response to environmental stimuli. Among them, GATA has been regarded as one of the ubiquitous TF families in plants, playing an essential role in many processes of plant development, metabolism, and signal transduction (Reyes et al. 2004; Behringer & Schwechheimer 2015; Schwechheimer et al. 2022), as well as in abiotic stress signalling (Gupta et al. 2017; Zhang et al. 2021; Zhao et al. 2021). Specifically, GATA proteins share a highly conserved type IV zinc finger motif (Schwechheimer et al. 2022) followed by a basic region facilitating DNA binding (Merika & Orkin 1993; Teakle et al. 2002; Reyes et al. 2004; Behringer & Schwechheimer 2015; Schwechheimer et al. 2022). Owing to the large number of plant genomes available, the GATAs have been investigated in numerous dicotyledonous and monocotyledonous species, such as Arabidopsis thaliana (Teakle et al. 2002; Kim et al. 2021a), soybean (Glycine max) (Zhang et al. 2015), rice (Oryza sativa) (Gupta et al. 2017), apple (Malus × domestica) (Chen et al. 2017), grape (Vitis vinifera) (Zhang et al. 2018), cotton (Gossypium spp.) (Zhang et al. 2019), chickpea (Cicer arietinum) (Niu et al. 2020), pepper (Capsicum annuum) (C Yu et al. 2021), sweet cherry (Prunus avium) (Manzoor et al. 2021), cucumber (Cucumis sativus) (Zhang et al. 2021), potato (Solanum tuberosum) (R Yu et al. 2021), four Rosaceae species (Manzoor et al. 2021), purple false brome (Brachypodium distachyon) (Peng et al. 2021), Populus (Kim et al. 2021b), wheat (Triticum aestivum) (Du et al. 2022; Feng et al. 2022), foxtail millet (Setaria italica) (Lai et al. 2022), and peanut (Arachis hypogaea) (Li et al. 2023). Among them, the functions of GATA TFs have been well investigated in many plants (Gupta et al. 2017; Zhu et al. 2020; Zhang et al. 2021; C Yu et al. 2021; Feng et al. 2022; Li et al. 2023; Le et al. 2023). Unfortunately, even though the genome of the cocoa tree has been published (Motamayor et al. 2013), the GATAs of this economically important species have not yet been characterized.
In the current study, we aimed to conduct a systematic investigation of the GATA TFs in the cocoa genome by using bioinformatics approaches. We performed genome-wide identification and characterization of the cocoa GATAs. To establish the evolutionary relationship of GATA TFs in cocoa with those of other plants, we also performed a comparative genomic analysis. Finally, the expression levels of the GATA genes were investigated by using previously published RNA-Seq datasets. The results provide a foundation for understanding the characteristics of this plant TF family, including its evolution.

Identification and annotation of GATA family in cocoa tree
To identify all putative members of the TcGATA family in the cocoa genome, a TBLASTN search (Gertz et al. 2006) in the NCBI, Phytozome (Goodstein et al. 2012), and PlantTFDB (Jin et al. 2017) databases was conducted against the recent genome of this species (BioProject accession: PRJEB14326) (Motamayor et al. 2013) using well-characterized AtGATA proteins from A. thaliana (Teakle et al. 2002) as queries (cut-off value < 10e-4). The Pfam database (Mistry et al. 2021) was then used to confirm that all potential candidates contained the conserved GATA zinc finger domain (Pfam accession: PF00320). Afterward, the full-length protein, coding DNA (CDS), and genomic DNA (gDNA) sequences and the corresponding identifier of each TcGATA member were collected for further analyses.
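The E-value filtering step described above can be sketched as follows. This is a minimal illustration, assuming the search results were exported in the standard 12-column BLAST tabular format (E-value in column 11); the function name, the subject identifiers, and the example rows are hypothetical and are not taken from the study.

```python
import csv
import io

# Cut-off as written in the text ("10e-4"), which is numerically 1e-3.
EVALUE_CUTOFF = 10e-4


def candidate_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return subject IDs having at least one hit with E-value below the cutoff.

    Assumes 12-column BLAST tabular output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        subject_id, evalue = row[1], float(row[10])
        if evalue < cutoff:
            hits.add(subject_id)
    return hits


# Hypothetical rows for illustration only.
example = (
    "AtGATA1\tTc_hypothetical_01\t62.5\t80\t30\t0\t1\t80\t100\t339\t1e-25\t105\n"
    "AtGATA1\tTc_hypothetical_02\t30.0\t40\t28\t0\t1\t40\t10\t129\t0.5\t28\n"
)
print(sorted(candidate_hits(example)))
```

Candidates retained by such a filter would still need the Pfam domain check (PF00320) described above before being accepted as family members.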

Characterization of GATAs in cocoa tree
The exon/intron organization of TcGATA genes was constructed from the CDS and gDNA of each TcGATA member by using Gene Structure Display Server v2 (GSDS) (Hu et al. 2015), and the physicochemical features of the TcGATA proteins were calculated by the Expasy ProtParam online tool (Gasteiger et al. 2003; Gasteiger et al. 2005) as previously described (Niu et al. 2020; Wang et al. 2021). Sub-cellular localization of TcGATA proteins was predicted by using the SherLoc2 program (Briesemeister et al. 2009). The gene ontology of TcGATA genes, including biological processes, cellular components, and molecular functions, was predicted by NETGO 2.0 (Yao et al. 2021), retaining terms with scores higher than 0.8. The conserved motifs in the GATA TFs were screened by using the MEME web-based tool (Bailey et al. 2006).
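Two of the physicochemical descriptors mentioned above have simple closed forms and can be reproduced offline. The sketch below is an illustration only, not the ProtParam implementation: it computes the grand average of hydropathicity (GRAVY) as the mean Kyte-Doolittle hydropathy per residue, and the aliphatic index as X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Mean hydropathy per residue; negative values suggest a hydrophilic protein."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    seq = seq.upper()
    n = len(seq)

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Hypothetical toy peptide, not a TcGATA sequence.
peptide = "CNACGLYWRKTPQWR"
print(round(gravy(peptide), 3), round(aliphatic_index(peptide), 2))
```

Under this definition, a protein with GRAVY below 0 would be called hydrophilic, which is the criterion applied to the TcGATAs in the Results.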

Phylogeny and gene duplication analysis of the GATAs in cocoa tree
To generate the phylogenetic tree, the MAFFT program (Katoh & Standley 2013) was used to align the full-length protein sequences of TcGATA members from cocoa and well-characterized GATA proteins from other higher plant species, including A. thaliana (Teakle et al. 2002; Kim et al. 2021a), apple (Chen et al. 2017), Populus (Kim et al. 2021b), grape (Zhang et al. 2018), and rice (Gupta et al. 2017). The maximum likelihood (ML) phylogenetic tree was generated using MEGA version 11 software (Tamura et al. 2021) with 1000 bootstrap replicates.
Gene duplications were determined as previously described (Guo et al. 2015). The ratio between Ka (the number of nonsynonymous substitutions per nonsynonymous site) and Ks (the number of synonymous substitutions per synonymous site) was calculated by using MEGA version 11 (Tamura et al. 2021) and DnaSP version 6 (Rozas et al. 2017).
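The Ka/Ks ratio is usually interpreted with the conventional rule that values below 1 suggest purifying selection, values near 1 suggest neutral evolution, and values above 1 suggest positive selection. A minimal sketch of that rule is given below; the function name, the tolerance, and the input values are hypothetical and serve only to illustrate the arithmetic.

```python
def selection_mode(ka, ks, tol=1e-9):
    """Return (Ka/Ks, label) using the conventional interpretation of the ratio.

    ka: nonsynonymous substitutions per nonsynonymous site
    ks: synonymous substitutions per synonymous site (must be > 0)
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        label = "neutral"
    elif ratio < 1.0:
        label = "purifying"
    else:
        label = "positive"
    return ratio, label


# Hypothetical Ka and Ks values for one duplicated gene pair.
ratio, label = selection_mode(ka=0.13, ks=0.50)
print(f"Ka/Ks = {ratio:.2f} ({label} selection)")
```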

Analysis of the expression profiles of the GATAs in cocoa
The expression of TcGATA genes at different developmental stages of zygotic and somatic embryos was examined using a public dataset (GEO accession: GSE55476) (Maximova et al. 2014) available from NCBI GEO (Barrett et al. 2013). Additionally, the expression profiles of the TcGATA genes under pathogen infection were investigated by analyzing a previously published microarray atlas (GEO accession: GSE116041) (Pokou et al. 2019). Relative expression values of TcGATA genes were estimated using Actin 11, the most stably expressed gene in various cocoa tissues (Pinheiro et al. 2011), as a reference gene, as previously described (Cao 2022). Up- and down-regulated genes were defined by a fold-change cut-off (|fold-change| ≥ 1.5) between each of 6, 24, and 72 hours after inoculation (hai) and 0 hai.
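The normalization and fold-change rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes linear-scale (not log-transformed) intensities, and it uses a signed fold-change convention in which ratios below 1 are reported as negative reciprocals, so that |fold-change| ≥ 1.5 covers both directions. All function names and numeric values are hypothetical.

```python
def relative_expression(target, reference):
    """Expression of a target gene relative to the reference gene (e.g. Actin 11)."""
    if target <= 0 or reference <= 0:
        raise ValueError("expression values must be positive (linear scale)")
    return target / reference


def signed_fold_change(treated, control):
    """Ratio treated/control; ratios below 1 are returned as negative reciprocals."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive (linear scale)")
    ratio = treated / control
    return ratio if ratio >= 1.0 else -1.0 / ratio


def classify(treated, control, cutoff=1.5):
    """Label a gene as 'up', 'down', or 'unchanged' at the given cut-off."""
    fc = signed_fold_change(treated, control)
    if fc >= cutoff:
        return "up"
    if fc <= -cutoff:
        return "down"
    return "unchanged"


# Hypothetical intensities: (target, reference) at 0 hai and at 24 hai.
rel_0hai = relative_expression(120.0, 1000.0)
rel_24hai = relative_expression(300.0, 1000.0)
print(classify(rel_24hai, rel_0hai))
```

If the deposited values were log2-transformed, the same cut-off would instead correspond to an absolute difference of log2(1.5) between the time point and 0 hai.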

RESULTS AND DISCUSSION
Identification and annotation of the TcGATA proteins in cocoa tree
A total of 24 putative TcGATA genes were identified in the cocoa genome (Table 1), along with their annotations, such as the Phytozome locus and the corresponding sequences. We named these 24 full-length GATA protein sequences TcGATA01 to TcGATA24 based on their physical location in the genome (Table 1).
Recently, there has been a significant effort to identify and characterize GATA TFs in various higher plant species, including both dicotyledonous and monocotyledonous plants (Table S1). Compared to other plants, the TcGATA family found in the cocoa genome is larger than in sweet cherry (18 genes) (Manzoor et al. 2021) (Feng et al. 2022). These comparisons reveal that the number of GATA members varies greatly across plant species.

Analysis of the physical and chemical features of the TcGATA proteins in cocoa
The predicted full-length amino acid sequences of the 24 TcGATA members in cocoa were analyzed using the Expasy ProtParam tool (Gasteiger et al. 2003, 2005). The physical and chemical properties of the TcGATAs obtained from this analysis are summarized in Table 1. The length of the predicted proteins encoded by the 24 TcGATAs varied from 119 (TcGATA04) to 538 amino acid residues (TcGATA10). The molecular weights of the TcGATAs ranged from 13.25 kDa (TcGATA04) to 59.88 kDa (TcGATA07). The theoretical isoelectric point (pI) values of the TcGATAs were distributed between 4.69 (TcGATA18) and 10.14 (TcGATA14), with 12 TcGATAs being acidic (pI values ranging from 4.69 to 6.67) and the remaining sequences being basic (pI values varying from 7.71 to 10.14). The aliphatic index (AI) values of the TcGATAs ranged from 41.16 (TcGATA02) to 77.86 (TcGATA06). Additionally, the grand average of hydropathicity (GRAVY) values for all members of the TcGATAs in cocoa were less than 0, indicating that the TcGATAs are hydrophilic proteins (Table 1).
These results agree with previous comprehensive analyses of the general characteristics of GATA TFs in various higher plant species. For instance, GATAs in Rosaceae woody species, such as Pyrus bretschneideri, Prunus avium, P. mume, and P. persica, were reported to range from 119 to 548 amino acid residues in full-length sequence and from 12.99 to 60.23 kDa in molecular weight (Manzoor et al. 2021). In grape, the lengths of GATA TFs ranged from 109 to 386 amino acid residues (Zhang et al. 2018), while apple GATAs varied from 90 to 1161 amino acid residues (9.9 to 129.74 kDa) (Chen et al. 2017). Additionally, seven Populus species were found to possess a total of 389 predicted GATA proteins, with sequence lengths ranging from 82 to 791 amino acid residues, except for PtsGATA29, which had only 46 amino acid residues (Kim et al. 2021b). The pI values of GATA TFs in higher plant species also span a wide range from acidic to basic, with pI scores ranging from 4.71 to 10.07 in four Rosaceae woody species (Manzoor et al. 2021) and from 4.75 to 10.21 in peanut (Arachis hypogaea) (Li et al. 2023). These varying pI scores may be related to their different protein lengths. Interestingly, the GRAVY values of all GATA members in apple (Chen et al. 2017), four Rosaceae species (Manzoor et al. 2021), and peanut (Li et al. 2023) were negative, indicating that these GATAs may be hydrophilic (Schwechheimer et al. 2022). Overall, the physical and chemical properties of TcGATAs in cocoa, and possibly in other plant species, were highly variable. This dissimilarity in physical and chemical properties suggests that GATAs might play different functions in plants.

Phylogenetic analysis, Gene structure and Conserved Motif of the GATAs in cocoa
A phylogenetic tree comprising all 24 TcGATAs and well-characterized GATAs from A. thaliana (Teakle et al. 2002; Kim et al. 2021a), grape (Zhang et al. 2018), and P. trichocarpa (Kim et al. 2021b) was constructed in order to clarify the phylogenetic relationships of the TcGATAs in cocoa (Figure 1).
According to the ML estimation, the 24 TcGATA proteins were divided into four groups, namely groups I, II, III, and IV (Figure 1). Specifically, group I had the largest number of TcGATA proteins (10 members), followed by group II (seven members) and group III (five members). Group IV had the fewest TcGATA proteins, with only two members, TcGATA06 and TcGATA10 (Figure 1). The classification of GATA TFs had previously been established for different higher plant species (Table S1). GATA TFs from a large number of dicotyledonous plants were also classified into four main clades, as in A. thaliana (Teakle et al. 2002), four Rosaceae species (Manzoor et al. 2021), peanut (Li et al. 2023), and grape (Zhang et al. 2018). Out of the four clades, clades I and IV contained the largest and smallest number of GATAs, respectively. In contrast, the 49 GATA TFs found in potato (Solanum tuberosum) were divided into five groups: group II had the largest number of GATA proteins (15 members), followed by groups IV (13 members), V (10 members), III (eight members), and I (only three members) (R Yu et al. 2021). However, that phylogenetic tree was constructed from only the 49 GATAs of potato, so the classification of potato GATAs might need further comparison with other species. Clades V, VI, and VII were reported only in rice, with two, four, and two members, respectively (Gupta et al. 2017).
The exon/intron arrangement of the cocoa TcGATA genes was then examined. The results showed that the newly identified cocoa TcGATA family genes have exon counts ranging from 1 to 11 (Figure 2). Interestingly, TcGATA genes in the same clade tend to share a similar gene organization (Figures 1 and 2). For example, eight (out of ten) members in group I had two exons, whereas TcGATA03 had only one exon and TcGATA01 contained three exons (Figure 2). Seven (out of nine) TcGATA members of group II contained three exons, while two others had two exons (TcGATA02 and TcGATA04). In group III, the majority of members had seven exons, whereas the two remaining members had 10 (TcGATA18) and 11 exons (TcGATA08) (Figure 2). The two TcGATA genes belonging to group IV had four (TcGATA06) and eight (TcGATA10) exons, respectively (Figure 2). Our findings are consistent with the wide variability in exon numbers of GATA genes identified in four Rosaceae species (from 1 to 10 exons), peanut (Chen et al. 2017), grape (from 1 to 11 exons) (Zhang et al. 2018), and Populus species (from 1 to 12 exons) (Kim et al. 2021b). The unusual gene architecture of TcGATA15 and TcGATA17, characterized by extremely short exons and long introns, is of evolutionary interest. This atypical structure, which contrasts with more common gene architectures, is in line with theories suggesting that the evolution of introns is linked to alternative splicing and the resulting functional diversity in proteins. The presence of long introns in these two genes could potentially delay transcriptional output, providing a mechanism for the suppression of gene expression under adverse conditions. This hypothesis fits the broader concept that gene architecture can be an adaptive trait, where specific structural features, like long introns, may confer selective advantages in response to environmental challenges. The divergence in gene structure within the GATA family of cocoa suggests that the TcGATA genes underwent evolutionary change, which might have led to functional divergence within the family and might have enabled genes to acquire new functions that help plants better adapt to environmental changes (Fan et al. 2014).
The evolutionary relationship and classification of the TcGATA family were validated by analyzing their conserved motifs predicted by the MEME program (Bailey et al. 2006) (Figure 3). Conserved motif 1 was present in all TcGATA proteins, and the majority of members of the same TcGATA group exhibited similar motif patterns. Group IV had the lowest number of conserved motifs (1), while group I had the highest (6). Motif 1 was common to all groups, whereas other motifs were restricted to particular groups. All members of group I contained motifs 2 and 5 except for two genes (TcGATA11 and TcGATA23), motif 7 except for two genes (TcGATA03 and TcGATA23), and motif 8 except for two genes (TcGATA03 and TcGATA13). In addition, motif 10 was found in two members, TcGATA16 and TcGATA19. For group II, motif 6 was detected in five (out of seven) members (TcGATA04, TcGATA12, TcGATA14, TcGATA20, and TcGATA21). All members of group III contained motifs 3, 4, and 9. Moreover, motif 10 was recorded in two genes, TcGATA09 and TcGATA15. The common motif detected in all TcGATAs was the zinc finger loop (C-X2-C-X18-20-CNAC) domain. GATA members in groups I, II, and IV had the C-X2-C-X18-CNAC conserved domain, while the group III members harboured the C-X2-C-X20-CNAC domain (Figure 4). Additionally, the conserved amino acid motif TPQWRXGPXGXKTL was identified between the second and third cysteine residues in the C-X2-C-X18-CNAC zinc finger loop of group I, while the conserved amino acid motif TX2TPLWRXGPXGPKXL was detected at the same position in the C-X2-C-X18-CNAC zinc finger loop of group II. Moreover, the conserved amino acid motif GXSX3TPXMRRGPXGPRXL was detected between the second and third cysteine residues in the C-X2-C-X20-CNAC zinc finger loop of group III, and the conserved amino acid motif GX2STPLWRNGPPEKPVL was identified at the same position in the C-X2-C-X18-CNAC zinc finger loop of group IV (Figure 4). Our findings showed that the motif distributions of TcGATA proteins were comparable within each subfamily. The presence of the same motif in all groups, or in all members of a group, suggests that these motifs might have fundamental functions. The conserved GATA domains and the motifs found between the second and third cysteine residues in the C-X2-C-X18-20-CNAC zinc finger loop in different groups of TcGATAs were consistent with conserved structures previously identified in peanut (Li et al. 2023), chickpea (Niu et al. 2020), and Populus species (Kim et al. 2021b). The examination of gene architectures and conserved motifs reveals that GATA members within a group exhibit relatively high conservation across species and that members of different groups also exhibit reasonably high conservation.
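The C-X2-C-X18-20-CNAC pattern discussed above can be expressed directly as a regular expression, which also reports the loop length that separates the X18 form from the X20 form. The sketch below is illustrative only; the function name and the test sequences are hypothetical and are not TcGATA sequences.

```python
import re

# C, any 2 residues, C, a loop of 18 to 20 residues, then the literal CNAC.
ZINC_FINGER = re.compile(r"C.{2}C(.{18,20})CNAC")


def zinc_finger_loop_length(protein_seq):
    """Return the loop length (18, 19, or 20) of the first match, or None."""
    match = ZINC_FINGER.search(protein_seq.upper())
    return len(match.group(1)) if match else None


# Hypothetical sequences built with a glycine filler for clarity.
x18_like = "MS" + "CAAC" + "G" * 18 + "CNAC" + "GLY"
x20_like = "MS" + "CAAC" + "G" * 20 + "CNAC" + "GLY"
print(zinc_finger_loop_length(x18_like), zinc_finger_loop_length(x20_like))
```

Under the grouping reported here, a loop length of 18 would be expected for members of groups I, II, and IV and a loop length of 20 for group III.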

Physical distribution and gene duplication of the GATAs in cocoa
The distribution of the 24 TcGATA genes across the cocoa genome was investigated in this study. The results showed that the TcGATA gene family was distributed randomly across the genome (Figure 5). The number of TcGATA genes differs among chromosomes, with chromosomes 1 and 9 containing the largest number of TcGATA genes (five members each), followed by chromosome 2 with four members, and chromosomes 4, 5, 6, and 8 with two members each (Figure 5). It is noteworthy that chromosomes 3 and 10 each had only one TcGATA gene, while chromosome 7 had none (Figure 5).
The duplication events that occurred in the TcGATA gene family in cocoa were predicted as previously described (Niu et al. 2020), with details provided in Figure 5 and Table 2. Three duplicated gene pairs were found in the TcGATA family, with nucleotide similarities ranging from 53.3% (TcGATA08 and TcGATA18) to 57.3% (TcGATA09 and TcGATA17). These findings indicate that whole genome duplication (WGD) and segmental duplication (SD) events played a significant role in the expansion of the TcGATA gene family. Additionally, the Ka/Ks ratios for the three duplicated pairs were all less than 1, ranging from 0.26 (TcGATA08 and TcGATA18) to 0.30 (TcGATA20 and TcGATA21), indicating that these TcGATA genes have been under strong purifying selection.
These findings show a trend in chromosome distribution and evolution similar to that of GATA genes in many other plant species. A random distribution of GATA genes across the genome was reported in seven Populus species (Kim et al. 2021b), four Rosaceae species (Manzoor et al. 2021), and chickpea (Niu et al. 2020). Interestingly, no tandem duplication was observed in the TcGATA family, while two WGD events and one SD event accounted for the duplications in the family (Table 2). This result supports the view that WGD and SD events played a more important role in the evolution of the GATA genes than tandem duplication events, as also observed in chickpea (Niu et al. 2020), Populus species (Kim et al. 2021b), grape (Zhang et al. 2018), and perhaps many other plants (Zhang et al. 2019; Li et al. 2023).
Note: WGD: Whole genome duplication, SD: Segmental duplication, Ka: the number of nonsynonymous substitutions per nonsynonymous site, Ks: the number of synonymous substitutions per synonymous site.

Gene ontology analysis of the GATAs in cocoa
In this study, gene ontology (GO) analysis was used to annotate the probable roles of the TcGATA TFs. Accordingly, the 24 TcGATAs were categorized into 55 functional groups across the three main ontologies: cellular component, biological process, and molecular function (Figure 6). In the cellular component category, all 24 TcGATAs were predicted to function in the nucleus and intracellular organelle, while only one member was predicted to function in the intracellular non-membrane-bounded organelle (Figure 6). The GO analysis also indicated that all TcGATAs were located in the nucleus (Figure 6), which was confirmed by the sub-cellular localization prediction of the SherLoc2 tool (Table 1). All TcGATAs were predicted to localize to the nuclear compartment, followed by the cytoplasm (seven out of 24) (Figure 6), while only one TcGATA member was predicted to localize to the Golgi apparatus, vacuole, plasma membrane, or extracellular space (Table 1). Determining the sub-cellular localization of proteins is thought to provide insight into their potential roles (Goodin 2018).
In the molecular function category, all TcGATA proteins were predicted to act as TFs (DNA binding, nucleic acid binding, and sequence-specific DNA binding) (Figure 6). Under the biological process annotation, all TcGATAs were associated with biological processes, and 18 out of 24 TcGATAs were predicted to function in the regulation of biological processes. In addition, 11 out of 24 TcGATAs were predicted to function in response to stimuli. These results agree with a previous report that the 32 PbGATAs of Chinese white pear were predicted to function in DNA binding and as nucleic acid binding TFs (Manzoor et al. 2021).

Expression patterns of the cocoa TcGATAs
Of particular interest in this study, the expression pattern of the TcGATA genes at different stages of cocoa embryo development was investigated (Figure 7). In general, most TcGATA genes were expressed at all stages of embryo development, except for TcGATA14 (Figure 7). The expressed TcGATA genes exhibited different expression levels at different developmental stages of the embryo (Figure 7). Sixteen TcGATA genes were differentially expressed during zygotic embryo maturation, with 11 TcGATA genes displaying higher expression levels in mature zygotic embryo tissues than at other developmental stages, including TcGATA03, TcGATA05, TcGATA07, TcGATA11, TcGATA22 (group I), TcGATA04, TcGATA20, TcGATA21 (group II), TcGATA08, TcGATA18 (group III), and TcGATA10 (group IV) (Figure 7). However, four TcGATA genes belonging to group I showed lower expression levels in mature zygotic embryo samples than at early developmental stages, including TcGATA02, TcGATA03, TcGATA23, and TcGATA24 (Figure 7). Similarly, five TcGATA genes exhibited higher expression at the mature (M-SE) than at the late torpedo (LT-SE) stage of somatic embryogenesis, namely TcGATA05, TcGATA07, TcGATA11, TcGATA04, and TcGATA18. However, TcGATA16 showed a lower expression level at M-SE than at LT-SE (Figure 7). Differential gene expression between zygotic and somatic embryogenesis was also recorded at the same developmental stages. At the torpedo stage, four TcGATA genes (TcGATA02, TcGATA03, TcGATA16, and TcGATA23) were more highly expressed in zygotic embryos than in somatic embryos, while three other genes (TcGATA04, TcGATA07, and TcGATA10) were less expressed. At the mature stage, 11 TcGATA genes (TcGATA01, TcGATA03, TcGATA04, TcGATA08, TcGATA09, TcGATA17, TcGATA18, TcGATA20, TcGATA21, TcGATA22, and TcGATA23) had higher expression levels in zygotic embryos than in somatic embryos, while three other genes (TcGATA07, TcGATA16, and TcGATA19) had lower expression levels (Figure 7). Overall, the expression of TcGATA genes during embryogenesis suggests that this TF family plays an important role in the seed development of cocoa. The differential expression patterns of the genes across developmental stages of zygotic and somatic embryogenesis indicate that different TcGATA genes function divergently during the development of zygotic and somatic embryos. Despite the large number of genome-wide analyses of GATAs in plants, the function of this family in embryogenesis has rarely been reported.
Earlier, the GATA factor HANABA TARANU was reported to be required to position the proembryo boundary in the early embryo of A. thaliana (Nawy et al. 2010). In addition, the expression of two GATAs (GATA NITRATE-INDUCIBLE CARBON-METABOLISM-INVOLVED and CYTOKININ-RESPONSIVE GATA1) in Arabidopsis seedlings has been described (Chiang et al. 2012), and the BME3 (Blue Micropylar End 3) GATA TF has been described as a positive regulator of Arabidopsis seed germination (Liu et al. 2005). Thus, our findings provide further evidence supporting a function of GATAs in embryo development in plants. Further in-depth investigation is required to explore the role of GATAs in the seed development of seed crop species.
A significant number of TcGATA genes showed differential expression under Phytophthora megakarya treatment across time points and cocoa varieties (Figure 8). In particular, the expression of TcGATA22 was not detected in any treatment of either the Nanay or the Scavina genotype. At 6 hours after inoculation (hai), only two genes, TcGATA06 and TcGATA23, showed an increase in relative expression level in the Nanay genotype. However, in the Scavina genotype, four genes, including TcGATA05, TcGATA08, TcGATA12, and TcGATA18, were up-regulated by P. megakarya treatment, whereas TcGATA03 and TcGATA04 were down-regulated. At 24 and 72 hai, in the Nanay genotype, five genes, including TcGATA04, TcGATA05, TcGATA13, TcGATA17, and TcGATA19, were down-regulated by P. megakarya treatment, and only two genes, TcGATA06 and TcGATA07, were up-regulated. In contrast, in the Scavina genotype, eight genes, including TcGATA01, TcGATA04, TcGATA05, TcGATA06, TcGATA12, TcGATA16, and TcGATA20, were up-regulated. Moreover, at 72 hai, most of the expressed TcGATA genes were up-regulated by P. megakarya treatment, except for TcGATA03, which was down-regulated, and four genes, TcGATA07, TcGATA10, TcGATA15, and TcGATA19, which were not regulated by P. megakarya treatment (Figure 8). In summary, TcGATA genes showed different expression patterns in the susceptible (Nanay) and resistant (Scavina) cocoa genotypes under P. megakarya treatment at different time points (6, 24, and 72 hours) after inoculation. These results indicate that TcGATA genes function differently under P. megakarya treatment in different genotypes of the cocoa tree. The increase in relative expression level from 6 hai to 72 hai in the tolerant genotype may help explain the function of TcGATAs in the biotic stress response in cocoa. In the literature, expression pattern analyses have shown that GATA genes respond to diverse abiotic stresses, such as high temperature, salinity, cold, and drought treatments, in many plants, such as rice (Gupta et al. 2017), wheat (Feng et al. 2022), oilseed rape (Zhu et al. 2020), cucumber (Zhang et al. 2021), and pepper (R Yu et al. 2021). However, knowledge about the function of GATAs in the biotic stress response was limited until recently. For example, overexpression of TaGATA1 conferred high resistance to Rhizoctonia cerealis in wheat (Liu et al. 2020). A further detailed investigation into the role of TcGATAs in the response to Phytophthora might be necessary, as cocoa suffers significant annual losses to the water mold Phytophthora spp. (Oomycetes) (ranging between 20 and 25% of global losses) (Adeniyi 2019).

CONCLUSIONS
The present study focused on the identification and characterization of the GATA TF family in the cocoa tree. A total of 24 TcGATA genes were identified in the cocoa genome assembly. By using various tools, the physicochemical features, gene structures, and conserved motifs of the TcGATA proteins were analyzed. The gene expression patterns of the TcGATAs were investigated during the development of zygotic and somatic embryos. Moreover, their expression patterns under inoculation with P. megakarya were also analyzed. The results provide valuable information for further understanding the different functions of TcGATAs during seed development and in response to P. megakarya in cocoa plants. Additionally, these findings offer useful information for comparative genomics studies in plants based on the characterization, evolution, and expression of the GATA gene family.

AUTHOR CONTRIBUTION
N.T.B.C. contributed to the research design, data collection and analysis, and preparation of the first draft of the manuscript. T.M.L. contributed to data collection. H.D.C. contributed to data collection and analysis, and preparation of the first draft of the manuscript. T.T.T.H. contributed to data collection and analysis. L.T.M.T. contributed to data collection and analysis. H.V.L. contributed to the research design, data collection and analysis. Q.T.X.V. contributed to data collection and analysis. H.H.P. contributed to data collection and analysis. V.T.T. contributed to data collection. P.B.C. contributed to the research design, data collection and analysis, and preparation and editing of the manuscript, and supervised the entire process.

Table S1. Number of GATA genes in each group of some plant species used in genome-wide identification of the GATA gene family. * This number is from the recent analysis (Kim et al. 2021b). *** In the case of four species, a different classification (groups A, B, C, and/or D) was used, so it is also omitted (group A: 15 GATA genes, group B: 5 GATA genes, group C: 7 GATA genes, and group D: 1 GATA gene in Brachypodium distachyon (Peng et al. 2021); group A: 12 genes, group B: 9 genes, group C: 4 genes, and group D: 3 genes in Capsicum annuum (C Yu et al. 2021); group A: 17 GATA genes, group B: 5 GATA genes, and group C: 3 GATA genes in Cicer arietinum (Niu et al. 2020); group A: 11 genes, group B: 9 genes, group C: 4 genes, and group D: genes in Cucumis sativus (Zhang et al. 2021)). In the case of Zea mays, a different classification (groups I, II, III, IV, V, and VI) was used, so it is also omitted (group I: 5 genes, group II: 0 genes, group III: 7 genes, group IV: 3 genes, group V: 5 genes, group VI: 3 genes (

Figure 2. Gene exon/intron organizations of the GATA family in cocoa.

Figure 3. Conserved motifs of the GATA family members in cocoa generated by MEME.

Figure 4. Alignments of GATA domains of all identified TcGATA family members in cocoa tree.

Figure 5. The chromosomal distribution of TcGATA genes in the cocoa genome. The red lines indicate the duplication events. The chromosome number is indicated above each chromosome.

Figure 6. GO analysis of the molecular functions, biological processes, and cellular components of TcGATAs, performed with NETGO 2.0.

Figure 7. Expression patterns of the T. cacao GATA genes during zygotic embryo (ZE) and somatic embryo (SE) maturation. Values represent the log2 of the expression level of each TcGATA gene relative to that of the Actin 11 gene, which was the most stably expressed gene in various tissues (Pinheiro et al. 2011). T-ZE: Torpedo zygotic embryo, EF-ZE: Early-full zygotic embryo, LF-ZE: Late-full zygotic embryo, M-ZE: Mature zygotic embryo, LT-SE: Late torpedo somatic embryo, M-SE: Mature somatic embryo, nd: non-determined.

Table 2. Prediction of the duplication events in the TcGATA gene family in cocoa.