Coding and noncoding transcriptomes of NODULIN HOMEOBOX (NDX)-deficient Arabidopsis inflorescence

Arabidopsis NODULIN HOMEOBOX (NDX) is a plant-specific transcriptional regulator whose role in small RNA biogenesis and heterochromatin homeostasis has recently been described. Here we extend our previous transcriptomic analysis to the flowering stage of development. We performed mRNA-seq and small RNA-seq measurements on inflorescence samples of wild-type and ndx1-4 mutant (WiscDsLox344A04) Arabidopsis plants. We identified specific groups of differentially expressed genes and noncoding heterochromatic siRNA (hetsiRNA) loci/regions whose transcriptional activity was significantly changed in the absence of NDX. In addition, data obtained from inflorescence were compared with seedling transcriptomics data, which revealed development-specific changes in gene expression profiles. Overall, we provide a comprehensive data source on the coding and noncoding transcriptomes of NDX-deficient Arabidopsis flowers to serve as a basis for further research on NDX function.

The molecular phenotype supporting the role of NDX in heterochromatin homeostasis was identified in genomic data of 10-day-old seedlings, however, similar functional studies were not performed in other stages of plant development. In particular, there is a lack of transcriptome data in Arabidopsis inflorescence, which awaits to be generated to understand the effect of NDX on the expression of flowering-related coding and noncoding RNAs. Herein, we extend our transcriptomic measurements to the flowering stage and present novel mRNA-seq and small RNA (sRNA)-seq data on inflorescence tissues of wild-type and ndx1-4 T-DNA insertion mutant (WiscDsLox344A04) plants ( Fig. 1). Our analysis identified specific sets of genes differentially expressed in ndx1-4, as well as noncoding heterochromatic siRNA (het-siRNA) loci whose activity was significantly changed in the absence of NDX. Transcriptomic data obtained from inflorescence were also compared with previously published seedling data 6 , highlighting development-specific changes in gene expression profiles.

Methods
Arabidopsis thaliana Col-0 wild-type accession and ndx1-4 mutant (WiscDsLox344A04) 5 plants were directly sown on soil and grown side-by-side at 21 °C on long day conditions (LD, 16 h light, 8 h dark photoperiod) until flowering stage. After five weeks, when the central flowering stem of plants reached approximately 20-25 cm and side branches were also present, inflorescence tissues were collected. Total RNA isolation was performed by the standard phenol-chloroform extraction method as described previously 8 . Briefly, approximately 30 mg plant material per sample was homogenized and resuspended in 600 μl of extraction buffer (0.1 M glycine-NaOH, pH 9.0, 100 mM NaCl, 10 mM EDTA, 2% SDS). The sample was mixed with an equal volume of phenol (pH 4.3, Sigma-Aldrich, P3803). The aqueous phase was treated with 600 μl of phenol-chloroform and chloroform, precipitated with ethanol. RNA samples were eluted in nuclease-free water and used in subsequent steps. To remove genomic contaminations, Dnase I treatments were performed based on manufacturer's instructions (Ambion AM2222, www.thermofisher.com); RNA quality was assessed by testing RNA degradation based on agarose gel electrophoresis and presence of contaminants by Nanodrop spectrophotometer measurements (260/230 and 260/280 ratios). RNA quality was further assayed by capillary electrophoresis (Agilent 2100).
For RNA-seq, cDNA libraries were prepared from two independent biological replicates according to Illumina's TruSeq RNA Library Preparation Kit v2 protocol. Equal amounts of total RNA extracts from 5 inflorescence tissues of different plants were pooled (to create one pool) for each biological replicate (see Tables 1, 2   www.nature.com/scientificdata www.nature.com/scientificdata/ for description of samples). The mRNA-seq libraries were sequenced using an Illumina NextSeq 500 instrument with 1 × 75 bp reads, generating 17.3-22.4 million reads per sample (Fig. 2a). Raw sequence data quality was assessed using FastQC and summarized reports were generated with MultiQC 9 . Duplication detection indicated >50% of reads as unique (Fig. 2b). Mean Phred quality scores for each read position in each mRNA sample were higher than 28 (indicative of 'very good quality') ( Fig. 2c). Per sequence mean Phred quality scores were higher than 28 (indicative of 'very good quality') for at least 95% of reads in each mRNA sample (Fig. 2d).
Next, Salmon 10 was used to map mRNA-seq reads and get transcript quantities for protein coding genes, noncoding genes and transposable elements. Reads were mapped to the Arabidopsis thaliana reference transcriptome (Araport11, coding and noncoding genes) to quantify gene expression levels as well as to transposable Sample Name % Dups % GC  Mean quality scores. The higher the score, the better the base call. Background color of the graph divides the y axis into very good quality calls (green zone), calls of reasonable quality (orange zone), and calls of poor quality (red zone). (d) Per sequence quality scores. The graph shows if a subset of sequences have universally low-quality values. Background color: of the graph very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).

Data description ID/version
Reference sequence TAIR10  www.nature.com/scientificdata www.nature.com/scientificdata/ elements (Araport11) to quantify transposable element expression (see Table 3 for detailed description of Arabidopsis thaliana reference data used in the analyses). Transcript quantities of protein coding genes, noncoding genes, and transposable elements were corrected for potential fragment level GC bias using 'salmon quant-gcBias' that was shown to reduce isoform quantification errors 10,11 . Duplicate reads were filtered out from alignment to protein coding and noncoding genes (default behaviour of 'salmon quant') while duplicates were kept for transposable element alignments ('salmon quant' was used with '-keepDuplicates' option set).
DESeq2 was employed to identify differentially expressed genes (DEGs) and transposable elements (ndx1-4 versus Col-0 wild type) 12 . The significance level was established based on the adjusted p-values with independent hypothesis weighting 13 , where p(adjusted) <0.01 was used to define DEGs and p(adjusted) <0.05 for differentially expressed transposable elements. For the purpose of data visualization, RNA-seq reads were aligned to the TAIR10 reference genome using HISAT2 14 , which allowed for reporting spliced alignments. To create RPKM (Reads Per Kilobase per Million) normalized coverage files, we utilized deepTools bamCoverage 15 .
Description and accessibility of the generated datasets can be found in Tables 1, 2.
For the analysis of the noncoding transcriptome, sRNA libraries were prepared from 3-3 biological replicates according to Illumina's NEBNext ® Multiplex sRNA Library Prep protocol. For sRNA library preparations, equal amounts of total RNA extracts from 5 flowering tissues were pooled (to create one pool) for each biological replicate (see Tables 1, 2 for description of samples). Total RNA extraction was done as described above for mRNA-seq. The same total RNA prep was used for sRNA and mRNA sequencing. The sRNA-seq libraries were sequenced using an Illumina NextSeq500 instrument with 1 × 50 bp reads, generating 13.5-17.5 million reads per sample (Fig. 6a). Raw sequence data quality was assessed using FastQC and summarized reports were generated with MultiQC 9 . Duplication detection indicated 5-6 million unique reads per sample (Fig. 6b). Mean Phred quality scores for each read position in each sRNA sample were higher than 28 (indicative of 'very good quality') ( Fig. 6c). Per sequence mean Phred quality scores were higher than 28 (indicative of 'very good quality') for at least 95% of reads in each sRNA sample (Fig. 6d).
The sRNA-seq data underwent additional processing using the sRNAnalyzer pipeline 17 . The Illumina adapters were initially removed by Cutadapt 18 , and the reads were then size-selected to range between 19-25 nt. These size-selected reads were subsequently aligned to a published comprehensive sRNA locus database 19 (public data: www.ebi.ac.uk/ena/browser/view/PRJEB22276, additional material: https://github.com/seb-mueller/ Arabidopsis_smallRNA_loci) using sRNAnalyzer. To determine differential expression of sRNA loci between Col-0 and ndx1-4 samples, DESeq2 was employed (p(adjusted) <0.05), and the hits were further filtered by absolute fold change (abs(FC) >1.5). For data visualization, reads were aligned to the Arabidopsis thaliana reference sequence (TAIR10) with bowtie2 20

Data records
The mRNA-seq and sRNA-seq datasets generated from Arabidopsis inflorescence tissues were deposited in Gene Expression Omnibus (GEO) under the accession number GSE223591 21 .
Genome browser (JBrowse) tracks containing relevant transcriptomic data are available at: https://geneart.med.unideb.hu/pub/2023_ndx (login: ndx; password: athal23) The mRNA and sRNA datasets generated from Arabidopsis seedlings can be accessed at GEO under the accession number GSE201841 16 and in Supplementary Data 1-16 accompanying the related paper 6 .
Tables 1, 2 summarizes all transcriptomic experiments, NGS data, result tables, and identifiers to access relevant datasets. Table 3 lists all reference data used for analyses.

technical Validation
The genotypes of Col-0 and ndx1-4 mutant Arabidopsis lines were confirmed by PCR using different combinations of genotyping primers designed for the NDX locus and the T-DNA (WiscDsLox344A04) insertion site (Fig. 8). The integrity, purity, and yield of extracted total RNA were determined by agarose gel electrophoresis and nanodrop spectrophotometry (Fig. 9) as well as Agilent bioanalyzer measurements (resulting in RIN values >9). Lack of mRNA expression from the mutagenized ndx1-4 locus was confirmed by RT-qPCR analysis (Fig. 10). RNA samples were extracted from the 10-day-old seedling and inflorescence tissues (Col-0 and ndx1-4 genotypes). To remove genomic contaminations, Dnase I treatments were performed based on a.
b. c. d. mRNA-seq, differenƟally expressed genes ndx1-4, flower vs. seedling www.nature.com/scientificdata www.nature.com/scientificdata/ manufacturer's instructions (Ambion AM2222, www.thermofisher.com). These RNA samples were reverse transcribed with random hexamers using SuperScript IV reverse transcriptase (Thermo Fisher Scientific) as previously described 8 . For qPCR, we employed the ΔΔCt method and internal primers for normalization between samples. Primer sequences used for qPCR were as follows: NDX F AGCTGTAAAGTCAACTAACTGAGA; NDX R TCTAGATCCCATCTAACAAGAAACA; GAPDH1 F AGGAGCAAGGCAGTTAGTGGT; GAPDH1 R AGATGCGCCCATGTTCGTT. Real-time qPCR was subsequently conducted with a LightCycler 480 SYBR Green I Master mix (Roche) utilizing a QuantStudio 12 K Flex Real-Time PCR System (Thermo Fisher Scientific). NDX mRNA expression levels were normalized to GAPDH1 gene expression.

Usage Notes
We have created a comprehensive database of coding and noncoding transcriptomes of Arabidopsis inflorescence and seedling tissues that can be used for downstream genomic analysis. The NGS tracks provided can be displayed directly in freely accessible genome browsers. The list of differentially expressed genes can be directly used for molecular pathway analysis, GO term analysis or gene set enrichment analysis. The identified differentially regulated sRNA loci can be integrated with histone modification, DNA methylation, and transcriptomic maps for in-depth epigenomic analysis. Altogether, the data source we generated in Col-0 and ndx1-4 mutant plants is expected to lead us to a deeper understanding of the molecular function of NDX.

Code availability
No custom code was generated or applied for analysis of the genomic data presented. All software tools were referenced and used with default settings unless otherwise noted. Non-default parameters were as follows: FastQC (v0.11.9), multiqc (1.14): Quality check of mRNA-seq and sRNA-seq reads. Cutadapt (1.9.1): Adapter trimming of sRNA-seq reads (first pass parameters: -a AGATCGGAAGAGCACACGTCT -n 1 -e 0.2 -O 5 -m 1 --match-read-wildcards; second pass parameters: -g GTTCAGAGTTCTACAGTCCGACGA TC -n 1 -e 0.125 -O 8 -m 1 --match-read-wildcards). After adapter trimming, reads were size-selected (using standard linux command line tools) to keep 18-25 nt reads.