Phylogenomics of Plant NLR Immune Receptors to Identify Functionally Conserved Sequence Motifs

In recent years, the increase in genome sequencing across diverse plant species has provided a significant advantage for phylogenomics studies, allowing the analysis of one of the most diverse gene families in plants: nucleotide-binding leucine-rich repeat receptors (NLRs). However, due to the sequence diversity of the NLR gene family, identifying key molecular features and functionally conserved sequence patterns is challenging through multiple sequence alignment. Here, we present a step-by-step protocol for a computational pipeline designed to identify evolutionarily conserved motifs in plant NLR proteins. In this protocol, we use a large-scale NLR dataset, including 1,862 NLR genes annotated from monocot and dicot species, to predict conserved sequence motifs, such as the MADA and EDVID motifs, within the coiled-coil (CC)-NLR subfamily. Our pipeline can be applied to identify molecular signatures that have remained conserved in the gene family over evolutionary time across plant species.

Key features
• Phylogenomics analysis of plant NLR immune receptor family.
• Identification of functionally conserved sequence patterns among plant NLRs.

[9]. The protein sequences were compiled into a single fasta file named "NLRtracker_input_protein.fasta" (Dataset S1).
2. Annotate NLRs from protein sequences.
NLRs were annotated from the input protein sequence file "NLRtracker_input_protein.fasta" by running NLRtracker using the following command:
./NLRtracker -s NLRtracker_input_protein.fasta -o NLRtracker_output
The output NLR protein sequences were saved as "NLR.fasta" in the "NLRtracker_output" folder. In total, we identified 1,862 NLRs from six representative plant species.
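As a quick sanity check on the annotation step, the number of sequences in the output file can be counted with a few lines of Python. This is a minimal sketch and not part of NLRtracker; the helper name read_fasta is hypothetical, and the demo uses made-up records rather than real NLR sequences.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {sequence ID: sequence}.

    The ID is the first whitespace-delimited token after '>'.
    If an ID occurs twice, the later record replaces the earlier one.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Toy input (hypothetical IDs and sequences) standing in for NLR.fasta.
demo = [
    ">seq1 example description",
    "MADAVVSF",
    "LVEKL",
    ">seq2",
    "MKKLA",
]
demo_records = read_fasta(demo)
print(len(demo_records), "sequences")

# With the real output, the same function can read the file directly:
# with open("NLRtracker_output/NLR.fasta") as handle:
#     print(len(read_fasta(handle)))
```

If the count differs from the number reported above, the input file or the NLRtracker version is the first thing to check.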

Published: Jul 05, 2024
Note: In a previous study, we used a tool, NLR-Annotator, to annotate NLR genes [11]. However, since NLR-Annotator may not detect a few functionally validated NLRs (e.g., ADR1), we employed NLRtracker [2] in this protocol. Therefore, test datasets in the following analyses slightly differ from the data reported in Adachi et al. [9].
3. Extract specific NLR subfamily sequences based on phylogenetic analysis.
In a previous study [9], we characterized a conserved sequence pattern (MADA motif) crucial for CC-NLRs to trigger immune responses. To identify conserved sequence patterns in each NLR subfamily, we initially classified NLRs through phylogenetic analysis. Here, the NLR sequences obtained in step B2 were combined with 31 functionally characterized CC-NLRs and saved as "NLR_set.fasta" (Dataset S2).
Protein sequences in the input file "NLR_set.fasta" were aligned using MAFFT:
mafft NLR_set.fasta > NLR_set_alignment_output.fasta
For the phylogenetic analysis of the NLR family, NB-ARC domain sequences were extracted from the output alignment file "NLR_set_alignment_output.fasta" based on the NB-ARC domain sequence of Arabidopsis ZAR1 (Dataset S3). Extraction of NB-ARC domain sequences can be performed manually using alignment software or using our script (Supplemental script 1). In this script, protein sequences lacking the intact p-loop motif (G/AxxxxGKT/S) required for NLR protein function are automatically discarded from the dataset. Sequence gaps in the aligned NB-ARC domain sequences are also automatically deleted in this script. The sequences were saved as "NLR_set_alignment_NBARC_RemGap.fasta" (Dataset S4), which can be used as the input file for further phylogenetic analysis.
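The domain extraction and p-loop filter described above can be illustrated in Python. This is a minimal sketch under stated assumptions, not a reproduction of Supplemental script 1: the function names are hypothetical, the sequences are short toy strings, the alignment is assumed to be held as a dict of equal-length gapped strings, and the gap-deletion step is left out. The p-loop pattern G/AxxxxGKT/S is written as the regular expression [GA].{4}GK[TS].

```python
import re

# G/AxxxxGKT/S: G or A, any four residues, then G, K, and T or S.
P_LOOP = re.compile(r"[GA].{4}GK[TS]")


def domain_columns(aligned_ref, domain_seq):
    """Return (start, end) alignment columns spanning domain_seq in the reference.

    aligned_ref is the gapped reference sequence; domain_seq is the ungapped
    domain sequence expected to occur within it.
    """
    # Alignment column of each ungapped residue in the reference.
    columns = [i for i, ch in enumerate(aligned_ref) if ch != "-"]
    pos = aligned_ref.replace("-", "").find(domain_seq)
    if pos < 0:
        raise ValueError("domain sequence not found in reference")
    return columns[pos], columns[pos + len(domain_seq) - 1] + 1


def extract_domain(alignment, ref_id, domain_seq):
    """Slice every aligned sequence to the reference domain columns and
    keep only sequences whose ungapped slice contains a p-loop motif."""
    start, end = domain_columns(alignment[ref_id], domain_seq)
    kept = {}
    for name, seq in alignment.items():
        region = seq[start:end]
        if P_LOOP.search(region.replace("-", "")):
            kept[name] = region
    return kept


# Toy alignment (hypothetical sequences, not real NB-ARC domains).
toy_alignment = {
    "ref": "MK-GMGGSGKTTLA--QV",
    "a":   "MKAGMGGSGKSTLAAAQV",
    "b":   "MK-GMGG-AKTTLA--QV",  # p-loop disrupted, should be discarded
}
toy_domain = extract_domain(toy_alignment, "ref", "GMGGSGKTTLA")
print(sorted(toy_domain))
```

Slicing by reference columns keeps all retained sequences in the same alignment frame, which is what a downstream tree-building step needs.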

Note: We use conserved NB-ARC domain sequences for phylogenetic analyses of the NLR gene family because other domains, such as the N-terminal domains and the C-terminal LRR domain, are often too diversified to be suitable for inferring phylogenetic relationships among NLRs.
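The maximum likelihood phylogenetic tree was inferred by RAxML using the following command:
raxmlHPC-PTHREADS-AVX2 -s NLR_set_alignment_NBARC_RemGap.fasta -n NLR_MLtree -m PROTGAMMAAUTO -f a -# 100 -x 1024 -p 121
Note: The '-f a' option runs a rapid bootstrap analysis together with the search for the best-scoring maximum likelihood tree, and '-# 100' sets the number of bootstrap replicates to 100. The '-x' and '-p' options are random seeds for the rapid bootstrapping and the parsimony inferences, respectively.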
NLRs that belong to the CC-NLR phylogenetic subclade were identified by their grouping with functionally characterized CC-NLRs in the NLR phylogenetic tree output file "RAxML_bipartitions.NLR_MLtree" (Figure 2; Dataset S5). We extracted 1,305 protein IDs (Dataset S6) of the CC-NLR clade in the NLR phylogenetic tree using iTOL [14]. For further sequence analysis, we extracted protein sequences of CC-NLRs from the input file "NLR_set.fasta" (Dataset S2) based on the ID list "CCNLR_IDs.txt" (Dataset S6) using Supplemental script 2 and saved them as the output file "CCNLR_set.fasta" (Dataset S7).

The provided N-terminal domain sequences were classified into several tribes in the output file "dump.out.blast_results.txt.l14" (Dataset S9). Among the output tribes, we focused on a tribe including ZAR1, RPP13, R2, and Rpi-vnt1.3 (tribe 3) for further sequence analyses, as described in Adachi et al. [9]. We then extracted IDs of N-terminal domain sequences from CC-NLRs grouped into tribe 3 using Supplemental script 4. For the analysis of conserved sequences, we extracted N-terminal domain sequences of tribe 3 from the input file "CCNLR_Ndomain_set.fasta" (Dataset S8) using Supplemental script 2 and saved them as the fasta file "Nseq_Tribe3.fasta" (Dataset S10).

From our test data, we identified five conserved sequence patterns in the N-terminal domain of tribe 3 CC-NLRs (Figure 3). Among the identified motifs in the output "meme.html", a motif located at the very N terminus was defined as the MADA motif based on the deduced 21 amino acid consensus sequence "MADAxVSFxVxKLxxLLxxEx" [9], conserved in approximately 78% of tribe 3 CC-NLRs. The EDVID motif, which functions in stabilizing the structure of CC-NLR proteins [19], is conserved in approximately 85% of tribe 3 CC-NLRs (Figure 3).

We set an HMM score cutoff at 10.0, which is optimal for high-confidence searches of MADA-containing CC-NLR proteins (MADA-CC-NLRs) [9]. We also defined NLR proteins with HMM scores from 0 to 10.0 as MADA-like CC-NLRs. From our test data, we identified 108 MADA-CC-NLRs and 161 MADA-like CC-NLRs. Based on conserved sequence patterns of NLRs, we can predict evolutionarily conserved molecular functions of NLRs and apply this to mutant analyses in molecular biology, biochemistry, and cell biology experiments, as described in recent studies [1,9,20,21].
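Two bookkeeping steps in this section, selecting sequences by an ID list and binning proteins by HMM score, can be sketched in Python. This is a minimal illustration under stated assumptions and not the authors' Supplemental script 2: the function names and the toy records are hypothetical, a score of exactly 10.0 is assumed to count as a MADA-CC-NLR, and the label "other" for scores below 0 is an assumption, since the text does not name that category.

```python
def select_by_ids(records, ids):
    """Return the subset of {ID: sequence} records named in ids.

    Raises KeyError if any requested ID is absent, so that a typo in the
    ID list does not silently shrink the dataset.
    """
    wanted = set(ids)
    missing = wanted.difference(records)
    if missing:
        raise KeyError("IDs not found: " + ", ".join(sorted(missing)))
    return {name: seq for name, seq in records.items() if name in wanted}


def classify_mada(score, cutoff=10.0):
    """Bin a protein by its HMM score for the MADA motif.

    Assumptions: score >= cutoff is a MADA-CC-NLR, 0 <= score < cutoff is a
    MADA-like CC-NLR, and anything below 0 is labeled 'other'.
    """
    if score >= cutoff:
        return "MADA-CC-NLR"
    if score >= 0:
        return "MADA-like CC-NLR"
    return "other"


# Toy data (hypothetical IDs, sequences, and scores).
toy_records = {"p1": "MADAVV", "p2": "MKKLAA", "p3": "MAEAVL"}
toy_subset = select_by_ids(toy_records, ["p1", "p3"])
toy_scores = {"p1": 12.3, "p2": -1.5, "p3": 3.0}
toy_classes = {name: classify_mada(s) for name, s in toy_scores.items()}
print(toy_subset)
print(toy_classes)
```

Keeping the cutoff as a parameter makes it easy to test how sensitive the MADA-CC-NLR count is to the chosen threshold.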

Validation of protocol
This protocol or parts of it has been used and validated in the following research article(s):
• Adachi et al. [9]. An N-terminal motif in NLR immune receptors is functionally conserved across distantly related plant species. eLife (Figures 3-6).
• Chia et al. [22]. The N-terminal domains of NLR immune receptors exhibit structural and functional similarities across divergent plant lineages. Plant Cell (Figures 3 and 5).

Figure 3. Consensus sequence patterns detected in the N-terminal domain of tribe 3 coiled-coil nucleotide-binding leucine-rich repeat receptors (CC-NLRs). Conserved motifs were identified by MEME from 88 tribe 3 CC-NLR members. The motif logos describe the N-terminal consensus patterns, as reported in Adachi et al. [9]. Figure is modified from Adachi et al. [9].

1. InterProScan 5.53-87.0
InterProScan is software that characterizes protein function. This program can be downloaded and installed by following the instructions provided at https://www.ebi.ac.uk/interpro/download/. It is compatible with 64-bit Linux operating systems. In this protocol, InterProScan is utilized in the ... the instructions provided at https://mafft.cbrc.jp/alignment/software/.
4. RAxML v8.2.12
RAxML is a program for maximum likelihood-based inference of large phylogenetic trees. To download and install this program, please refer to PART Ⅳ and Ⅴ in the manual available at https://cme.hits.org/exelixis/resource/download/NewManual.pdf. Alternatively, RAxML-NG v1.2.1 can be downloaded and installed on Unix/Linux and macOS systems by following the instructions provided at https://github.com/amkozlov/raxml-ng?tab=readme-ov-file.
/www.barleygenome.org.uk/, IBSC_v2) as used in Adachi et al.