Evolution of a Complex Disease Resistance Gene Cluster in Diploid Phaseolus and Tetraploid Glycine 1[W]

We used a comparative genomics approach to investigate the evolution of a complex NB-LRR gene cluster found in soybean ( Glycine max ) and common bean ( Phaseolus vulgaris ) that is associated with several disease resistance ( R ) genes of known function including Rpg1b, an R gene effective against specific races of bacterial blight. Analysis of domains revealed that the N-terminal coiled-coil (CC) domain, central nucleotide binding (NB-ARC), and C-terminal leucine-rich repeat (LRR) domains have undergone distinct evolutionary paths. Sequence exchanges within the NB-ARC domain were rare. In contrast, inter-paralogue exchanges involving the CC and LRR domains were common, consistent with both of these regions co-evolving with pathogens. Residues under positive selection were over-represented within the predicted solvent-exposed face of the LRR domain, although several also were detected within the CC and NB-ARC domains. Superimposition of these latter residues onto predicted tertiary structures revealed that the majority are located on the surface, suggestive of a role in interactions with other domains or proteins. Following polyploidy in the Glycine lineage, NB-LRR genes have been preferentially lost from one of the duplicated chromosomes (homoeologues), and there has been partitioning of NB-LRR clades between the two homoeologues. The single orthologous region in Phaseolus contains approximately the same number of paralogues as found in the two Glycine homoeologues combined. We conclude that while polyploidization in Glycine has not driven a stable increase in family size for NB-LRR genes, it has generated two recombinationally isolated clusters, one of which appears to be in the process of decay. gene cluster groups of shows substantial structural demissum

. Genome duplication thus might be expected to cause an increase in R gene number and diversity. However, analyses of the Arabidopsis thaliana (L.) Heynh. genome, which is believed to have undergone at least two WGD events (Bowers et al., 2003), indicate that R genes were preferentially lost following polyploidy (Cannon et al., 2004;Nobuta et al., 2005), leading to retention of almost no duplicated R genes following whole genome duplication. This enhanced loss of R genes suggests there may be a fitness cost associated with R genes following duplication. In cases where genome duplication is the result of allopolyploidy (i.e., combining genomes from two different species or subspecies), fitness costs could stem in part from autoimmune-type responses in which R genes from one genome are activated in the genomic context of the other genome (Bomblies and Weigel, 2007).
We have been evaluating the impact of WGD on the evolution of a complex R gene cluster in soybean (Glycine max (L.) Merr.) (Innes et al., 2008). The genome of the ancestor of extant Glycine species, including soybean, underwent a WGD, likely as a consequence of allopolyploidy (Gill et al., 2009), and is therefore an excellent species with which to investigate the consequences of this type of event. Analysis of silent site substitution rates (Ks) between gene duplicates has been used to estimate a homoeologue divergence time of approximately 13 million years (Schlueter et al., 2004;Egan and Doyle, 2010;Schmutz et al., 2010), which thus provides a maximum age for this WGD event (Gill et al. 2009).
Prior to the release of the complete soybean genome sequence (Schmutz et al., 2010), we sequenced an approximately 1 megabase (Mb) region of soybean cultivar Williams 82 centered on the Rpg1b disease resistance gene on chromosome 13 (Innes et al., 2008). This region contains numerous NB-LRR type genes, one of which encodes Rpg1b, a gene effective against certain races of bacterial blight (Pseudomonas syringae pv. glycinea) (Ashfield et al., 2004). Also mapping to the vicinity are other R genes of known function including those effective against bacterial, viral and oomycete pathogens (Diers et al., 1992;Yu et al., 1994;Ashfield et al., 1998;Sandhu et al., 2005).
To allow us to assess the impact of polyploidy on the Rpg1b region, we also sequenced the homoeologous (duplicated by polyploidy) segment on chromosome 15 (Innes  et al. 2008). This allowed us to compare the evolution of this R gene rich region in the two duplicated segments generated by polyploidy. In addition, to gain information on the likely ancestral state of this region prior to polyploidy, we sequenced the single orthologous (separated by speciation) region from common bean (Phaseolus vulgaris L.) (Innes et al. 2008), which diverged from Glycine approximately 19 million years ago (mya) (Lavin et al., 2005) and has not undergone the ≤ 13 mya whole genome duplication event. In common bean, this region is referred to as the Co-2 locus and contains R genes effective against diverse pathogens, including the fungus Colletotrichum lindemuthianum (Geffroy et al., 1998) and P. syringae pv. phaseolicola expressing AvrRpm1 (Chen et al., 2010). Finally, to allow us to assess more recent changes within the Glycine homoeologues, we sequenced the orthologous regions in a second G. max accession (PI 96983) and also in a perennial Glycine species, G.
tomentella Hayata (Innes et al. 2008). The G. max and G. tomentella lineages are thought to have diverged about 6 mya (Egan and Doyle, 2010). The phylogenetic relationships among these taxa are shown in Fig. 1A.
Comparison of the Glycine homoeologues revealed a remarkably high retention of non-NB-LRR gene duplicates in this region following polyploidy (~77%), most of which are still expressed (Innes et al., 2008). This was interesting given that the target region in Glycine homoeologue 2 now resides in a percentromeric region rich in retrotransposons and displaying suppressed recombination (Innes et al., 2008;Wawrzynski et al., 2008).
In contrast, although NB-LRRs have been retained in both the duplicated regions, we found evidence for partitioning of ancestral NB-LRR lineages (lineages defined by phylogenetic analysis of the NB-ARC region (van der Biezen and Jones, 1998)) between the two homoeologues in Glycine. Also apparent was relatively recent species / homoeologuespecific expansion of individual lineages (Innes et al., 2008).
Although we reported an initial analysis of the above genomic regions (Innes et al. 2008), we now have expanded our analysis by delving further into the evolutionary history of the NB-LRR genes in this region of the Glycine genome. In particular, we now ask: (1) evolution of the family and therefore facilitate generation/retention of greater diversity as compared to that in a diploid ancestor? (6) Is there evidence for diversifying selection acting on specific codons of NB-LRR genes, and if so, can combining this information with modeling of R protein tertiary structure provide insights into NB-LRR protein function?

Different Fates of the NB-LRR Family in the Two Homoeologous Regions Derived from the Ancestral Rpg1b Region Following Polyploidization
To better understand the evolutionary processes that have driven the divergence of the CC-NB-LRR genes (henceforth referred to simply as NB-LRR ORFs) found in the Rpg1b region in soybean and its relatives, we first defined ten physical intervals within the ~ 1Mb target region centered on the soybean Rpg1b gene, and the corresponding orthologous and homoeologous regions that contained NB-LRR ORFs (s) (A-J in Fig. 1B). Each interval is defined by flanking low-copy gene(s) conserved in most/all of the homologous regions being compared.
Although NB-LRR ORFs are present in both Glycine homoeologous regions H1 and H2, these ORFs display only limited conservation of synteny based on alignments of flanking low-copy genes ( Fig. 1B and Table I). This lack of co-linearity suggests that the NB-LRR genes have undergone many deletions and/or duplications following the divergence of the homoeologous segments in the presumed allopolyploid (Gill et al., 2009) (Fig. 1B, Table I).
Of the nine intervals represented in both G. max cv. Williams82 (Gmw) H1 and Gmw H2, seven contain NB-LRR ORFs in H1 versus five intervals in H2. However, only four intervals contain NB-LRR ORFs in both Gmw H1 and Gmw H2. If duplications are at least partially responsible for this pattern, there must have also been translocations to account for the presence of homoeologue-specific NB-LRR ORF locations.
We have previously shown that, contrary to our expectations, most of the low-copy, non NB-LRR genes in this region appear to be expressed in both G. max homoeologues, despite the fact that the target region in H2 is found in the pericentromeric region of chromosome 15 (Innes et al., 2008). To assess the expression pattern of the NB-LRR family in the pericentromeric H2 region, we used the G. max EST-dataset to estimate what proportion of the NB-LRR genes in Gmw H1 and Gmw H2 are expressed (Table II; Supplemental Table S1). Unlike the low-copy genes (Innes et al., 2008)  ORFs having matching ESTs (17%). This contrasts with 12 out of 15 ORFs (80%) having EST-support in Gmw H1, which is not in a pericentromeric region.
We further investigated the fate of the NB-LRR genes in Gmw H2 by partitioning the family into ORFs that appear to encode intact NB-LRR genes versus those encoding disrupted and/or truncated genes (Table II). Consistent with the EST-analysis, even though numerous NB-LRR ORFs are distributed across the target region in Gmw H2, few appear to correspond to intact, full-length genes (Table II; Supplemental Table S1). In fact, of the 18 predicted NB-LRR ORFs within Gmw H2, only four appear to encode full-length genes with no obvious deletions of key domains or motifs (22% of the total). This compares to 13 out of the 15 NB-LRR ORFs in Gmw H1 appearing to be full-length and intact (87%). There is substantial, but not complete, overlap between the sets of paralogues that appear to be intact with those having EST-support (Supplemental Table S1).
Interestingly, of the 30 predicted NB-LRR ORFs found in the single orthologous region in Phaseolus, only 12 (40%) appear to be intact (Table II). In reality, the actual number of intact NB-LRRs in Phaseolus is probably somewhat larger because gaps in the BAC contig for this species may harbor additional paralogues (two NB-LRR genes at the end of BAC contigs were not included in these totals as we could not determine whether they were intact). Given that the total number of intact NB-LRR genes in both Gmw H1 and Gmw H2 combined is 17, it is apparent that the polyploidization event in the Glycine lineage does not correlate with a large, stable increase in the number of intact NB-LRR genes derived from the ancestral Rpg1b region over that seen in Phaseolus. As described later, this observation was investigated further by assessing the relative importance of polyploidy, versus rates of local duplications and deletions, in the evolution of the NB-LRR family in Glycine and Phaseolus.

Domains and Less Common in the NB-ARC
NB-LRR gene clusters are thought to undergo frequent unequal crossing over events, leading to expansion and contraction of copy number within a cluster, and likely giving rise to recombinant NB-LRR genes with novel specificities for pathogen recognition (Collins et al., 1999;McDowell and Simon, 2008). In addition, gene conversion events among NB-LRR genes may also give rise to new specificities by shuffling sub-regions of NB-LRR sequence (Dodds et al., 2001;Mondragon-Palomino and Gaut, 2005). Both unequal crossing over and gene conversion create chimeric sequences that cannot be meaningfully analyzed in a tree- building framework, such as traditional phylogenetic analysis (Huson and Bryant, 2006). To assess the frequency of recombination among NB-LRR genes in our dataset, we aligned them using a combination of automated and manual alignment tools (see Methods), and then detected likely recombination break points using several methods implemented in Recombination Detection Program v.3.15 (Martin et al., 2005). As expected, these analyses uncovered multiple NB-LRR genes with strong evidence of recombination. Plotting the positions of the recombination breakpoints relative to the conserved domains of NB-LRR genes revealed a strong bias against recombination within the NB-ARC domain (Fig. 2), with only three of 38 (8%) defined breakpoints falling within the NB-ARC, although this stretch represents about 25% of the total sequence length. In contrast the LRR region, which is generally considered a zone of high recombination, contains 68% of the break points while only representing approximately 58% of the gene sequence. Interestingly, recombination events are also over-represented in the CC domain, which contains 24% of the breakpoints while only representing 17% of the gene. However, no defined breakpoints were detected in the first 300nt of the CC, suggesting that recombination is poorly tolerated in this region.
The paucity of recombination breakpoints within the NB-ARC domain raised the question of whether reduced DNA polymorphism in this domain might be limiting our ability to detect such events. We therefore determined the total DNA polymorphism levels in the CC, NB-ARC and LRR domains separately. Despite higher amino acid sequence conservation in the NB-ARC domain (Supplemental Table S2), compared to the CC and LRR domains, the level of nucleotide polymorphism per site (Theta) was only slightly lower in the NB-ARC domain. Overall, we identified 539 polymorphic sites in the NB-ARC domain compared to only 279 in the CC domain, indicating that our failure to detect recombination breakpoints within the NB-ARC domain was not due to a lack of polymorphism. We thus conclude that the lack of recombination breakpoints within the NB-ARC domain is real, and likely reflects functional constraints.
The length of sequence exchange events detected was highly variable, ranging from 71 nt (event 23) to 1166 nt (event 6) for events in which both breakpoints could be defined (Supplemental Table S3). It should be noted that in many cases the position of the breakpoints could not be accurately determined (indicated by asterisks in Fig. 2A), so some exchange events may be even longer. Interestingly, most of the events are confined predominantly to the CC or LRR domains with several possibly extending into the flanking 5' and 3' regions. Only one event extends completely across the NB-ARC ( Fig. 2A, event 6).
Also striking is the absence of events that extend from the LRRs into the NB-ARC domain. Overall, these observations suggest that events leading to chimeric NB-ARC domains are poorly tolerated.
For a subset of the recombination events shown in Figure 2, RDP was able to identify both of the likely parent sequences and their genomic locations (Table III). In all such cases these sequences were within the same physical region (Table III). Events involving sequence transfer within Glycine H1, Glycine H2, and the single orthologous region in Phaseolus, were detected (9, 1 and 2 events, respectively), but no unambiguous exchanges between the Glycine homoeologous regions were identified (Table III; Supplemental Table S3). So while a similar number of intact NB-LRR paralogues have been maintained in the G. max and Phaseolus accessions studied, in Glycine the paralogues are distributed between two recombinationally isolated clusters that are free to evolve independently.

Duplications
Using the stringent criteria used to detect recombinant paralogues described here, most, but not all, paralogues were found to be recombinant in at least one of their three domains. For example, in Gmw H1 only two out of the 12 NB-LRR genes included in the analyses contain no detectable recombination events. Both of the non-recombinant genes (Gmw21f22_7 and Gmw221b6_15) are intact, expressed and contain NB-ARC domains that are relatively isolated in the NB-ARC phylogenetic tree (Fig. 3, described below). Their sequence divergence likely accounts for their lack of recombination with other genes in this region.

Phylogenetic Analysis Reveals Homoeologue-Specific Expansions of NB-ARC Clades
To avoid the errors introduced by chimeric sequences, we focused our phylogenetic analyses on the NB-ARC region, which displayed the lowest levels of sequence exchange.
The recombination test was repeated on this region alone and any recombined NB-LRR sequences, or sequences missing most or all of this region, were excluded, leaving 72 of 93 sequences. These 72 sequences were then subjected to Bayesian analysis to construct a phylogenetic tree (Fig. 3 striking in Phaseolus, where all but one of the paralogues are found in three well supported clades containing sequences only from this species. Consistent with rampant species-specific local duplications, we observed only a single orthologous gene pair, even when comparing paralogues from G. max and G. tomentella, while multiple groups of likely co-orthologous genes are present (Fig. 3). Taken together, these observations suggest that a rapid turn-over (i.e., local duplications and deletions) of paralogues within each species has resulted in groups of co-orthologous genes while still preserving a limited number of ancestral NB-ARC lineages over a substantial evolutionary time period (i.e., prior to the Phaseolus / Glycine split).
Also striking is the apparent partitioning of ancestral NB-ARC lineages between the two Glycine homoeologous regions. As discussed in the section below, this partitioning of lineages between the homoeologous regions points to deletions subsequent to the polyploidization event (or more accurately, subsequent to the separation of the parental lineages as recent evidence indicates this was likely an allopolyploid event (Gill et al., 2009) eliminating the majority of NB-LRR gene duplicates generated by the whole genome duplication.

Lineages Between the Glycine Homoeologous Regions
To more accurately define how many loci were present in the common ancestor of Glycine and Phaseolus, we re-examined the NB-ARC phylogeny to look for putatively orthologous groups of sequences. The number of these groups indicates how many loci (at a minimum) would have been present prior to divergence of Glycine and Phaseolus, thereby providing an indication of the diversity that existed prior to this speciation event, which occurred approximately 19 mya (Lavin et al., 2005). We used the program GeneTree (Page, 1998) to reconcile the species and gene trees, while minimizing gene duplications and losses required to account for the observed data.
Using the consensus Bayesian tree (Fig. 3) as input for GeneTree, an estimate of 10 NB-ARC lineages in this genomic location in the most recent common ancestor (MRCA) of Phaseolus and Glycine was obtained (Supplemental Fig. S1). However, when weakly supported nodes were resolved in favor of fewer duplications and losses (the optimized tree), only nine loci were predicted in the MRCA of these taxa (Fig. 4).
Interestingly, only four of the nine ancestral NB-ARC lineages have been retained in Glycine H2, whereas seven have been retained in H1. Only two clades are represented in both www.plantphysiol.org on August 27, 2017 -Published by Downloaded from Copyright © 2012 American Society of Plant Biologists. All rights reserved. homoeologues ( Fig. 4, clades e and i). The partitioning of clades between the homoeologues is even more striking when only intact paralogues are considered. Neither the single Gmw H1 paralogue (Gmw21f22_26) found in clade e, nor the single Gmw H2 paralogue found in clade i (GmwGap1_106) are intact, or have EST-support, so it appears that no clade has functional representatives in both homoeologues (Supplemental Table S1). Further examination of the data reveals that all four of the intact paralogues in Gmw H2 belong to a single clade (Table I, clade e), whereas Gmw H1 contains intact paralogues from six different clades (Table I, clades b, c, f, g, h and i)(Supplemental Table S1). Thus, while the ancestral clades have been partitioned between the two homoeologous regions, all nine clades have been retained in Glycine, seven of them represented by at least one intact paralogue in the accessions studied here.
In contrast, all of the paralogues from the single orthologous region in Phaseolus are found within three ancestral clades (Fig. 4, Table I). However, caution must be used when interpreting this observation as our sequence coverage in Phaseolus is incomplete and additional clades may be represented by paralogues present in contig gaps.
As noted earlier, although sequence exchanges within the NB-ARC domain are rare, events within the CC and LRR domains are more common. We find that sequence exchanges are not restricted to transfers between paralogues that share the same NB-ARC ancestral lineage. Although two events were detected involving parental paralogues sharing the same class of NB-ARC, eight events occurred between paralogues containing NB-ARCs belonging to different lineages (Table III; Supplemental Table S3). So while several distinct lineages of NB-ARC domains have been maintained since before Glycine and Phaseolus diverged (this likely facilitated by the rarity of inter-paralogue recombination involving this domain), sequence exchanges within the CC and LRR domains are still occurring between paralogues containing different NB-ARC lineages. As a result, gene trees based on just the NB-ARC domain do not necessarily reflect the relatedness of the CC and LRR domains from the same genes.

Duplications and Deletions of NB-LRR Genes in the Rpg1b Region are Ongoing
The GeneTree analysis also allowed us to estimate the number of duplications and deletions that have occurred in the different orthologous and homoeologous segments derived from the ancestral Rpg1b region. Using the optimized tree, eight duplications of NB-LRR loci are suggested to have occurred in the Phaseolus+Glycine lineage after its divergence from the Medicago lineage, but before these two genera diverged, creating the nine labeled clades shown in Figure 4. In contrast, at least 53 duplications (including both local duplications and those resulting from polyploidy) are suggested to have occurred in total in the Phaseolus and Glycine lineages after their divergence (Fig. 4). However, the number of loci and therefore duplications that have occurred prior to the Phaseolus+Glycine divergence from Medicago is probably underestimated due to losses that may occur without leaving any obvious trace. Clearly duplications and deletions of NB-LRR loci have been ongoing throughout the history of these legumes.
In all, the GeneTree analysis indicates that at least 25 duplications occurred in the G. max lineage subsequent to the split from Phaseolus. Of these, 10 duplications were the result of the Glycine-specific polyploidization event, whereas 15 were "local" duplications (i.e., occurred within one of the homoeologous regions examined here or in the ancestral sequence after divergence from Phaseolus but prior to polyploidization) (Fig. 4). It is worth noting that at least one of these apparent local duplications likely reflects conversion of a pre-existing paralogue rather than a whole gene duplication, as this event (Gmw52d1_8, Fig. 2A, event 6) spans the entire NB-ARC region. If only intact paralogues are considered, there is no evidence that any of the nine ancestral lineages have been retained in both G. max homoeologous regions following the polyplidization event (although the possibility cannot be excluded based on analysis of only two accessions). So although polyploidization has been responsible for ~ 40% of the NB-LRR duplication events that have occurred in the G. max lineage, subsequent loss of lineages has prevented the whole genome duplication from driving a sustained increase in family size.
Local duplications that have occurred subsequent to polyploidization are observed in both the Glycine homoeologues and have occurred at a similar rate in the two locations. This is an interesting observation as the target region in H2 is found in a pericentromeric location known to exhibit suppressed recombination (Innes et al., 2008;Schmutz et al., 2010).
Recombination would be required for unequal exchange, one mechanism proposed to generate local duplications in NB-LRR clusters (Leister, 2004). Considering the G. max sequences alone, six duplications among H2 sequences were found (all in clade e) versus eight duplications among H1 sequences (three each in clades c and f and two in clade i) (Fig.   4). However, as noted earlier, in contrast to the situation observed in Glycine H1, few of the paralogues generated by the local duplications in H2 have remained expressed or intact (Table II) Interestingly, local duplication rates associated with specific NB-ARC lineages in Phaseolus and Glycine appear to have differentiated following speciation. P. vulgaris NB-LRR genes within the target region have duplicated at least 22 times since divergence (the actual number of duplications may be higher due to unobserved losses, and unsampled paralogues in contig gaps), while also having been lost at least six times. Excluding duplications caused by polyploidy, G. max NB-LRRs have duplicated only 15 times. When specific NB-ARC lineages are considered the difference is even more striking (e.g. Fig. 4, clades c and i).
In conclusion, although polyploidization accounts for a significant proportion of the NB-LRR duplications that have occurred during the evolution of the Glycine lineage, subsequent deletion of most the duplicated NB-LRR lineages has ensured that it has not driven a large, sustained increase in family size over that seen in Phaseolus. Instead, the genome duplication in Glycine has partitioned the ancestral NB-ARC families between two genomic locations that appear to be recombinationally isolated from one another. ARC domain (Fig. 6). To gain insights into the possible functional significance of the codons under positive selection, the tertiary structure of the NB-ARC domain from a representative paralogue (Gmw52d1-8) was modeled. Interestingly, analysis of the predicted structure indicates that the majority of the sites under positive selection are located on the surface of the folded protein, consistent with a role in interactions with the CC or LRR domains, or with other proteins (Fig. 7A; Supplemental Movie S1).

Positive Selection is Acting on Predicted Solvent-exposed Residues in the CC, NB-ARC and LRR domains
To identify codons under selection within the CC and LRR domains, we partitioned the master, full-length NB-LRR ORF alignment containing the entire set of NB-LRR sequences from this study into three regions corresponding to the CC, NB-ARC, and LRR domains. The parts of the alignment corresponding to the CC and LRR domains were then screened for recombination events individually using RDP 3.44 and recombinant sequences removed. The resulting alignments were then subjected to SLAC, FEL and REL analyses to detect sites under selection. Numerous sites in both the CC and LRR domains were found to be under selection, although the overall trend was for the strength of selection, both positive and negative, to be stronger within the LRRs (Fig. 5A).
To investigate the functional significance of the sites under selection within the CC domain, the COILs program was used to screen the region for the presence of coiled-coils.
The software predicted two intervals encoding coiled-coils when a consensus sequence from the N-terminal alignment was used as the input (Supplemental Fig. S2). As might be expected, several of the predicted locations for conserved hydrophobic residues within the CC heptad repeats are predicted to be under purifying (negative) selection (Fig. 6A). Such purifying selection likely reflects a requirement of the CC domain to fold into a coiled-coil structure, similar to that recently described for the CC domain of MLA10, which forms a homodimer (Maekawa et al., 2011). We thus used the Mla10 CC domain as a template to create a structural model for the Rpg1b CC domain ( The LRR regions of NB-LRR genes have been implicated in recognition specificity and previous studies have reported a bias for the predicted solvent-exposed residues within these regions to be under positive selection (Bittner-Eddy et al., 2000;Mondragon-Palomino et al., 2002). We therefore investigated whether positive selection may also be driving divergence within the solvent exposed faces of the NB-LRR genes from the Rpg1b region. A consensus sequence was derived from the alignment of non-recombinant sequences corresponding to the LRR region and putative repeat units identified and aligned. As can be seen in Fig. 6C, a high percentage of the residues under positive selection within the LRR domain are found within the predicted solvent exposed face ("x" residues printed in blue within the xxLxLxx consensus).

DISCUSSION
Whole genome duplication should initially result in a doubling of gene number in a polyploid relative to its diploid progenitor(s). This is certainly true of an autopolyploid; the prediction for an allopolyploid would be the sum of the two progenitors. In this study we compared the evolution of an NB-LRR gene cluster in two homoeologous segments found in Glycine with its evolution in the single orthologous region found in the diploid P. vulgaris, a stand-in for the ancestor(s) of Glycine. We found that while NB-LRRs have been retained in both Glycine homoeologous regions, this does not result in a large stable increase in functional paralogue number over that seen in P. vulgaris (17 full-length paralogues in G. max cv. W82 H1 + H2 versus at least 12 paralogues in Phaseolus). Instead, the polyploidy event in Glycine most likely has resulted in the partitioning of ancestral NB-ARC lineages between two physically and recombinationally isolated clusters that are now evolving independently.
These observations are consistent with previous whole-genome studies of the NB-LRR family in A. thaliana, cotton, rice and soybean that indicate that, at least in these species, polyploid events have not driven a stable increase in gene family size (Cannon et al., 2004;Nobuta et al., 2005;Schmutz et al., 2010;Zhang et al., 2010;Zhang et al., 2011). In only one of these did phylogenetic analyses support the hypothesis that homoeologous NB-LRR lineages were retained in both duplicated segments (Nobuta et al., 2005).
Together, these observations suggest that there is no selective advantage to maintaining a doubled number of NB-LRR genes following polyploidization, and thus wholegenome duplications are not a key mechanism determining family size. Furthermore, NB-LRR family size appears to be highly variable within a species, with the estimated gene The lack of conservation of NB-LRR genes between the Rpg1 region and its homoeologous region in soybean stands in striking contrast to the low copy genes that are interspersed throughout the Rpg1 region. We previously determined that 77% of single copy genes in the Rpg1 region have been retained in soybean following the 13 mya polyploidy event (Innes et al. 2008). The retention of these low copy genes in syntenic order between the two soybean homoeologues and Phaseolus (Figure 1) indicates that unequal crossing-over between NB-LRR clusters in the Rpg1 region has not occurred, as this would have resulted in deletion or duplication of intervening low copy genes. Despite this lack of unequal cross-over events, we detected sequence exchange between NB-LRR genes in separate clusters. We infer from this observation that gene conversion between NB-LRR genes is more readily tolerated than unequal crossovers, presumably due to a need to retain the low copy genes. It is also notable that many NB-LRR clusters have been entirely lost in homoeologue 2 (see intervals C, E, and F in Fig. 1), while synteny has been retained, suggesting that NB-LRRs can be lost by mechanisms other than unequal crossing over.
The large intraspecies variation in NB-LRR copy number also implies that in the absence of pathogen pressure there may be rapid loss of NB-LRR genes. Several lines of evidence suggest that a large increase in NB-LRR paralogue number following a polyploid event could impose a significant cost on the host plant. For example, in one landmark study, transgenic expression of the Arabidopsis thaliana RPM1 NB-LRR gene in an ecotype that normally does not express this gene resulted in a 9% decrease in seed number (Tian et al., 2003). It should be noted, however, that as A. thaliana expresses more than 150 NB-LRR genes, this cost cannot be typical of all R genes and perhaps reflects auto-activation of RPM1 in a novel genetic background. An even more dramatic cost could result from inappropriate activation of NB-LRRs following the combination of genomes by an allopolyploid event, the likely origin of the Glycine genome (Gill et al., 2009). Supporting this hypothesis, Bomblies and Weigel (2007) observed that in 2% of A. thaliana intraspecific crosses, hybrid-necrosis was triggered in the F1 generation. In at least one instance this phenotype was shown to be dependent on an epistatic interaction between an NB-LRR gene and a second, unlinked gene.
One explanation for these observations could be provided by the "Guard Model", which proposes that at least some NB-LRRs monitor/guard the plant proteins targeted by pathogen virulence factors (van der Biezen and Jones, 1998;Dangl and Jones, 2001;Innes, 2004). The NB-LRRs and the proteins they monitor presumably coevolve and it is possible that when NB-LRRs are introduced into a novel genetic background some will auto-activate in response to inappropriate alleles of the guarded proteins.
As mentioned above, our data indicate a high rate of "birth and death" of NB-LRR paralogues as the result of local duplications and deletions ("local" defined in our study as occurring within one of the ~ 1 Mb homoeologous or orthologous regions analyzed). For example, eight local duplications were identified in G. max H1 and six in H2. The duplications in Glycine H2 are especially interesting as this region resides in a pericentromeric location that is strongly suppressed for recombination and therefore unlikely to experience unequal exchange events (Innes et al., 2008;Schmutz et al., 2010). Strikingly, in P. vulgaris, the rate of local duplications appears to be significantly higher than in either of the Glycine homoeologues, with 22 duplications detected within the available sequence. This increased rate of local duplications presumably reflects an increased rate of unequal crossing over that can lead to expansion of gene clusters. Consistent with this hypothesis, this NB-LRR cluster in P. vulgaris is located in a subtelomeric region, which are known to display thought that the A. thaliana lineage includes at least two polyploidization events (Bowers et al. 2003) so it is clear that the vast majority of NB-LRRs duplicated by these events have subsequently been lost. From these data, the authors concluded that the rate of local duplications and losses, not polyploidy, has been the key factor in determining NB-LRR family size. Our results echo these findings for a more recent polyploidy event. Notably, our identification of numerous NB-LRR pseudogenes at the tips of the phylogenetic tree suggests that gene loss is ongoing and can occur shortly after duplication. These pseudogenes may provide some selective advantage, however, in functioning as a sequence reservoir for gene conversion events that could give rise to new specificities in nearby functional NB-LRR genes.
A second possible beneficial role for polyploid events in NB-LRR evolution could be in creating recombinationally isolated NB-LRR clusters (Baumgarten et al., 2003;Leister, 2004). Following a polyploid event, the number of NB-LRR genes/clusters will initially be increased (doubled in the case of an autopolyploid event). Subsequent deletions of the duplicated NB-LRR paralogues may then result in a return to the original NB-LRR family size. However, for the minority of clusters surviving in both duplicated segments, partitioning of the remaining paralogues between the homoeologues has the potential to distribute genes between an increased number of genomic locations (as compared to the pre-polyploid state).
Providing that sequence exchanges between unlinked NB-LRR loci are rare, the duplicated loci should be free to evolve independently, potentially permitting maintenance of greater NB-LRR diversity in the genome. Furthermore, this could provide a mechanism by which R genes encoding novel recognition specificities are separated from closely related paralogues and so can be protected from disruption by the recombination events that are often common within complex R gene loci. Our observation that ancestral NB-ARC lineages from the Rpg1b cluster have been partitioned between the two homoeologues segments supports this model.
In support of the above model, we found that while interparalogue sequence exchange has occurred within each of the two Glycine homoeologous regions, no convincing interhomoeologue events could be detected. Numerous previous studies have identified sequence exchanges between linked paralogues within NB-LRR clusters, and it has been reported that the frequency of events correlates with physical distance (negatively), sequence similarity, gene orientation and recombination rate (Mondragon-Palomino and Gaut, 2005). It is thought that such recombination may be important for shuffling accumulated sequence diversity within the cluster to facilitate the evolution of new recognition specificities. It is interesting that we observe recombination tracts in Glycine H2 paralogues, as the target region in www.plantphysiol.org on August 27, 2017 -Published by Downloaded from Copyright © 2012 American Society of Plant Biologists. All rights reserved. homoeologue 2 is located in a pericentromeric region that displays suppressed recombination (Innes et al., 2008;Schmutz et al., 2010). However, this apparent paradox can be explained by the observation that while crossovers are suppressed within pericentromeric regions, gene conversion is not (Shi et al., 2010;Talbert and Henikoff, 2010).
Although physical proximity is a key factor modulating rates of sequence exchange between NB-LRR genes, rare recombination events have been detected between paralogues found in different NB-LRR clusters (Kuang et al., 2008), and even between genes located on different chromosomes (Baumgarten et al., 2003;Kuang et al., 2005;Mondragon-Palomino and Gaut, 2005). It has been noted, however, that at least some of these apparent interchromosomal events could reflect recombination between linked genes prior to polyploidy and subsequent losses of parental sequences (Baumgarten et al., 2003;Kuang et al., 2005).
Our failure to detect inter-homoeologue exchanges between NB-LRR paralogues in Glycine H1 and H2 could reflect the high density of retrotransposons in, and associated expansion of, Glycine H2 which might be expected to inhibit inter-homoeologue pairing (Innes et al., 2008;Wawrzynski et al., 2008;Wang and Paterson, 2011).
Significantly, we observed exchanges between paralogues that contain NB-ARC domains belonging to different ancestral lineages. This observation is in contrast to a previous report that, at least in some clusters, exchanges only occur between NB-LRRs that are highly similar in sequence, irrespective of physical proximity, permitting the independent evolution of subfamilies of NB-LRRs within a single cluster (Kuang et al., 2005). Our finding of interlineage exchange among NB-LRR genes highlights the need to exclude recombinant sequences from phylogenetic analyses as such sequences will make it impossible to construct a tree that reflects ancestral relationships. Indeed, many of the genes included in the tree shown in Fig. 3 contain recombination sites outside the NB-ARC domain, thus trees based on just the CC domain or LRR domain of this gene set would look quite different.
In addition to a rapid rate of paralogue turnover and gene conversion, we also find evidence that positive selection has helped drive evolution of the Rpg1b-associated NB-LRR cluster. Strong diversifying selection provides one possible explanation for the retention of paralogue diversity in NB-LRR clusters and has been previously implicated in NB-LRR evolution in plants and the mammalian MHC cluster (Hughes andYeager, 1998, 1998;Mondragon-Palomino et al., 2002;Zhang et al., 2011). As has been observed for other NB- predicted solvent exposed face of the LRR domain. These observations are consistent with the model that the solvent exposed residues interact with pathogen derived ligands or other components of the recognition complex. Evidence to support this hypothesis has been provided by structural modeling of the interaction between the flax L5 / L6 R proteins with the corresponding Avr proteins from flax rust (Dodds et al., 2006;Wang et al., 2007).
Interestingly, we also identified several sites under positive selection in the NB-ARC domain. Although similar observations have been made before (Mondragon-Palomino et al., 2002;Zhang et al., 2011), this was a somewhat surprising finding as the NB-ARC domain has not been implicated in determining recognition specificity. A possible explanation is provided by the discovery that intramolecular interactions can occur between the NB-ARC, CC and LRR domains (Moffett et al., 2002;Ade et al., 2007). Furthermore, these interactions appear to be a key factor in maintaining the R protein in an inactive state in the absence of a corresponding pathogen-dependent signal. It seems plausible that as the LRRs evolve to acquire the ability to detect new pathogen signals, the NB-ARC domain must co-evolve to maintain the required intramolecular interactions, perhaps explaining the presence of sites under positive selection in both R protein domains. Consistent with this hypothesis, domain swaps between the tomato Mi1.1 and Mi1.2 genes, which are 91% identical, generated an auto-active allele that triggered cell death (Hwang and Williamson, 2003). A similar finding has also been described for the potato Rx gene (Rairdan and Moffett, 2006). Modeling of the tertiary structure of the Rpg1b NB-ARC domain suggests that the majority of residues under positive selection are located on the surface of the folded protein where they would be available to interact with other domains (Figure 7).
We also identified several sites under positive selection within the CC domain.
Mapping of these residues onto a predicted tertiary structure of an Rpg1b homodimer revealed that all sites were in the α1 helix facing outward, and were on the side opposite to the conserved EDVID motif (Fig. 7B) Maekawa et al., 2011;Shen et al., 2007). If this face of the CC homodimer does indeed interact with other host proteins, our finding that positive selection is occurring on this face implies that such protein:protein interactions are rapidly evolving. The Arabidopsis RIN4 protein is modified by at least three different bacterial effector proteins, and it is these modifications that appear to activate the RPM1 or RPS2 NB-LRR proteins (Axtell and Staskawicz, 2003;Chung et al. 2011). It is thus plausible that evolution of new ways to modify RIN4 by pathogens selects for changes in RIN4:CC interactions.
In summary, we propose a model for the evolution of the Rpg1b cluster in Glycine in which a relatively rapid deletion of duplicated NB-LRR paralogues followed polyploidization, but in a manner that resulted in partitioning of most of the ancestral lineages between two sister clusters. The H2 cluster became subsumed within a pericentromeric region. Low levels of unequal crossing over in this region removed a key driver of NB-LRR gene births at the same time that transposon insertions led to gene deaths; the associated accumulation of retrotransposons in H2 may have also suppressed inter-homoeologue sequence exchanges with H1. Subsequently the two sister clusters evolved independently with local duplications / deletions of paralogues, sequence exchange, and positive selection, all playing roles in the divergence of the NB-LRR family. While the cluster in Glycine H1 has retained much of the diversity found in the ancestral cluster, only a few, closely related paralogues have survived in H2. In comparison to Glycine, Phaseolus has likely lost diversity in terms of extant NB-LRR lineages in this region, possibly due to an enhanced rate of gene birth and death resulting from unequal crossover events, and frequent gene conversion, both of which will lead to homogenization of NB-LRR sequences in the absence of strong diversifying selection.

DNA Sequence Sources
Our analyses focused on CC-NB-LRR genes localized within a 1-Megabase (Mb) genomic region around the soybean Rpg1b disease resistance gene, located in molecular accession G19833) (Pva), and in the model legume Medicago truncatula Gaertn. (accession A17 Jemalong) (Mt). Assembly of BAC contigs, DNA sequencing, and gene identification have been described previously (Innes et al., 2008). For purposes of discussing homoeologues, we refer to the reference sequence of G. max Williams 82 containing Rpg1b and its corresponding regions in G. max PI96983 and G. tomentella as homoeologue 1 (H1) and the duplicated region that arose from the ≤ 13 mya polyploid event as homoeologue 2 (H2). The majority of the low-copy genes used to define the physical intervals shown in Fig.   1 have been previously analyzed phylogenetically to confirm the predicted orthologous and homoeologous relationships (Innes et al., 2008).

Previously identified G. max fgenesh-predicted open reading frames (ORFs) from
Gmw H1 and Gmw H2 with significant sequence similarity to CC-NB-LRR genes (Innes et al., 2008) were compared to the G. max expressed sequence tag (EST) dataset maintained at NCBI, the G. max Transcript Assembly database at TIGR (http://plantta.jcvi.org/), and the set of predicted soybean genes (Glyma) from the soybean genome project (http://www.phytozome.net/soybean; Schmutz et al. 2010) using the BLASTN algorithm (Altschul et al., 1990). ORFs were recorded as EST-supported when matches were ≥ 98% nt identity over ≥ 100nt and the BLAST score was 1e-10 or better. Matches were further verified by querying the soybean whole genome proteome (gene set Glyma1.0) with the ESTs using BLASTX and confirming that the expected NB-LRR paralogues were the top hits. CC-NB-LRR genes were considered "intact" (full-length) when the predicted ORF was encoded by a single exon, all three major domains (CC, NB-ARC, LRR) were represented, and the P-loop, kin2, GLPL, NBSD_D and MHD motifs within the NB-ARC domain (Albrecht and Takken, 2006) were not obviously deleted.

DNA Sequence Polymorphism Analysis
We estimated polymorphism statistics for each of the CC, NB-ARC, and LRR domains using DNAsp v5 using default parameters (Librado and Rozas, 2009). We also estimated synonymous and nonsynonymous substitution rates, as well as the K a /K s ratio, in DNAsp v5. Most analyses in DNAsp remove columns with gaps and/or missing data. Therefore, we removed sequences from each region that included a large string or gaps or missing data.
After finding a high degree of recombination in the CC and LRR domains, the alignment was trimmed to the NB-ARC domain from the P loop motif to the MHD motif (i.e., from amino acids VGMGG to MHDLL in AY452685). Recombination testing was repeated (as above) and any remaining recombined sequences were also excluded, leaving a subset of 72 out of 93 sequences.
Phylogenetic analyses were performed using MrBayes v. 3.1.2 (Ronquist and Huelsenbeck, 2003). Two runs using ten chains each were analyzed to five million generations, sampling every 10 K generations for each analysis. Trees sampled from the first four million generations were discarded as the burn in and in each case were found to have reached stability by examining the likelihood versus generation plots. Convergence between runs using the remaining 200 trees within an analysis was checked by three methods: a visual inspection of mixing of samples from each run in the aforementioned plot; that the potential scale reduction factor was at or close to one; that the average standard deviation of split frequencies was because of computing restrictions, but failed to better the likelihood score of the simpler models described above and therefore was not pursued.

Gene Tree/Species Tree Reconciliation
GeneTree v.1.3.0 (Page, 1998) was used to plot the position of gene duplications and losses in the course of evolutionary history (phylogeny) and within the framework of assumed relationships of the taxa (i.e., the species tree). This program incorporates gene duplications and losses as well as speciation events for large gene families and attempts to reduce the number of events required to explain the observed data (i.e., extant genes present in the genomes of the sampled species). This program also provided an estimate of the number of NB-LRR loci present prior to the divergence of Phaseolus and Glycine approximately 19 mya (Lavin et al., 2005). We used the species tree shown in Fig. 1A. For simplicity, Glycine max accessions were coded as separate species, but this distinction was ignored for the counts of gene duplications within G. max listed in the Results. The model allowing gene duplications and losses was used.

Selection Analyses
Selection acting on individual codons within the NB-LRR genes was assessed using packages available through the DataMonkey server (http://www.datamonkey.org/) (Pond and Frost, 2005;Pond et al., 2005;Delport et al., 2010). These analyses required alignments of non-recombinant sequences. Because few of the NB-LRR genes from the Rpg1b region are non-recombinant across their entire ORFs, the master alignment of full-length ORFs was divided into three segments representing the CC, NB-ARC and LRR domains that were then analyzed separately.
Selection within the NB-ARC domain was assessed using the set of non-recombinant NB-ARC sequences used for the phylogenetic analyses described above. Using DataMonkey, we determined that the closely related sequence pairs W10N21_4 / P77K19_10 and W52D1_5B / P92I7_20 were non-informative and removed one sequence from each before initiating analyses using the Single Likelihood Ancestor Counting (SLAC), Fixed Effects Likelihood (FEL) and Random Effects Likelihood (REL) methods (Pond and Frost, 2005) implemented on the DataMonkey server and performed using the Bayesian tree presented in To identify sets of non-recombinant sequences for selection analyses within the CC and LRR domains, the sections of the full-length master alignment (93 sequences) corresponding to these domains were screened for recombinant sequences using RDP v. 3.44 and the parameters described earlier (Phylogenetic and Recombinational analyses). All recombinant sequences were then removed. In this manner alignments of non-recombinant sequences containing 53 and 20 sequences were identified for the CC and LRR domains, respectively. These alignments were then used as input to MrBayes (v. 3.1.2) for generating Bayesian trees (nst=6 rates=invgamma, 1000000 generations, samplefreq 100, Nchains=4 Nruns=2). These alignments and phylogenetic trees were then used as inputs for selection analyses using the DataMonkey server as described above.

Prediction of Coiled Coils
A consensus sequence was generated from the alignment of 53 non-recombinant sequences corresponding to the CC domain (selected as described above) such that only specific amino acids present in greater than 50% of the sequences represented at a given position in the alignment were retained. This consensus was then used as input for the COILS server (http://www.ch.embnet.org/software/COILS_form.html) (Lupas et al., 1991) using the MTIDK matrix. The analysis was repeated with and without weighting (2.5 fold weighting for positions a and d of the heptad repeat) to facilitate the identification of false positives.

Comparative Modeling of the Rpg1b CC and NB-ARC Tertiary Structures
To examine the position of positively-selected sites within the NB-ARC domain in the context of tertiary structure, the Rpg1b Williams 82 allele (referred to as paralogue W52d1-8 elsewhere in this manuscript) (Ashfield et al., 2004) and APAF1 (Uniprot sequence O14727) amino acid sequences were aligned using ClustalX (Larkin et al., 2007) and the alignment trimmed to positions 108-450 in APAF1. Secondary structure was predicted for both proteins using the PSIPRED Protein Structure Prediction Server (http://bioinf.cs.ucl.ac.uk/psipred/) (Jones, 1999;Bryson et al., 2005)  (http://www.cgl.ucsf.edu/chimera/) (Pettersen et al., 2004) and using APAF1 as the known structure for comparison. Structures were visualized and manipulated using UCSF Chimera.
Modeling of the Rpg1b CC domain was accomplished using a similar approach, but using the known structure of the MLA10 CC domain as reference (PDB code 3QFL;Maekawa et al., 2011). The Rpg1b CC monomer structure was first modeled and the dimer structure subsequently inferred by applying the symmetry matrix present in the MLA10 PDB file. The Rpg1b and MLA CC domains display a similar pattern of secondary structures but only share 14.9% AA identity (41% similarity) over the modeled region. While the presence of conserved "EDVID" and heptad repeat regions facilitated the sequence alignment, alternative structures cannot be excluded due to ambiguities in the alignment. The structure shown was the one most similar to the MLA10 structure.

Supplemental Data
The following materials are available in the online version of this article. Table S1. NB-LRR genes characterized in this study. Table S2. Polymorphism statistics for the CC, NB-ARC, and LRR domains. Table S3. Recombination events. Table S4. Selection analysis statistics. Table S5    Williams 82 WGS scaffolds (7x draft assembly) are preceded by "GmwRF" or "GmwGap1".

Supplemental
The remaining sequences are BAC-derived and are labeled with the BAC name and gene number corresponding to those described in Innes et al. (2008). # indicates a G. max or P.
vulgaris NB-ARC sequence from an intact NB-LRR paralogue (i.e. not a psuedogene). Note that sequences from two G. max accessions are included (with prefixes Gmw and Gmp) and putatively allelic pairs should not be confused with local duplications. The sequences printed in gray are derived from GmwH1 but are outside the 1Mb target region.    supported by FEL (p<0.05) the colored residue is printed in normal type, when the prediction is also supported by SLAC (p<0.05) and REL (BF>50) the residue is printed in bold. A specific amino acid is shown in the consensus sequence when it is present in at least 51% of the sequences represented at that location in the alignment (gaps excluded). x indicates all other sites in the alignment. A, CC Domain. Underlined residues are predicted to form coiledcoils (probabilities of >95% and >89% for the 1st and 2nd underlined regions, respectively).
The predicted position of amino acids within each CC heptad repeat is indicated with lowercase letters. B, NB-ARC Domain. Boxed residues indicate previously defined conserved motifs (from N-terminal end: P loop, kin2, GLPL, RNBS-D, MHD). C, LRR Domain. Individual leucine-rich repeats shown on separate lines. The predicted solventwww.plantphysiol.org on August 27, 2017 -Published by Downloaded from Copyright © 2012 American Society of Plant Biologists. All rights reserved. exposed face is boxed. D, Consensus sequence for intracellular LRRs. The region predicted to form part of a solvent-exposed face is underlined.