Evaluation of IRX Genes and Conserved Noncoding Elements in a Region on 5p13.3 Linked to Families with Familial Idiopathic Scoliosis and Kyphosis

Because of genetic heterogeneity present in idiopathic scoliosis, we previously defined clinical subsets (a priori) from a sample of families with idiopathic scoliosis to find genes involved with spinal curvature. Previous genome-wide linkage analysis of seven families with at least two individuals with kyphoscoliosis found linkage (P-value = 0.002) in a 3.5-Mb region on 5p13.3 containing only three known genes, IRX1, IRX2, and IRX4. In this study, the exons of IRX1, IRX2, and IRX4, the conserved noncoding elements in the region, and the exons of a nonprotein coding RNA, LOC285577, were sequenced. No functional sequence variants were identified. An intrafamilial test of association found several associated noncoding single nucleotide variants. The strongest association was with rs12517904 (P = 0.00004), located 6.5 kb downstream from IRX1. In one family, the genotypes of nine variants differed from the reference allele in all individuals with kyphoscoliosis, and two of three individuals with scoliosis, but did not differ from the reference allele in all other genotyped individuals. One of these variants, rs117273909, was located in a conserved noncoding region that functions as an enhancer in mice. To test whether the variant allele at rs117273909 had an effect on enhancer activity, zebrafish transgenesis was performed with overlapping fragments of 198 and 687 bp containing either the wild type or the variant allele. Our data suggests that this region acts as a regulatory element; however, its size and target gene(s) need to be identified to determine its role in idiopathic scoliosis.

In an effort to reduce the genetic heterogeneity present in FIS, we previously created subsets from our collection of FIS families based on either the most-likely mode of inheritance or on different clinical criteria (Justice et al. 2003;Miller et al. 2005Miller et al. , 2006. One of our clinically defined subgroups, the kyphoscoliosis (KS) group, consisted of non-Hispanic white families (seven families, 53 individuals) with two or more individuals having a scoliotic curve $ 10°in combination with a thoracic curve $ 40° (Miller et al. 2006). The prevalence of KS in the United States is estimated to be one in 1000 people for lateral and posterior spinal curvatures . 35°, and one in 10,000 for more severe KS, defined by lateral and posterior curvatures . 70° (Bergofsky 1979). Previous linkage analysis of the KS subgroup identified candidate regions on chromosomes 5p15.3, 13q13.3, and 13q32, and analyses of single nucleotide variants (SNVs) narrowed the linkage region on 5p15.3 to about 3.5 Mb (P-value = 0.002) (Miller et al. 2006). The only genes present in this region are IRX1, IRX2, IRX4, and C5orf38, a protein precursor in the promoter region of IRX2, which has coordinated expression with IRX2 and lacks homology to any known protein in the public databases (Wu et al. 2006).
The Iroquois (IRX) genes are members of a highly conserved gene family, the TALE (three-amino acid-loop extension) homeobox genes that encode homeoproteins (Bürglin 1997). Vertebrate IRX genes have been found to play a role in neural tube, heart, and ectoderm patterning (Cavodeassi et al. 2001;Gómez-Skarmeta and Modolell 2002). In humans, the IRX genes are clustered into two groups of three genes believed to be the result of a segmental duplication, because IRX1, IRX2, and IRX4 on 5p are paralogs of IRX3, IRX5, and IRX6 respectively, on chromosome 16q (Bosse et al. 2000;Gómez-Skarmeta and Modolell 2002;Ogura et al. 2001;Peters et al. 2000). This region on 16q was previously found to be linked to FIS (P-value , 0.0005) in a sample of 202 FIS families when the threshold for scoliosis was $ 30° (Miller et al. 2005).
Our previous linkage results suggest the involvement of the IRX gene family and/or transcriptional enhancers within the highly conserved noncoding elements (CNE) surrounding the IRX genes on 5p in the expression of FIS (Miller et al. 2006). The IRX genes are surrounded by a large amount of CNEs, which share sequence similarity between regions on the same chromosome, as well as with regions on 16q (Bejerano et al. 2004;de la Calle-Mustienes et al. 2005;McEwen et al. 2006;Sandelin et al. 2004;Woolfe et al. 2005). Several vertebrate CNE regions have been tested in vivo in zebrafish and mouse, and many of these have been shown to function as tissue-specific enhancers (de la Calle-Mustienes et al. 2005;Goode et al. 2005;McEwen et al. 2006;Nobrega et al. 2003;Woolfe et al. 2005).
In this study, we examined whether the IRX genes or the surrounding regulatory elements are involved in the development of FIS by sequencing the exons and the CNEs located 500 kb upstream and downstream from IRX1, IRX2, and IRX4. Intrafamilial tests of association were used to identify significantly associated SNVs. In addition, a SNV in a highly conserved noncoding region associated with FIS in one family was selected to test for functional significance. Zebrafish transgenesis was carried out in order to determine if the CNE surrounding this SNV, rs117273909, acts as an enhancer in vivo, and if the alternate allele affects the enhancer function.

Subjects
All probands and their relatives were clinically characterized by a single orthopedic surgeon. Written informed consent was obtained for all study participants, in accordance with the Institutional Review Board of the participating institutions.
Primer design for genes and conserved noncoding elements Primers covering the IRX exons genes, the CNEs, and a long intergenic nonprotein RNA in the linkage region, LOC285577, were designed using Primer3 software (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www. cgi). CNEs were selected for sequencing if they had a LOD score . 100 based on the PhastCons Placental Mammal Conserved Elements, 28-way Multiz Alignment (http://www.genome.ucsc.edu), and 500 kb 6 from IRX1, IRX2, and IRX4 (Supplemental Material, Figure S1 and Table S1). For rs117273909, genotyping of the controls was performed by sequencing. Sequences of primers are available on request.

PCR and sequencing
For mutational analysis of the genes, PCR was performed on DNA from 53 individuals from seven FIS families from the KS clinical subset using n DNA isolated from blood samples of 100 controls consisting of individuals who married into FIS families, and who did not have FIS, were amplified with rs117273909 primers using GeneAmp High Fidelity PCR System (Applied Biosystems, Grand Island, NY). The products were sequenced on an Applied Biosystems / Hitachi 3730 Genetic Analyzer. Sequencing analysis was performed using Sequencing Analysis version 5.2, and Sequence Scanner version 1.0 (both from Applied Biosystems, Grand Island, NY). Alignments of DNA sequences were done with SeqScape (Applied Biosystems, Grand Island, NY), Sequencher (Gene Codes Corporation, Ann Arbor, MI), and CodonCode Alignment software (v 3.7.1.1).

Statistical methods
Data cleaning was carried out on 344 CNE SNVs and 70 insertions/ deletions from 46 individuals. Individuals with a genotype missing rate . 10%, and variants (SNVs, insertions and deletions) with either a Polyphred (v6.11) value , 99, a missing rate . 10%, and/or two or more Mendelian inconsistencies were removed. The Polyphed program identifies heterozygous single nucleotide substitutions, and assigns scores ranging from 99 to 0 to each heterozygous site, where a score of 99 indicates a very good fit and stands for a true positive rate of . 97%. Mendelian inconsistencies were tested using PEDCHECK (O'Connell and Weeks 1998). SNVs that became monomorphic after these steps were removed. Data cleaning reduced the number of individuals for analysis to 38, the number of SNVs to 197, and the number of insertions/deletions to 40. Of the remaining individuals, 22 had scoliosis (of which 14 had KS), 12 were unaffected and four had no curvature information. Tests of Hardy-Weinberg equilibrium (Wigginton et al. 2005) identified one SNV (hg19, chr5:3188028) with a P-value , 0.0001, which was retained for analysis because our sample was ascertained for familial idiopathic scoliosis and removal of SNVs in this situation can remove causative SNVs.
All SNVs were tested for association to the quantitative trait using ASSOC [S.A.G.E., v.6.0.1], a likelihood-based test of association that compares the likelihood of the data in models with and without a marker, and uses the phenotype and genotype information of the entire family. The degree of lateral curvature was analyzed as a quantitative phenotype, and genotypic (a/a, a/A, or A/A) and allelic (presence of minor allele) tests of association were performed.
Cloning of putative regulatory elements into zebrafish enhancer detection vector Site-directed mutagenesis was used to change the wild type rs117273909 C allele to the variant T allele (Bioinnovatise, Rockville, MD). Four constructs (198bp C allele, 198bp T allele, 687bp C allele, and 687bp T allele) were cloned into the zebrafish enhancer detection (ZED) vector (Bessa et al. 2009), which has insulators that prevent false positive expression due to position effects. Tol2 mRNA was synthesized with the mMessage mMachine SP6 kit (Ambion, Grand Island, NY) using NotI linearized pCS2FA-transposase vector as template DNA (Kwan et al. 2007).

Zebrafish transgenesis
Microinjections of all four constructs were performed into one-to two-cell stage zebrafish embryos using 25-45 pg of plasmid DNA mixed with 50 pg of Tol2 mRNA. Embryos were incubated at 28°w ith 0.003% PTU (1-phenyl 2-thiourea) to suppress pigmentation. Embryos with transgene integration were identified by red fluorescent protein (RFP) expression in the skeletal muscle at 48 hr postfertilization (hpf), and green fluorescent protein (GFP) expression was observed in RFP positive embryos from 48 hpf to 5 d post fertilization (dpf). Embryos positive for both RFP and GFP were grown to adulthood. Several germline transmitting founders were identified for each construct, and their progeny were evaluated for patterns of GFP expression from 48 hpf to 5 dpf. Embryos from F2 generation of germline transmitting founders were also evaluated for GFP expression by crossing F1 adult fish with wildtype fish. The only difference in sequence between the C and T allele constructs was at the rs117273909 locus. Screening and imaging of embryos were performed using the Zeiss SteREO Lumar.V12 stereomicroscope with an AxioCam HRC color camera or Zeiss Axio Observer Z1 inverted microscope with an AxioCam MRm black and white camera. Figure 1 Regional allelic association plots of SNVs in CNEs surrounding IRX genes. The position of each SNV is indicated by the circles; diamond indicates the SNV with the most significant association (rs12517904, P-value = 0.00004). The color of the circles is the linkage disequilibrium (r 2 ) between each SNV and rs12517904, indicated by the color scale on the left. Gray circles indicate a lack of linkage disequilibrium between the SNV and rs12517904. Genes in the region are indicated in the lower box. Recombination rates (cm/Mb) are depicted by the purple vertical lines. The plot was created using Locus Zoom (https://statgen.sph. umich.edu/locuszoom).

Data availability
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Variant analysis of IRX genes exons and CNEs
Sequencing of the IRX1, IRX2, and IRX4 exons in 53 individuals did not identify any functional sequence variants. Sequencing of the CNEs surrounding the IRX genes in 46 individuals identified 197 SNVs suitable for analysis, of which 16 were 6 500 kb of IRX4, 70 were 6 500 kb of IRX2, and 160 were 6 500 kb of IRX1, with eight SNVs overlapping the conserved regions of IRX4 and IRX2 and 41 SNVs overlapping the conserved regions of IRX1 and IRX2. Of the 197 SNVs, 23 were novel.
Intrafamilial tests of association between the genotypes of the 197 SNVs and the quantitative trait (scoliotic curvature) resulted in 30 SNVs (Table 1) with nominal significance (P-values , 0.05). Only rs12517904 (P-value = 0.00004, Figure 1) was significant after adjusting for multiple testing (Bonferroni correction P-value = 0.00013). No association was identified with the 40 insertions/deletions.
We searched for variants cosegregating with KS. In one family, the genotypes of nine SNVs differed from the reference allele in all individuals affected with KS, and two out of three individuals with FIS (Table  2); unaffected family members, and all other individuals genotyped for this study, did not differ from the reference allele for any of these nine SNVs. The one family member with FIS that did not differ from the reference allele had a scoliotic curve of 36°, and no hyperkyphosis. Three of these SNVs were located in highly conserved regions, based on whole-genome alignment of vertebrates. Of these three, only rs117273909 (P-value = 0.039 for association analysis using all families genotyped in this study) was found to be conserved in all 96 vertebrate species in which this SNV and the corresponding CNE were present (www.genome.ucsc.edu). Rs117273909 is located in the 841 kb gene desert between IRX1 (413 kb downstream) and IRX2 (431 kb downstream), and thus could function as an enhancer for either IRX1 or IRX2, or for both.
We determined the frequency of the variant T allele of rs117273909 in a Caucasian population by sequencing 100 controls consisting of individuals who married into our FIS families, and who did not have a history of FIS. Of these 100 controls, 90 matched the reference sequence (C), one was heterozygous (C/T), and nine failed to amplify, resulting in a T allele frequency of 0.00515. In the 1000 Genomes Project (www.1000genomes.org), rs117273909 was found to be C/T in 27 of 5008 genotypes (T allele frequency of 0.00539), and heterozygous A/T in three out of 5008 genotypes.
Based on the association with KS (P-value = 0.039), the presence of a nonreference allele cosegregation with KS in a single family, the fact that rs117273909 is located in a noncoding fragment (element_603, hg19 chr5:3182218-3183271,VISTA Enhancer Browser, http://enhancer. lbl.gov), which drives expression in the hindbrain in five out of 10 transgenic mice embryos (Visel et al. 2007), and the highly conserved nature of this SNV, we gave this SNV priority in our effort to determine if this SNV was functional. We then used zebrafish transgenesis to evaluate the effect of a C to a T allele substitution (rs117273909).

Regulatory activity of CNE containing rs117273909
We performed zebrafish transgenesis to determine if changing the allele at rs117273909 from a C (wild type allele) to a T would result in a change in its regulatory activity. Tena et al. (2011) tested a 1172bp fragment encompassing rs117273909, which did not show any regulatory activity in Xenopus, possibly due to the presence of repressors: therefore, we chose to test smaller fragments for regulatory activity. For the zebrafish transgenesis assay, the wild type (C allele) and variant (T allele) versions of a 198 bp fragment (hg19, chr5:3182938-3183135), and an overlapping larger 687 bp fragment (hg19, chr5:3182466-3183152), both selected based on strong conservation across vertebrates, were tested for regulatory activity ( Figure S2). Despite strong RFP expression, no consistent GFP expression patterns were observed in the embryos injected with any of the four constructs ( Figure S3).
Due to the mosaic nature of the transgene expression during transient transgenesis, we generated multiple stable transgenic lines for all four constructs by screening founders for germline transmission. Variable GFP expression patterns, ranging from weak and diffused expression to strong and specific expression, were observed in the F1 and F2 progeny of multiple germline transmitting founders for each of the four constructs (Figure 2, A-P). These data may indicate positional effects of the transgene integration in independent founders for each of the four constructs. The specific GFP expression patterns were observed in pineal gland, pharyngeal arches, and brain. However, there was no association of these expression patterns with any particular construct.

DISCUSSION
Our results suggest that noncoding fragments as small as 198 bp can act as regulatory elements in zebrafish, which is in contrast to a report that a 1172 bp fragment (TA3235, 131:818307-820479 bp, xenTrot2) n  TT  TT  TT  TG  TT  -TG  TG  TG  TG  rs146531224  overlapping our 198 bp and 697 bp fragments did not drive expression in Xenopus (Tena et al. 2011). At present, there are no guidelines regarding the identification of regulatory regions, the size of these regions, or the type of activity present (repressors vs. enhancers). We were not able to determine in this study whether this regulatory region plays a role in the expression of FIS. Although we observed a strong expression trend with one of the two alleles at rs117273909, we did not observe a common pattern of expression among multiple lines, and thus could not rule out position effects. GFP expression occurring in different regions (the pharyngeal arches, pineal gland and brain) could be due to positional effects, something the ZED vector should minimize, or could be the result of the involvement of this regulatory region with a homeobox gene. The expression of our conserved fragment in the pineal gland is of interest, since pinealectomised chickens develop scoliosis (Machida et al. 1993). There appeared to be differences in the level of expression and the timing of GFP expression between the wild type and alternate allele at rs117273909. de la Calle-Mustienes et al. (2005) noticed that DNA sequence differences in zebrafish and mouse CNEs did not result in differences in where expression occurred, but affected the level and/or timing of the transcription. The change in timing and regulation may be due to the sequence degeneration, which interferes with loop formation. Tena et al. (2011) identified a three-dimensional architecture that forms through CCCTC-binding, present in the Irx clusters of mice, zebrafish, and Xenopus, which brings the Irx1/3 and Irx2/5 promoters together, and demonstrated that cis-regulatory elements in the Irx clusters interact with more than one Irx promoter, up to distances of 1.6 Mb. This looping mechanism may help facilitate the delivery of RNA polymerase, transactivators, and transcription factors to the promoter to the right tissue at the correct time.
In summary, tests of association identified several significant SNVs that may lie in regulatory regions that influence gene expression. These SNVs may play a role in the phenotype, but the relevance of each individual SNV can be determined only by in vitro and in vivo assays. It is also possible that unidentified SNVs in highly conserved regulatory regions further upstream from the IRX family may disrupt developmental patterning and be responsible for the variation of the degree of lateral and thoracic curvature in individuals in these families. Future work will focus on completely sequencing the region with the most significant association, as well as performing additional in vivo assays of regulatory regions surrounding the other associated SNVs identified in this study, including rs1251709, and the SNVs in LOC285577, of which little is known to date. Once we have a better understanding of what tissues are involved in the scoliosis phenotype, we will be able to define the regulatory regions using both in vitro and in vivo assays.