Identification of genes escaping X inactivation by allelic expression analysis in a novel hybrid mouse model

X chromosome inactivation (XCI) is a female-specific mechanism that balances gene dosage between the sexes: one X chromosome in females is inactivated during early development. Despite this silencing, a small proportion of genes escape inactivation and remain expressed from the inactive X (Xi). Little is known about the distribution of escape from XCI in different tissues in vivo and about the mechanisms that control tissue-specific differences. Using a new binomial model in conjunction with a mouse model with identifiable alleles and skewed X inactivation, we are able to survey genes that escape XCI in vivo. We show that escape from X inactivation can be a common feature of some genes, whereas others escape in a tissue-specific manner. Furthermore, we characterize the chromatin environment of escape genes and show that expression from the Xi correlates with factors associated with open chromatin and that CTCF co-localizes with escape genes. Here, we provide a detailed description of the experimental design and data analysis pipeline we used to assay allele-specific expression and epigenetic characteristics of genes escaping X inactivation. The data are publicly available through the GEO database under accession numbers GSM1014171, GSE44255, and GSE59779. Interpretation and discussion of these data are included in a previously published study (Berletch et al., 2015) [1].

Table notes (table body not recovered): [4] (2) re-analyzed using our new binomial model. (b) Analysis was done using two replicates of DNase-seq data deposited by ENCODE.

Value of the data
The data describe tissue-specific escape gene profiles, which could lead to insights into the contribution of escape genes to sex chromosome aneuploidies such as Turner syndrome and to sex differences in general.
The data show that tissue-specific escape genes have tissue-specific functions, hinting at potential roles for X-linked bi-allelic expression.
Analysis of chromatin architecture at escape genes will lead to a better understanding of the mechanisms underlying bi-allelic X-linked gene expression.

Data, experimental design, materials and methods
Random inactivation of one of the X chromosomes in mammalian females takes place during early development and is associated with Xist coating, accumulation of repressive histone modifications and a specific alteration of chromatin architecture. A few genes escape random X inactivation and remain bi-allelically expressed throughout the life of the organism. These data are associated with the research article focused on identifying escape genes in multiple tissue types in vivo, and on investigating a mechanism that contributes to expression from the inactive X using a novel hybrid mouse model.

in vivo mouse model
Increased Xist expression during development is necessary for normal X inactivation. Several regions of the Xist gene contain repetitive elements that are conserved between mouse and human. The proximal A-repeat has been shown to be necessary for Xist expression and thus for X inactivation [2]. To derive a mouse model with skewed X inactivation, we took advantage of a previously described mouse line in which the proximal A-repeat of Xist is deleted (XistΔ; B6.Cg-Xist<tm5Sado>, RIKEN) [2]. We bred heterozygous C57BL/6 (BL6) females carrying the mutant Xist (XistΔ/+) to wild-derived Mus spretus males. The mutant maternal X (Xm) fails to inactivate; thus, the resulting F1 females that inherit the mutant Xm have identifiable alleles and completely skewed X inactivation, in which the Xi is always the paternal spretus X (Xp) chromosome (Fig. 1).

Fig. 1. F1 mouse model with skewed X inactivation and identifiable alleles. C57BL/6 females with a Xist deletion [2] were mated with wild-derived Mus spretus males. In the resulting F1 females, the BL6 maternal X chromosome that carries the Xist mutation (Xm XistΔ) cannot become inactivated, leading to complete skewing of X inactivation in which the paternal spretus X chromosome (Xp) is inactivated in all tissues. Shown are UCSC genome browser tracks of mRNA SNP read distribution profiles on the active Xm (blue) and the inactive Xp (green) in brain, spleen and ovary at the Xist gene. Note that Xist is expressed from the spretus Xp in all tissues analyzed.

Mice were genotyped to verify mutant status using previously described primers [2]. F1 females heterozygous for the Xist mutation were euthanized at 8 weeks of age. Chromatin, DNA and RNA from whole brain, spleen and ovaries of mutant F1 females (Xm XistΔ/Xp) were isolated according to the protocols described below. Skewing of X inactivation was verified by Sanger sequencing as previously described [1].

in vitro cell culture
Patski cells were derived from the kidney of an 18 dpc F1 female embryo from a cross between a M. spretus male and a C57BL/6 female carrying an Hprt mutation. Briefly, embryonic kidney cells were selected in medium containing hypoxanthine, aminopterin and thymidine (HAT). Only cells with a functional Hprt gene are able to survive in HAT medium. Thus, only F1 embryonic kidney cells that contained an Xa from spretus and an Xi with the mutant allele (BL6 Hprt−) survived. X inactivation in the selected cells (Patski cells) was completely skewed, with the spretus X always active [3]. After selection, the Patski cell line was immortalized and cultured in Dulbecco's Modified Eagle's Medium (DMEM) with 10% FBS and 1% penicillin/streptomycin at 37°C in 5% CO2.

RNA extraction and RNA-seq
Total RNA was extracted from homogenized XistΔ/+ hybrid tissues or from Patski cell pellets using Qiagen's RNeasy mini kit (Qiagen 74104). The RNA-seq library of Patski cells was prepared in our previous study using the Illumina mRNAseq preparation kit [4]. For the RNA-seq libraries of mouse tissues, we used the Illumina TruSeq RNA preparation kit (Illumina RS-930-2001) with slight modifications. For mRNA isolation, 0.5-4 μg of total RNA was diluted to a total volume of 30 μl with nuclease-free water and then combined with 30 μl of oligo dT magnetic beads. After incubation at 65°C for 5 min, beads were washed with 120 μl washing buffer and eluted with 30 μl elution buffer. Following a second round of mRNA selection, mRNA was eluted and fragmented at 94°C for 8 min. First-strand cDNA synthesis, second-strand cDNA synthesis and end repair were performed according to the TruSeq instructions, except that all clean-up steps were done using Qiagen mini-elute columns (Qiagen 28004). A-tailing was done according to the TruSeq protocol, incorporating the version 2 changes and substituting Qiagen mini-elute columns for AMPure XP beads as described above. Further processing, including adapter ligation and fragment enrichment, was done according to the TruSeq protocol, except that clean-up steps were done with Qiagen columns. Libraries were sequenced on an Illumina Genome Analyzer IIx, generating 36 bp single-end reads, which were analyzed as described below.

Chromatin immunoprecipitation and ChIP-seq
To study the chromatin characteristics at escape genes in F1 tissues and Patski cells, we carried out ChIP using specific antibodies for CTCF and RNA PolII-S5p. F1 tissues were collected and homogenized on ice in a pre-chilled 7 ml glass homogenizer in 5 ml PBS containing 0.1 mM phenylmethylsulfonyl fluoride (PMSF). The tissue homogenates were transferred to a 15 ml tube and pelleted, followed by re-suspension in 10 ml PBS. Samples were cross-linked in 1% formaldehyde for 10 min at room temperature, and glycine was added to a final concentration of 125 mM to quench crosslinking. After washing with ice-cold PBS containing 0.1 mM PMSF, cross-linked cells were re-suspended in immunoprecipitation (IP) buffer [5] containing 0.1 mM PMSF and separated into 1 ml aliquots, which were either stored at −80°C or lysed in IP buffer for 10 min on ice. For PolII-S5p ChIP, each 1 ml of IP buffer was also supplemented with the following phosphatase inhibitors: 10 μl of 1 M β-glycerophosphate, 10 μl of 1 M NaF and 1 μl of 100 mM Na3VO4. Chromatin was sheared by sonication (10 rounds of fifteen 1 s pulses at power 6; 2 min rest on ice between rounds) using a Misonix 3000 sonicator. Chromatin concentration was measured using a Nanodrop, followed by pre-clearing with 100 μl protein A beads (GE Healthcare Life Sciences) for 1 h at 4°C. All of the steps described above were also done for Patski cells, with the exception of the initial homogenization.
For IP, 100 μg of cross-linked pre-cleared chromatin was combined with either 10 μg anti-CTCF (Millipore, 07-729) or 5 μg anti-PolII-S5p (Abcam, ab5131) antibody and incubated in 1 ml IP buffer overnight at 4°C with rotation. Ten percent of the cross-linked pre-cleared chromatin (10 μg) was saved as the input control. A mock ChIP (no-antibody control) was performed simultaneously. To collect DNA-protein-antibody complexes, immunoprecipitated chromatin was incubated with 100 μl protein A beads at 4°C for 2 h with rotation, followed by washes in buffers with varying salt concentrations (low salt: 0.1% SDS, 1% Triton-X-100, 20 mM EDTA, 20 mM Tris-HCl, 150 mM NaCl; high salt: 0.1% SDS, 1% Triton-X-100, 20 mM EDTA, 20 mM Tris-HCl, 500 mM NaCl; LiCl buffer: 250 mM LiCl, 1% NP-40, 1% deoxycholic acid, 1 mM EDTA, 10 mM Tris pH 8.1) and then by two TE washes to remove any residual salt. All wash steps were done at 4°C. Complexes were eluted from the beads in elution buffer (1% SDS; 100 mM NaHCO3) twice for 20 min each at room temperature. To reverse the crosslinks, eluted complexes were incubated in elution buffer supplemented with 200 mM NaCl overnight at 65°C. ChIP DNA was purified using QIAquick PCR purification columns (Qiagen 28106). An aliquot of ChIP and input DNA was subjected to PCR for the housekeeping gene β-actin to confirm specificity of the pull-down compared to the no-antibody mock ChIP control.
Purified DNA from CTCF or PolII-S5p ChIP experiments in brain, as well as from PolII-S5p ChIP experiments in Patski cells, was used for library preparation according to the Illumina TruSeq ChIP preparation protocol, with the exception that all AMPure XP bead purification steps were replaced by Qiagen mini-elute columns. Briefly, ChIP DNA was repaired to generate blunt ends, followed by the addition of an "A" nucleotide to each end. Adapters were ligated and 300-650 bp fragments were selected in a 1.5% low-melting agarose gel, followed by column purification. After PCR enrichment of purified fragments, completed libraries were sequenced on an Illumina HiSeq 2000 machine. Purified DNA from CTCF ChIP experiments in Patski cells was used for library preparation according to the Illumina genomic DNA preparation protocol and sequenced on an Illumina Genome Analyzer IIx. Sequenced libraries were analyzed as described below.

Mapping and allele-assignment of sequencing reads
To identify reads that map to each parental genome of the F1 mice, we first assembled a "pseudo-spretus" genome by substituting known spretus SNPs into the BL6 NCBIv37/mm9 reference genome. Spretus SNPs were obtained from the Sanger Institute (SNP database, Nov/2011 version) and from in-house analysis [4]. In total, we collected 1,532,011 heterozygous SNPs on the X chromosome, of which 31,062 were located in exonic regions. A more detailed breakdown of SNP location can be found in Table 2.
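The substitution step described above can be illustrated with a minimal sketch. The function name, the in-memory dictionary representation of the genome and the toy SNP tuples are assumptions made for illustration only; they are not the pipeline used in the study, which operated on the full mm9 reference.

```python
def build_pseudo_genome(reference, snps):
    """Return a copy of `reference` with SNP alleles substituted in.

    reference: dict mapping chromosome name -> sequence string.
    snps: iterable of (chrom, pos, ref_allele, alt_allele) with 1-based pos.
    Both structures are hypothetical stand-ins for a FASTA file and a SNP table.
    """
    pseudo = {chrom: list(seq) for chrom, seq in reference.items()}
    for chrom, pos, ref_allele, alt_allele in snps:
        idx = pos - 1  # convert 1-based coordinate to 0-based index
        current = pseudo[chrom][idx]
        if current.upper() != ref_allele.upper():
            raise ValueError(
                f"{chrom}:{pos} reference base {current!r} != expected {ref_allele!r}"
            )
        # Keep the case of the reference base (e.g. soft-masked repeats).
        pseudo[chrom][idx] = alt_allele.lower() if current.islower() else alt_allele.upper()
    return {chrom: "".join(bases) for chrom, bases in pseudo.items()}


# Toy example: two SNPs substituted into a 10 bp "chromosome".
toy_reference = {"chrX": "ACGTACGTAC"}
toy_snps = [("chrX", 2, "C", "T"), ("chrX", 7, "G", "A")]
pseudo_spretus = build_pseudo_genome(toy_reference, toy_snps)
print(pseudo_spretus["chrX"])  # ATGTACATAC
```

Because only single-base substitutions are applied, coordinates in the pseudo-genome stay identical to those of the reference, which keeps read positions comparable between the two alignments.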

RNA-seq analysis and identification of escape genes
We calculated diploid gene expression based on all high-quality uniquely mapped reads using cufflinks/v2.0.2 [8] to determine the gene-level RPKM (reads per kb of exon length per million mapped reads) expression values. In addition, we defined SNP-based haploid gene expression from alleles on the Xi or the Xa (Xi-SRPM or Xa-SRPM) to be allele-specific SNP-containing exonic reads per 10 million high-MAPQ uniquely mapped reads.
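As a small worked example of the two normalizations defined above, the following sketch computes a plain RPKM and an SRPM value. The function names and the read counts are hypothetical; note also that Cufflinks estimates expression with its own statistical model, so the `rpkm` helper only illustrates the definition of the unit, not the Cufflinks computation.

```python
def rpkm(reads_in_gene, exon_length_bp, total_mapped_reads):
    """Reads per kb of exon length per million mapped reads (definition only)."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)


def srpm(allele_snp_reads, total_unique_reads, scale=10_000_000):
    """Allele-specific SNP-containing exonic reads per 10 million mapped reads."""
    if total_unique_reads <= 0:
        raise ValueError("total mapped reads must be positive")
    return allele_snp_reads * scale / total_unique_reads


# Hypothetical gene: 500 reads over 2 kb of exon in a 25 M read library,
# with 40 SNP-containing reads assigned to the Xi in a 20 M read library.
print(rpkm(500, 2000, 25_000_000))  # 10.0
print(srpm(40, 20_000_000))         # 20.0
```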
To identify escape genes and estimate the statistical confidence of the escape probability of X-linked genes, we developed the following binomial model. For each gene $i$ on chromosome X, let the numbers of allele-specific RNA-seq reads mapped to the inactive and active X chromosomes be $n_{i0}$ and $n_{i1}$, respectively, and let $n_i = n_{i0} + n_{i1}$. We model $n_{i0}$ by a binomial distribution:

$$n_{i0} \sim \mathrm{Binomial}(n_i, p_i),$$

where $p_i$ indicates the expected proportion of reads from the Xi. The estimate of the binomial proportion is $\hat{p}_i = n_{i0}/n_i$. Let $z_{\alpha/2}$ be the $100(1-\alpha/2)$th percentile of $N(0,1)$; the confidence interval of each $\hat{p}_i$ is constructed from $z_{\alpha/2}$ at level $1-\alpha$.

Inclusion of a mapping bias correction was necessary because our mouse model uses distantly related species and does not allow analysis of reciprocal crosses [1]. To incorporate the mapping bias toward the BL6 genome over the pseudo-spretus genome into the above model, we define the mapping bias ratio $r_m$ for each RNA-seq experiment as $r_m = N_{A0}/N_{A1}$, where $N_{A0}$ and $N_{A1}$ are the numbers of allele-specific autosomal reads in the "inactive X containing" genome and the "active X containing" genome, respectively. The estimate of $p_i$ is then corrected using $r_m$, and the upper and lower confidence limits are corrected accordingly.

For each RNA-seq experiment, we defined an X-linked gene as an "escape" gene using the following three criteria: (1) the 99% lower confidence limit ($\alpha = 0.01$) of the escape probability $p_i$ was greater than zero, (2) the diploid gene expression measured by RPKM was ≥ 1, indicating that the gene was expressed, and (3) the normalized Xi-SRPM was ≥ 2 but < 5 (low-level escape) or ≥ 5 (high-level escape). Biological replicates of RNA-seq experiments were analyzed separately. The Xi-SRPM values are highly correlated between biological replicates (R² > 0.9) (Fig. 2).
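A minimal sketch of how this model could be evaluated for a single gene is given here. The exact interval and correction formulas are not spelled out in this text, so the sketch assumes a normal-approximation (Wald) interval and a corrected estimate of the form $n_{i0}/(n_{i0} + r_m n_{i1})$. Both forms, along with the function and argument names, are assumptions for illustration and may differ from the published pipeline.

```python
import math
from statistics import NormalDist


def escape_estimate(n_xi, n_xa, autosomal_xi_genome, autosomal_xa_genome, alpha=0.01):
    """Estimate the Xi read proportion for one gene with a confidence interval.

    n_xi, n_xa: allele-specific reads on the inactive / active X (n_i0, n_i1).
    autosomal_*: allele-specific autosomal read totals (N_A0, N_A1).
    ASSUMPTIONS: Wald interval; bias correction rescales active-X reads by r_m.
    """
    n = n_xi + n_xa
    if n == 0:
        raise ValueError("gene has no allele-specific reads")
    r_m = autosomal_xi_genome / autosomal_xa_genome      # mapping bias ratio
    p_raw = n_xi / n                                     # uncorrected estimate
    p_corr = n_xi / (n_xi + r_m * n_xa)                  # assumed corrected form
    z = NormalDist().inv_cdf(1 - alpha / 2)              # z_{alpha/2}
    half_width = z * math.sqrt(p_corr * (1 - p_corr) / n)
    return {
        "r_m": r_m,
        "p_raw": p_raw,
        "p_corrected": p_corr,
        "lower": max(0.0, p_corr - half_width),
        "upper": min(1.0, p_corr + half_width),
    }


# Hypothetical gene: 20 Xi reads, 180 Xa reads, and a 10% autosomal mapping
# deficit for the genome that carries the inactive X.
result = escape_estimate(20, 180, 900_000, 1_000_000)
print(result)
```

With these toy counts the lower 99% limit stays above zero, so the gene would pass the first of the escape criteria; with far fewer reads the interval widens and the lower limit is clipped at zero.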
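The three criteria can be expressed as a short decision function. This is a sketch only: the function name and the label returned for genes that fail a criterion are hypothetical, and the inputs are assumed to have been computed beforehand for one RNA-seq experiment.

```python
def classify_escape(lower_conf_limit, diploid_rpkm, xi_srpm):
    """Apply the three escape criteria to one gene in one experiment.

    lower_conf_limit: 99% lower confidence limit of the escape probability.
    diploid_rpkm: gene-level RPKM from all uniquely mapped reads.
    xi_srpm: normalized SNP-based expression from the inactive X.
    """
    if lower_conf_limit <= 0:      # criterion 1: lower limit must exceed zero
        return "not called escape"
    if diploid_rpkm < 1:           # criterion 2: gene must be expressed
        return "not called escape"
    if xi_srpm < 2:                # criterion 3: minimum Xi expression
        return "not called escape"
    return "low-level escape" if xi_srpm < 5 else "high-level escape"


# Hypothetical genes
print(classify_escape(0.05, 12.0, 3.1))   # low-level escape
print(classify_escape(0.20, 30.0, 14.0))  # high-level escape
print(classify_escape(0.00, 30.0, 14.0))  # not called escape
```

Since biological replicates were analyzed separately, such a function would be applied once per replicate rather than to pooled counts.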

Table column headings (table body not recovered): Sample; Protein; Total peaks (a); Total X-linked peaks (b); Xa-preferred (c); Xi-preferred (c); Both-preferred (c).

Next, we identified allele-specific ChIP-seq peaks using the following binomial test. In each diploid ChIP-seq peak region $i$, we assumed that the numbers of BL6-SNP reads ($n_{i,bl}$) and spretus-SNP reads ($n_{i,sp}$) within the peak follow a binomial distribution, i.e.,

$$n_{i,bl} \sim \mathrm{Binomial}(n_i, p_i),$$

where $n_i = n_{i,bl} + n_{i,sp}$ is the sum of BL6-SNP reads and spretus-SNP reads in peak region $i$, and $p_i$ is the binomial parameter. Since the X chromosome behaves differently from autosomes due to skewed XCI in our systems, we estimated the X chromosome allelic background using all SNP reads in the identified diploid peak regions on the X only. That is, for peaks on the X chromosome, $p_i = N_{X,bl}/(N_{X,bl} + N_{X,sp})$, in which $N_{X,bl}$ and $N_{X,sp}$ are the total numbers of BL6-SNP and spretus-SNP reads in X peaks, respectively. Finally, BL6-preferred ChIP-seq peaks were defined as those that contain significantly more BL6-SNP reads (upper-tail binomial test, p-value < 0.05), spretus-preferred ChIP-seq peaks were identified using the lower-tail binomial test (p-value < 0.05), and both-preferred ChIP-seq peaks were those that were not significant in either of the two tests (p-value ≥ 0.25) (Table 4). In addition, we required allele-assessable peaks to have a minimum SNP read coverage of one allele-specific read (BL6-SNP plus spretus-SNP reads) per 10 million mapped reads.
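The peak classification can be sketched with exact binomial tail probabilities computed from the standard library. The function names, the "unclassified" label for peaks falling between the two thresholds, and the read counts are hypothetical, and the background proportion is assumed to be the BL6 fraction of all SNP reads in X-linked peaks. The minimum-coverage filter is not included.

```python
import math


def _log_pmf(k, n, p):
    """Log of the binomial probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_tail(k, n, p, upper):
    """P(X >= k) if upper else P(X <= k), for X ~ Binomial(n, p)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return min(1.0, sum(math.exp(_log_pmf(j, n, p)) for j in ks))


def classify_peak(n_bl, n_sp, p_background, sig=0.05, both_min=0.25):
    """Label one peak from its BL6-SNP and spretus-SNP read counts."""
    n = n_bl + n_sp
    p_upper = binom_tail(n_bl, n, p_background, upper=True)
    p_lower = binom_tail(n_bl, n, p_background, upper=False)
    if p_upper < sig:
        return "BL6-preferred"
    if p_lower < sig:
        return "spretus-preferred"
    if p_upper >= both_min and p_lower >= both_min:
        return "both-preferred"
    return "unclassified"  # hypothetical label for intermediate p-values


# Hypothetical X-wide totals giving a background proportion of 0.5
total_bl, total_sp = 1000, 1000
p_bg = total_bl / (total_bl + total_sp)
for counts in [(30, 2), (2, 30), (16, 16), (20, 12)]:
    print(counts, classify_peak(*counts, p_bg))
```

In the F1 tissues the BL6 X is active and the spretus X is inactive, whereas the reverse holds in Patski cells, so mapping BL6-preferred and spretus-preferred onto Xa-preferred and Xi-preferred depends on the sample.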