Chromosome Dosage Analysis in Plants Using Whole Genome Sequencing

[Abstract] Relative chromosome dosage, i.e., increases or decreases in the number of copies of specific chromosome regions in one sample versus another, can be determined using aligned read counts from Illumina sequencing (Henry et al., 2010). The following protocol was used to identify the different classes of aneuploids that result from uniparental genome elimination in Arabidopsis thaliana, including chromosomes that have undergone chromothripsis (Tan et al., 2015). Uniparental genome elimination results in the production of haploid progeny from crosses to specific strains called "haploid inducers" (Ravi et al., 2014). Chromothripsis, which was first discovered in cancer genomes, is a phenomenon that results in clustered, highly rearranged chromosomes. In plants, chromothripsis has been observed as a result of genome elimination (Tan et al., 2015). Detecting variation in chromosome dosage has multiple applications besides those linked to genome elimination. For example, a dosage variant population of poplar hybrids was created by gamma-irradiation of pollen grains. Hundreds of dosage lesions, both insertions and deletions, were identified using this technique and provide a way to associate loci with the phenotypic consequences observed in this population et al., 2015).

For help on the meaning of the different parameters, run: bin-by-sam.py -h
Input: Run the script in a directory containing the input _aln.sam files.
Output: One file with one line per non-overlapping, consecutive bin along each of the reference sequences and two columns for each input .sam file: one giving the number of reads mapping to the bin and the other giving the corresponding dosage relative to the control.
After running this initial analysis, the read counts obtained can be used to choose an appropriate minimum bin size. As a rule of thumb, the bin size should give an average of at least 100 reads per bin (see Figures 2 and 3).
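The binning step described above can be illustrated with a minimal Python sketch. This is not the code of bin-by-sam.py; it only shows the idea of assigning each mapped read to a non-overlapping bin by its leftmost mapping position and then checking the average count per bin against the rule of thumb. The function names `bin_sam_reads` and `mean_reads_per_bin` and the example reads are hypothetical, and the mean is taken over bins that received at least one read, which is an assumption.

```python
from collections import Counter


def bin_sam_reads(sam_lines, bin_size):
    """Count mapped reads per (reference, bin index) from SAM text lines.

    Hypothetical helper for illustration; not part of bin-by-sam.py.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        ref = fields[2]
        pos = int(fields[3])  # 1-based leftmost mapping position
        if flag & 4 or ref == "*":
            continue  # skip unmapped reads
        counts[(ref, (pos - 1) // bin_size)] += 1
    return counts


def mean_reads_per_bin(counts):
    """Average read count over bins that received at least one read (assumption)."""
    return sum(counts.values()) / len(counts) if counts else 0.0


# Made-up example records (tab-separated SAM fields).
example_sam = [
    "@SQ\tSN:Chr1\tLN:300000",
    "r1\t0\tChr1\t1\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tChr1\t100000\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tChr1\t100001\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

example_counts = bin_sam_reads(example_sam, bin_size=100000)
example_mean = mean_reads_per_bin(example_counts)
```

With real data, a mean well below 100 would suggest rerunning the analysis with a larger bin size.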

Parameters
Required:
-o, output file name (for example "-o Dosage_100kb_control2.txt")
-s, bin size in bp (for example "-s 100000" for 100 kb bins)
Optional:
-c, to use a control sample for relative percent coverage calculations, specify the file name here. If no file is specified, the mean of all samples is used as the control value for each bin (Note 1).
-u, to use only samtools flagged unique reads (XT:A:U), i.e., reads that map to only one location in the genome.
-m, to specify the maximum number of mapping mismatches allowed for a read to be used.
-b, inserts empty lines between reference sequences in the result table for easier JMP parsing (do not use if the reference sequence contains more than a few major chromosomes or contigs).
-r, "remove file", a file containing a list of reference sequences to ignore, in the sam header format. An example file, Remove-Sample.txt, is included in the archive. This option can be useful, for example, if organelle sequences are included in the genomic reference sequence (Note 6).
-p, ploidy (default is 2, diploid); this value is used as the multiplier in the relative dosage calculation.
-C, coverage only mode, which only outputs the read counts columns for each library, but not the relative dosage columns. This option cannot be used when a control library is specified.
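The way the -c and -p options interact can be sketched in Python. The exact formula used by bin-by-sam.py is not given here, so the following is an assumption: each bin's share of the sample's reads is divided by the same bin's share of the control's reads, and the ratio is multiplied by the ploidy. The function names `relative_dosage` and `mean_control`, and the choice to average raw counts when no control is specified, are hypothetical.

```python
def relative_dosage(sample_counts, control_counts, ploidy=2):
    """Per-bin dosage of a sample relative to a control (assumed formula).

    dosage = ploidy * (sample fraction in bin) / (control fraction in bin)
    Bins with no control reads are returned as None.
    """
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    if sample_total == 0 or control_total == 0:
        raise ValueError("sample and control must each contain at least one read")
    dosage = []
    for s, c in zip(sample_counts, control_counts):
        if c == 0:
            dosage.append(None)  # ratio undefined without control coverage
            continue
        dosage.append(ploidy * (s / sample_total) / (c / control_total))
    return dosage


def mean_control(all_sample_counts):
    """Per-bin mean across samples, used when no control file is given (assumption)."""
    return [sum(bin_counts) / len(bin_counts) for bin_counts in zip(*all_sample_counts)]


# Made-up counts for four bins: the second bin has an extra copy in the
# sample and the fourth bin has lost a copy, relative to a diploid control.
sample = [100, 150, 100, 50]
control = [200, 200, 200, 200]
dosage = relative_dosage(sample, control, ploidy=2)
pooled = mean_control([[100, 200], [300, 400]])
```

Under this assumed formula, a bin at the expected copy number sits at the ploidy value, so a diploid sample plots around 2 and gains or losses appear as steps toward 3 or 1.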

Data analysis
The [sample]/control columns are plotted as an Overlay Plot in JMP for visualization (Figure 1).
Other software platforms with graphing functions, such as R, can also be used instead of JMP to generate the overlay plots for each [sample]/control column.
6. During the analysis, it is important to compare samples. In our experience, some regions of the genome exhibit variability in the dosage plots even in control samples, for example pericentromeric regions or other repeated regions (Figure 1). This is particularly relevant when mapping reads from one species or variety to a reference sequence from a closely related yet different species. Additionally, in some species, regions similar to organellar sequences are included in the genomic reference sequence. Because variable amounts of organellar DNA are often co-purified with the genomic DNA, such regions exhibit wide variation in coverage. These types of variable regions are normally easy to identify because they vary in opposite directions in different samples, and they should be discarded from the analysis. If the reference sequence fasta file contains one or two organellar genome sequences, these can be removed using the -r option or omitted when plotting relative dosage.
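The screening for variable regions described above is done by eye on the overlay plots, but the criterion can be sketched in Python as a rough aid. This is not part of the published protocol: the function name `flag_variable_bins`, the tolerance of 0.3 copies and the example values are all hypothetical. A bin is flagged only when at least one sample is above the expected dosage and at least one other is below it, which is one simple reading of "vary in opposite directions".

```python
def flag_variable_bins(dosage_by_sample, expected=2.0, tolerance=0.3):
    """Return indices of bins where samples deviate in opposite directions.

    dosage_by_sample maps a sample name to its list of per-bin dosage values
    (None for bins without a value). The tolerance is an arbitrary example value.
    """
    if not dosage_by_sample:
        return []
    n_bins = len(next(iter(dosage_by_sample.values())))
    flagged = []
    for i in range(n_bins):
        values = [d[i] for d in dosage_by_sample.values() if d[i] is not None]
        above = any(v > expected + tolerance for v in values)
        below = any(v < expected - tolerance for v in values)
        if above and below:
            flagged.append(i)
    return flagged


# Made-up dosage values for three samples over four bins.
example_dosage = {
    "sample1": [2.0, 2.6, 3.0, 2.1],
    "sample2": [2.1, 1.4, 2.0, 1.9],
    "sample3": [1.9, 2.0, 2.0, 2.0],
}
variable_bins = flag_variable_bins(example_dosage)
```

In this example only the second bin is flagged. The third bin, where a single sample is elevated and none is reduced, is left alone because it could be a genuine dosage variant; any flagged bin should still be checked on the plots before it is discarded.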