Rapid identification of a candidate gene related to fiber strength using a superior chromosome segment substitution line from Gossypium hirsutum × Gossypium barbadense via bulked segregant RNA-sequencing

Gossypium is the most widely cultivated commercial crop producing natural fiber around the world, and fiber strength principally determined during the secondary wall thickening period is a critical trait for fiber quality. Based on the developed BC 5 F 3:5 CSSLs (chromosome segment substitution lines) from G. hisutum CCRI36 × G. barbadense Hai1, the superior MBI9915 was chosen to construct the secondary segregated population BC 7 F 2 with its recurrent parent CCRI36, which was subjected to Bulk segregant RNA-sequencing (BSR-seq) for rapid identification of candidate genes related to fiber strength. Four fiber-transcriptome libraries were separately constructed and sequenced, including two parents (CCRI36 and MBI9915) and two extreme pools at 20 DPA (days post anathesis). Through multiple comparisons, 3742 DEGs (differentially expressed genes) and 3252 DEGs were separately identified between two parents and between two extreme pools, while 536 DEGs were overlapped between parent and extreme pool groups. A total of 831high-probability SNPs (single nucleotide polymorphism) were identified relevant to fiber strength between two extreme pools through allelic-polymorphism comparison in mRNA sequences, and 18 correlated regions with 1981 annotation genes were finally screened by linkage analysis with SNP-index method, of which including only 12 common genes differentially expressed both between two parents and two pools. Interesting, there was one correlated region consistent with the previous study with the same parents on chromosome A07 with 13-14 Mb, and one common DEG ( Gh_A07G0837 ) in the candidate region was identified in both parents and extreme pools, which has been reported to be involved in fiber strength development through regulating reactive oxygen species (ROS) activity. The reliability of BSR-seq results was validated by the quantitative real-time PCR (qRT-PCR) experiments on 5 DEGs at 20 DPA. This study focuses on bulked segregant analysis of the extreme pools from segregation population developed by superior CSSL and its recurrent parent, indicating that BSR-seq can be efficiently applied on rapid identification for candidate genes related to the significant quantitative traits, which provides valuable contributions for comprehension of fiber strength formation in cotton.


Introduction
Cotton is one of the most important commercial crops all over the world, providing the major natural fibers for the textile industry. There are only four existing cultivated species of Gossypium worldwide, including diploid G. herbaceum and G. arboretum (2n = 2 × = 26), and tetraploid G. hirsutum and G. barbadense (2n = 4 × = 52) (Wendel and Albert, 1992; Grover et al., 2008). With the development of modern industry and people's living standard, higher requirements has been put forward for modern textile mills to achieve better spinning performance, which mainly rely on more efficient spinning technologies, such as the high-speed and automated operations (Bradow and Davidonis, 2000).Therefore, how to obtain the high-quality fiber has been the vital goal in cotton breeding at the present stage, meanwhile remaining a challenging task for cotton researchers because of geneticbasis complexities of fiber formation.
As the most widely cultivated cotton species, G. hirsutum and G. barbadense can contribute more than 95% fiber production per year, of which the former harbors the characteristics of high yield and moderate fiber performance, while the latter has the advantages of the premium fiber quality and high resistance of Verticillium wilt, yet subjects to the disadvantage of low yield (Zhang et al., 2014;Shi et al., 2015). Chromosome segment substitution lines (CSSLs) arises to offer new thoughts in adequate utilization and integration of excellent traits derived from Upland cotton and Sea Island cotton, which also laid the theoretical foundation of developing new cotton varieties with high yield, fine quality, and disease resistance.
To start with a hybridization between the recipient and donor parents, the CSSLs are subjected to multi-generation backcrosses with the recurrent parent and subsequently multi-generation self-crosses, combining marker assisted selection (MAS) to ultimately obtain the Near-isogenic lines (NILs). In consideration of reducing the interference of complex genetic background in population, the CSSLs are the ideal materials appropriate for dividing the complex and variable quantitative traits or phenotypes into a single inheritance factor, which have been widely applied to perform quantitative traits loci (QTL) mapping in tomato (Lycopersicon esculentum) (Eshed and Zamir, 1994), rice (Oryza sativa) ( strategies have been applied to identify the genes responsible for fiber length and strength in cotton. Using re-sequencing technology and genome-wide association study (GWAS) strategy on a core collection of upland cotton, some significant candidate genes were identified, including Gh_A10G1256 associated with fiber length, and Gh_A07G1767, Gh_A07G1768, andGh_A07G1769 related to fiberstrength . Through fine mapping of stable QTL and quantitative real-time PCR (qRT-PCR) verification, a LRR-RLK (leucine-rich repeat protein kinase) family protein (Gh_A07G1749) was identified as a candidate gene for fiber-strength QTL qFS07.1 (Fang et al., 2017). Transcriptome analysis on three CSSLs derived from CCRI36 (G. hirsutum) and Hai1 (G. barbadense) during the secondary wall thickening period indicated three differentially expressed genes (DEGs), namely as XLOC_036333 (MNS1), XLOC_029945 (FLA8), and XLOC_075372 (snakin-1), might play the significant roles in regulating fiber strength .
With the combination of bulked segregant analysis (BAS) method and RNA-seq technology, BSR (bulked segregant RNA-seq) can not only identify the DEGs by comparing the two constructed extreme pools, but also screen the SNPs co-isolated with mutant genes by linkage analysis (Liu et al., 2012). Due to saving plenty of manpower, material resources, and shortening the breeding cycle, BSR-seq has attracted more and more attentions in recent years, which has been an efficient strategy for rapid QTL mapping and efficient identification of candidate genes in connection with key traits in plants (Liu et al., 2012;Trick et al., 2012;Wang et al., 2013a;Yates et al., 2014). Through performing BSA-seq on waterlogging sensitive and resistant pools in maize, a candidate gene (GRMZM2G055704) in the key QTL located in umc1619-umc1948onchromosome1 was identified to response waterlogging (Du et al., 2017). In order to screen the genes responsive to early and late flowering, BSR-seq association analysis on the extreme pools in tree peony obtained 291 unigenes, of which 7 DEGs (c42942.graph_c0, c58332.graph_c0, c58361.graph_c0, c57417.graph_c0, c46352.graph_c0, c53143.graph_c0, and c58526.graph_c0) were confirmed to relate with early and late flowering (Hou et al., 2018).
In the present study, the superior CSSL MBI9915 with high yield and excellent fiber quality was chosen to develop the secondary segregation population BC 7 F 2 with CCRI36, which was derived from BC 5 F 3:5 CSSL populations constructed by G.
barbadense Hai1 as the donor parent and G. hirsutum CCRI36 as the recipient and recurrent parent (Li et al., 2017). Due to concentration on the significant traits of fiber strength, we performed BSA-seq analysis on the fiber samples collected from two parents (MBI9915 and CCRI36) and two extreme pools at 20 DPA, respectively.
Plenty of DEGs were identified with comparisons between different samples, which subsequently underwent GO functional enrichment and KEGG pathway analysis.
Through SNP-index association analysis method and qRT-PCR technology, candidate genes were finally screened to control fiber strength. These results provide a rapid and efficient method to identify the candidate genes by BSR-seq, which will facilitate the revelation of molecular mechanism of fiber formation in cotton.

Plant materials
Two parents, the superior CSSL MBI9915 and its recurrent parent CCRI36, as well as their segregation population BC 7 F 2 were planted at the Anyang experimental farm of

RNA Isolation
Regarding the day tagging flowers as 0 DPA, 1 developing bolls of fiber samples were taken from MBI9915 and CCRI36 at 20 DPA were immediately immersed in ice.
Comprehensive consideration on the fiber performances of both BC 7 F 2 and BC 7 F 2:3 , 16 ones with high fiber strength and 16 ones with low fiber strength were separately selected from BC 7 F 2:3 lines at 20 DPA to construct high and low extreme pools. After being dissected from the developing bolls, fiber sampled were quickly frozen into the liquid nitrogen, and subsequently stored at -80℃. RNA extractions were conducted on each plant, and the obtained RNA samples were equally mixed to construct the extreme pools. RNA prep Pure Plant Kit (Tiangen, Beijing, China) was used to extract high-quality RNA samples. The quality and quantity of the RNA were verified using 1%agarose gel and spectrophotometer (Nanodrop 2000, Thermo Scientific, USA).

Library Construction and Transcriptome Sequencing
First, high-quality mRNA was enriched from total RNA using Oligo (dT) magnetic beads, and was randomly cleaved into short fragments by fragmentation buffer, then double-chain cDNA was synthesized with mRNA as a template. After purification, the cDNA segments were utilized to construct the cDNA libraries through end-repair, adding the tail of the ploy A, and selecting the size of the fragments. At last, the cDNA libraries were obtained by PCR amplification, and a total of 4 RNA libraries were sequenced using Illumina HiSeq™ 4000 sequence platform (Biomarker Technologies Corporation, Beijing, China).
By high-throughput sequencing, a large amount of raw data was obtained, and stored in FASTQ format file. The raw data was filtered to obtain clean reads. Clean

Phenotypic data
On behalf of taking extreme-pool samples with more accurate phenotypic traits, the chosen individuals from BC 7 F 2 population in 2016 were re-planted with plant-to-row to obtain BC 7 F 2:3 populations in 2017, of which 30 naturally opened bolls were subjected to fiber-quality testing ( Table 1) 1A). Obviously, there were more DEGs in CCRI45-vs-MBI9915 group (Fig. 1B) than those in High pool-vs-Low pool group (Fig. 1C), of which the former group identified over 8 times as many down-regulated DEGs, while the latter group identified more than 4 times as many up-regulated DEGs, implying the dramatic variations between the two group. Having performing the hierarchical clustering analysis on DEGs from two pairwise-comparison groups, most genes showed diverse expression levels either in two parent group ( Fig. 2A) or in two extreme pools ( Fig. 2B), while similar phenomenon occurred on the slight DEGs with high-ratio expression qualities between two groups, which might be closely related to the mechanism regulation of fiber strength.
GO enrichment analysis was subsequently conducted on the whole DEGs from two parent group, two extreme pools, and common DEGs (Fig. 3), respectively, aiming at identifying the putative functional genes or significant signal pathways during fiber development. Top GO terms were separately enriched into three main categories, and as for the DEGs from both two parent and two extreme pool groups or the common DEGs, cell part and cell were the predominant subcategories in cellular component, catalytic activity, and binding were the top subcategories in molecular functional category, and metabolic process, cellular process, and single-organism process were the most enriched subcategories in biological process.
Similarly, the whole DEGs were subjected to KEGG pathway analysis and were mapped onto five top-level subcategories of cellular processing, environmental information processing, genetic information processing, metabolism and organismal systems. At 20 DPA, DEGs in two parent groups (Fig. 4A) were mostly annotated into plant hormone signal transduction (ko04075),carbon metabolism (ko01200), while plant hormone signal transduction (ko04075) and starch and sucrose metabolism (ko00500) were the top two pathway subcategories for DEGs in two pool groups (Fig. 4B). The common DEGs both in two groups were mostly enriched to plant hormone signal transduction (ko04075) pathway subcategories (Fig. 4C).

qRT-PCR Verification of BSR-seq results
To confirm the reliability of BSR-seq results, qRT-PCR was performed on the five DEGs from two parents (CCRI36 and MBI9915) at 20 DPA, namely as Gh_A07G0831, Gh_A07G0837, Gh_A07G0838, Gh_A07G0858 and Gh_A07G0973, of which the primer sequences are enlisted in Table 3. According to the BSR-seq data, all the five DGEs showed significantly lower expression levels in MBI9915 than those in CCRI36, which presented the similar difference-pattern between two parents in qRT-PCR results.
Only Gh_A07G0831 showed non-significant difference on the expression levels between two parents (Fig. 5A), while Gh_A07G0858 (Fig. 5D) and the other three DEGs (Gh_A07G0837, Gh_A07G0838 and Gh_A07G0973) (Fig. 5B, C, and D) had significantly and extremely significantly higher expression qualities in CCRI36 compared with those in MBI9915, respectively. The qRT-PCR data were consistent with the BSR-seq results, fully demonstrating the reliability of this study. Table 3 Primers are used for qRT-PCR verification.
gene ID Forward primer Reverse primer Gh_A07G0831 TCCTCAACGTGCTCAACGGTTT CAACAAGGAAATGGCGCCCAAG Gh_A07G0837 CCAGGCAGTTACTGAGTCAGCC CATCTGCTCCGAGGCTCTTGAC Gh_A07G0838 CACCACCACCACCATCACCATC TGCCAGAATGGTGGGTTTCGTT Gh_A07G0858 GACGGAGCCTGATGCAGAGAAG AGGCTGGTGGCCAAATGACAAT Gh_A07G0973 CCGGAAAACACTGTCGGTGACT ATCCCAACGCCTTCAGCTTCTC Chromosome region confirmation and candidate gene identification  (Table 4 and Fig. 6). Referred to the TM-1 genome information, a total of 1981 genes were annotated in 18 associated regions, including 141 ones with nonsynonymous mutation, which were generally regarded the potential candidate genes involved in fiber strength formation. Meanwhile, GO enrichment and KEGG pathway analyses were conducted on the whole annotated genes, and the most genes were enriched to cell part and cell, catalytic activity and binding, and cellular process and single-organism process in cellular component, molecular function, and biological process based on GO categories (Fig. 7A), while ribosome (ko03010), carbon metabolism (ko01200), biosynthesis of amino acids(ko01230) and protein processing in endoplasmic reticulum(ko04141) were top 4 pathways with the greatest number of genes (Fig. 7B). Among all the 1981 genes, 69 genes and 93 genes were differentially expressed between two parents and between two extreme pools, respectively, while only 12 common DEGs were identified between two groups. According to annotation information of function genes shown in Table 5, 2 genes might play the significant roles during fiber development, namely as Gh_A05G1089 and Gh_A07G0837, which were separately annotated as abscisic acid transport in biological process and Golgi apparatus in cell component. Abscisic acid has been reported as a signal of secondary wall thickening to make great contribution to fiber development, and  Glycerophosphoryl diester phosphodiesterase family Gh_D08G0638 Protein of unknown function Gh_A07G0837 Golgi apparatus Gh_A11G1567 Regulator of chromosome condensation (RCC1) repeat Gh_A05G1141 Transcription factor regulating root and shoot growth via Pin3 Gh_A13G1277 EPSP synthase (3-phosphoshikimate 1carboxyvinyltransferase) Gh_D10G2220 Trypsin and protease inhibitor Gh_A13G1263 intracellular membrane-bounded organelle Gh_A05G1089 abscisic acid transport Gh_A05G1261 Chlorophyll A-B binding protein Clone, sequence and phylogenetic analysis on Gh_A07G0837 For further confirming the contribution of the candidate gene during fiber development, gene clone was conducted on the Gh_A07G0837 in CCRI36 and Hai1, respectively. The coding sequence (CDS) length was 807 bp both in CCRI36 and Hai1, while 2 SNPs were identified to separately locate in the 30th base and the 486th base (Fig. 8A), of which the former resulted in non-synonymous mutation, specifically encoding the aspartic acid in CCRI36 but the glutamic acid in Hai1, while the latter caused synonymous mutation. The CDS length of the candidate gene was only 627 bp in TM-1 reference genome, and the short 181 bp might be the second intron sequence based on annotation information, indicating the probable occurrence of alternative splicing (AS) event in CCRI36 and Hai1. Subsequently, the phylogenetic relationships of Gh_A07G0837protein sequences were analyzed among the CCRI36, Hai1 and other cotton species, and BLAST results showed the deduced amino acid sequence of Gh_A07G0837 in CCRI36 were highly similar to G.
arboretum was 80%identity and 97%, 95%, and 95%similarities (Fig. 8B). parents and between two extreme pools, respectively, while only 12 common genes were differentially expressed between two groups. Two candidate genes were finally screened based on functional annotation of reference genome, namely as namely as Gh_A05G1089 (abscisic acid transport) and Gh_A07G0837 (Golgi apparatus), of which the former has been reported to make great contribution to fiber development as a signal of secondary wall thickening (Gokani et al., 1998;Kurek et al., 2002), while the latter was involved in fiber development (Ramsey and Berlin, 1976

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
This study was supported by the National Natural Science Foundation of China  Having performing the hierarchical clustering analysis on DEGs from two pairwise-compariso Figure 3 GO enrichment analysis was subsequently conducted on the whole DEGs from two parent gro At 20 DPA, DEGs in two parent groups ( Figure 4A) were mostly annotated into plant hormone Figure 5 Only Gh_A07G0831 showed non-significant difference on the expression levels between two p Figure 6 All the identified SNPs were quantified with SNP-index, of which the information were synthe GO enrichment and KEGG pathway analyses were conducted on the whole annotated genes, a The coding sequence (CDS) length was 807 bp both in CCRI36 and Hai1, while 2 SNPs were id