Genome assembly and geospatial phylogenomics of the bed bug Cimex lectularius

The common bed bug (Cimex lectularius) has been a persistent pest of humans for thousands of years, yet the genetic basis of the bed bug's basic biology and adaptation to dense human environments is largely unknown. Here we report the assembly, annotation and phylogenetic mapping of the 697.9-Mb Cimex lectularius genome, with an N50 of 971 kb, using both long and short read technologies. A RNA-seq time course across all five developmental stages and male and female adults generated 36,985 coding and noncoding gene models. The most pronounced change in gene expression during the life cycle occurs after feeding on human blood and included genes from the Wolbachia endosymbiont, which shows a simultaneous and coordinated host/commensal response to haematophagous activity. These data provide a rich genetic resource for mapping activity and density of C. lectularius across human hosts and cities, which can help track, manage and control bed bug infestations.

, L104F, G108D, I109V) was directly involved in ATP binding. In the wild type protein, residues 58, 60 and 84 are in close proximity and form a hydrogen-bonding network that stabilizes loop formation in this region. The expectation was that a change from a small neutral to a larger charged residue (e.g. A58D, T84R) might cause reorganization of the loops. The comparison of the wild type and mutant DDL structural models suggests that a replacement to oppositely charged amino acids may lead to stronger interactions within this network. In addition to hydrogen bonds, strong ionic interactions occur between D58 and R84 in the mutant protein. This, in turn, leads to partial changes in adjacent flexible regions and may cause some alteration in ligase activity.

A B C
Supplementary Figure 11. Phylogenetic tree of insect infestins. The tree was generated using maximum parsimony and with a Strongylocentrotus infestin as an outgroup. Random Additions (n=100) were used with Tree Bisection Reconnection (TBR) branch swapping to obtain the tree. The colored names in the tree refer to the three major kinds of infestins that are suggested by this analysis. Red indicates the dipetalogastin family, the blue indicates the brasiliensin family (or infestin 4) and the green represents the infestin 1 family. The Cimex infestin is in the orange square. Three bedbug homologs (CLG00050, CLG13404, CLG00055) were found in bedbug with partial identity to other blood-feeding hemipteran insects (kissing bugs, R. prolixus and T. infestans), nested within the main cimicomorph esterase clade. Same maximum liklihood scale as Supplemental Supplementary Figure 12. Panels (A) maximum parsimony and (B) maximum likelihood show the dynamics of Evalue cutoff on the consistency of phylogenetic trees generated using the gene presenceabsence information. Majority-rule consensus parsimony and likelihood trees were calculated (bootstrap = 10,000), and for both of the majority-rule topologies, the relative support of each gene family matrix is shown as "Navajo rugs" [3] at each node. Black boxes indicate nodal agreement, white boxes indicate disagreement, and gray boxes indicate agreement with bootstrap support > 70%. To measure character consistency, the Rescaled Consistency Index (RCI) [4] was computed. To measure nodal agreement, the Consensus Fork Index (CFI) [5] and Rohlf consensus index 1 [6] were computed. The graphs (panels C and D) examine the dynamics of E-value cutoff analyses. These figures demonstrate an optimal E-value cutoff in the range e-50-e-75 for this dataset. All nodes on this tree received 100% bootstrap support.

Genes found to be microbial by Alien_Index
CLG36175 CLG03486 6 7 Supplemental Table 8 Three member hydrogen network between residues 96-98-168. X-ray structures used as templates for homology models highlighted in green. PDB code 96 Supplementary Methods 14 15 Raw sequence data 16 The genome assembly validated by the National Center for Biotechnology Information (NCBI), 17 where it was checked for adaptors, primers, gaps, and low-complexity regions. The genome 18 assembly has been approved and given the accession number JRLE00000000 and BioProject PRJNA259363. All genome sequencing data has been deposited in the Sequence Read Archive 20 (SRA) with accession number SRS749263. RNA-seq data is available as FASTQ files and were 21 quality-checked and deposited in the SRA with accession SRR1790655. 22 23 Biological samples 24 The bedbugs were taken from a Harlan strain colony maintained by Louis Sorkin (American 25 Museum of Natural History). The Har-73 strain was originally collected by Harold Harlan in 1973 26 from an infestation at the U.S. Army barracks in Fort Dix, NJ, and has been raised as a laboratory 27 pesticide-susceptible strain since that time. 28 29 Bedbug collection and feeding 30 Bedbugs were reared in ~236.6 ml (8 fl oz) glass canning jars where the metal covers had a 250-350 31 µm hole mesh screening heat-glued on the inside. Heat glue was applied to the outer circumference 32 of the screen surface to leave a 3 cm diameter central circle of exposed screen. Folded cardboard 33 was used as substrate. Jars were inverted on a human arm for feeding for 30 min on a monthly basis.  extractions were performed using a Trizol / RNeasy (QIAGEN) hybrid protocol, as detailed in [1]. 43 44 45 High throughput sequencing library quality check 46 Moleculo sequences were segregated into 3 bins by length: short (<7,501 bp), medium (7,501- with -F set to 4. The 'MD' tag was added to the resulting BAM files using the calmd 58 command of samtools producing SAM files containing this tag. The MD tag allowed for two 59 pieces of information to be extracted from the alignments: the total number of nucleotides 60 included in each alignment and the edit distance between the query and reference sequences.

61
The command used for obtaining the total sequence alignment length was in the assembly, no scaffolds were available from these data.

134
Metassembler 135 The ALLPATHS and Moleculo assemblies were combined into a single assembly using Transcriptome assembly 192 The bedbug transcriptome was produced using the Trinity assembler r2012-10-05 [7]. In order to  can be summarized as follows. First, CEGMA was run on the Metassembler-produced assembly.

220
The CEGMA results were used for training a SNAP v2006-07-28 [12] hidden Markov model produced with the hmm-assembler.pl program. Next, a second HMM was produced using 227 GeneMark-ES v2.3e [13,14]  .bz2) [15]. The maker_opts.ctl file was edited to include the Metassembler genome assembly for the 233 genome entry, the Trinity assembly for the est entry, the Acyrthosiphon pisum protein sequences for 234 the protein entry, the CEGMA/SNAP HMM file for the snaphmm entry, the GeneMark-ES HMM 235 file for the gmhmm entry, est2genome set to 1, protein2genome set to 1, keep_preds set to 1, and 236 single_exon set to 1. The first iteration of MAKER was run with these configuration values. The

237
MAKER program gff3_merge was used to merge together the resulting gff3 files from MAKER.

238
This merged gff3 file was used as input to SNAP to build a second HMM using the SNAP HMM 239 creation process as described previously. The genome.ann zff file created as part of the SNAP 240 HMM creation process was used to generate a gff3 file using the zff2gff3 program included in the 241 SNAP distribution. The Perl one-liner perl -plne 's/\t(\S+)$/\t\.\t$1/' was used to add an extra 242 column to the generated gff3 file for input to Augustus. An altered version of the autoAug.pl script 243 from Augustus v2.7 [16] was used to generate an Augustus HMM which used a GMAP v2014-02- 244 28 [17] alignment as a replacement for the BLAT [18] alignment used in the autoAug.pl pipeline.

245
The parameters used for autoAug.pl were the genome assembly for the --genome argument, the 246 transcript assembly for the --cdna argument, the gff3 file produced by SNAP for the --trainingset 247 argument, --singleCPU, -v, and --useexisting. The GMAP alignment was generated by first running 248 gmap_build on the genome assembly. The gmap command was then used to align the cdna.fa The Trinity transcriptome assembly was aligned to human reference genome version hg19 using the 269 STAR v2.3.1z aligner [22]. In order to accommodate the longer lengths of the transcript sequences 270 (compared to RNA sequencing read lengths), STAR was compiled with the STARlong option.

280
The awk program was used to filter the BLASTN results by alignment length and percent identity. set of RNA-seq data. Each of these SAM files was converted to a BAM file using samtools using the view command with parameters -Sb. Each BAM file was then sorted using the samtools sort 302 command. In order to perform pairwise differential expression analysis for each RNA-seq dataset, 303 the MAKER-generated gff file and BAM files were uploaded to the Rätsch Lab Galaxy [25]  where b2gPipe.properties.local points to a local Blast2GO database. We also used InterProScan 319 v5.5-48.0 [27] with the following parameters: -dp -f TSV,XML,GFF3 -goterms -iprlookup -i 320 Cimex_lectularius.

322
Human contamination of RNA-seq data 323 Unaligned reads retained when producing previously described RNA-seq alignments to the 324 Metassembler genome assembly were aligned to human genome hg19 using STAR. The samtools 325 view command was used to count aligned reads with the -S -c -F 4 options.

327
Active gene discovery 328 Sorted bam files for each developmental stage and sex as described previously were used as input to 329 the rpkmforgenes.py program [28]. Each replicate bam file was processed separately. The counts for genes considered active were plotted using Python's matplotlib.

334
Analysis of genes related to blood-feeding activity 335 Several suites amino-acid sequences from anticoagulants and other bioactive proteins involved Based on the computational model we concluded that among eight observed mutations, A58D, (branch-site random effects likelihood). In all cases the ML gene tree was used as guide tree.
that the kissing bug Triatoma pallidipennisuses as a scaffold for anticoagulants [53]. Lipocalin is 486 also found in the kissing bug Rhodnius prolixus. We also found for a variety of characterized 487 proteins with less obvious associations to a blood feeding lifestyle. Venom metalloproteases are 488 most intensively studied in the context of crotaline and viperine snake envenomations wherein their hemorrhagic activity relates to endothelial pathology, fibrinogenolysis and their ability to act as 490 disintegrins that inhibit platelet aggregation [54]. Zinc-binding metalloproteases are present in the 491 saliomic profiles of a wide range of arthropod sanguivores, including ticks [55], hookworms [56] 492 and cimicomorphs related to bedbugs; e.g., the reduviids [57]. Serine protease inhibitors are more 493 commonly associated with a blood feeding habit than are serine proteases [58]. Nonetheless, a 494 variety of these proteases and other trypsin-like plasminogen activators have been characterized 495 from the salivary transcriptomic profiles of the relatively closely related Triatoma matogrossensis 496 and Triatoma infestans [59]. These references were all used for the comparison to the bedbug 497 proteome and genome.

499
The raw sequences used to generated the tree were: Bayesian phylogenetic inference was also performed (lset rates=gamma; prset aamodelpr = mixed; 542 mcmc ngen=1,000,000; sumt burnin=200,000). The Bayesian tree was in broad agreement with the 543 MP tree. The genome assembly has been approved and given the accession number JRLE00000000 and 549 BioProject PRJNA259363. All genome sequencing data has been deposited in the Sequence Read 550 Archive (SRA) with accession number SRS749263. RNA-seq data is available as FASTQ files and 551 were quality-checked and deposited in the SRA with accession SRR1790655.