The Euphausia superba transcriptome database, SuperbaSE: An online, open resource for researchers

Abstract Antarctic krill (Euphausia superba) is a crucial component of the Southern Ocean ecosystem, acting as the major link between primary production and higher trophic levels with an annual predator demand of up to 470 million tonnes. It also acts as an ecosystem engineer, affecting carbon sequestration and recycling iron and nitrogen, and has increasing importance as a commercial product in the aquaculture and health industries. Here we describe the creation of a de novo assembled head transcriptome for E. superba. As an example of its potential as a molecular resource, we relate its exploitation in identifying and characterizing numerous genes related to the circadian clock in E. superba, including the major components of the central feedback loop. We have made the transcriptome openly accessible for a wider audience of ecologists, molecular biologists, evolutionary geneticists, and others in a user‐friendly format at SuperbaSE, hosted at http://www.krill.le.ac.uk.


| INTRODUCTION
Antarctic krill Euphausia superba (Dana, 1852; hereafter "krill") is a keystone species of the Southern Ocean ecosystem, a hugely abundant pelagic crustacean inhabiting a circumpolar belt between the Antarctic continent and the Polar Front (Nicol, Constable, & Pauly, 2000), with an estimated total biomass of 379 million tonnes (Mt) and postlarval production of 342-536 Mt/yr (Atkinson, Siegel, Pakhomov, Jessopp, & Loeb, 2009). It sits at the center of a "wasp-waist" food web (Atkinson et al., 2014) providing the major link between primary production and higher trophic levels and so converting phytoplankton to animal protein. Krill make up to 70% of the food intake of Southern Ocean predators such as seals, seabirds, whales, squid, and fish (Murphy et al., 2007), with an estimated annual demand of 128-470 Mt in the Southern Ocean ecosystem as a whole (Mori & Butterworth, 2006).
Krill also have an impact on the abiotic environment as a suggested "ecosystem engineer" (Murphy et al., 2013), playing an important role in the biological pump through sequestering carbon from the ocean surface to ocean interior (Perissinotto, Gurney, & Pakhomov, 2000;Tarling & Johnson, 2006), as well as facilitating the recycling of iron (Schmidt et al., 2011) and the production and uptake of ammonium (Whitehouse, Atkinson, & Rees, 2011). Recent studies have suggested that krill may be vulnerable to habitat loss as an effect of warming, with particular impact in areas accessible by predators and used by fisheries (Hill, Phillips, & Atkinson, 2013), and through ocean acidification (Kawaguchi et al., 2013). Recruitment, which is dependent on the survival of krill larvae overwintering under sea ice, is considered particularly vulnerable to climate change (Flores et al., 2012), and it will be vital to understand and predict the adaptability of krill to a changing environment and the potential consequences as it does so.
Commercially, krill provide the region's largest fishery by weight at 300,000 tonnes annually, mostly used in aquaculture as a bulk ingredient or nutritional additive in fish meal (Nicol & Foster, 2016). In recent years, there has been notable growth in the demand for products for human use in the nutraceutical industry, exploiting the high omega-3 polyunsaturated fatty acid content in krill. Krill oil is now the second most popular source of omega-3 after fish oil (Backes & Howard, 2014). The oil is taken as a treatment or preventative measure for conditions such as cardiovascular disease and arthritis; 29% of recent krill-related patents are categorized as for medical usage (Nicol & Foster, 2016).
As a keystone species of clear ecological and economic significance, active research is thriving, encompassing a broad spectrum of approaches from the population level down to the molecular. To cite further examples from recent years, this interest ranges from modeling population dynamics (Groeneveld et al., 2015) and food webs , through swarming behavior , diurnal migratory behavior (Cresswell et al., 2009;Gaten, Tarling, Dowse, Kyriacou, & Rosato, 2008), and physiological stressor responses (Auerswald, Freier, Lopata, & Meyer, 2008) to peptide evolution and phylogeny (Cascella et al., 2015) and changes in gene expression related to seasonal or molt status (Seear, Goodall-Copestake, Fleming, Rosato, & Tarling, 2012;Seear et al., 2010).
The enthusiasm for krill research is not always matched by the tractability of the organism itself, however; in contrast to the fruit fly Drosophila melanogaster, a model organism of some repute and power (Jennings, 2011), krill are remote, difficult to transport and maintain, and each discovery is hard won. The krill genome presents further challenges. Estimated at 47 gigabases (Jeffery, 2012), the genome of E. superba is more than twice the size of the largest genome sequenced so far, that of the loblolly pine Pinus taeda (Neale et al., 2014), and a sequencing project is considered unfeasible until further reductions in sequencing costs and improvements in long-read technologies are achieved (Jarman & Deagle, 2016). Alternative approaches to genetic questions have therefore been adopted for krill, such as expressed sequence tag (EST) libraries (De Pittà et al., 2008;Seear et al., 2009), microarrays (Seear et al., 2010) and, with increasing popularity as the costs of RNA-seq continue to fall, transcriptomic data that can provide gene coding sequences from reconstruction of mRNA transcripts (Clark et al., 2011;De Pittà et al., 2013;Martins et al., 2015;Meyer et al., 2015;Sales et al., 2017).
In this study, we describe the creation and annotation of SuperbaSE, a new de novo assembled head transcriptome for E. superba that is openly available online to interested researchers. With the aim of reconstructing as comprehensive a set of accurate, full-length transcripts as possible, we used multiple variations of the de Bruijn graph method of de novo assembly, in which reads are broken down into a library of k-mers, "words" of length k, prior to reassembly (Compeau, Pevzner, & Tesler, 2011). The output generated by this method varies with k-mer length, with low values favoring transcript diversity and reconstruction of lowly expressed genes at the cost of fragmentation and accuracy, while high values favor higher accuracy over a broader abundance range, at the expense of diversity (Robertson et al., 2010).
De Bruijn assembly software packages can also vary in their output when using the same k-mer length (Xie et al., 2014).
Our own research interests revolve around the circadian clock, a field which until recently was lacking genetic data for crustaceans (Strauss & Dircksen, 2010), although matters have improved in recent years (Christie, Fontanilla, Nesbit, & Lenz, 2013;Tilden, McCoole, Harmon, Baer, & Christie, 2011;Zhang et al., 2013) and the head transcriptome was created as a gene discovery resource to drive progress in this area. Prior to the decision to adopt an RNAseq approach, we cloned and Sanger sequenced a number of core circadian genes for the krill through degenerate PCR and RACE extension; for some, we obtained complete coding sequences while achieving only partial success with others due to difficulties with degenerate PCR and RACE extension. As an example of the utility and potential of SuperbaSE, we relate here its use in completing our partial sequences and identifying many others relevant to the molecular clock, including regulatory genes involved in the generation and maintenance of circadian periodicity (Özkaya & Rosato, 2012) and those downstream whose expression is regulated by the clock (clock-controlled genes).

| Animal collection
The collection of E. superba samples took place during the Antarctic summer in February 2008, on the Discovery 2010 cruise JR177.
Swarms were identified using an EK60 echosounder North-West of South Georgia at 52°S and caught by target fishing using a pelagic RMT8 net. Catches were taken at 1 a.m., 6 a.m., 1 p.m., and 8 p.m. local time. Immediately after collection between 30 and 200 individuals from each net were flash frozen by placing them into tubes that were then plunged into methanol at −80°C, moved to a −80°C freezer until the end of the cruise and returned to the UK without thawing.

| RNA extraction
For each wild catch, three frozen krill heads were combined and powdered with mortar and pestle. Samples were kept frozen by continuous addition of liquid N 2 and by working on a bed of dry ice. Total RNA was extracted with TRIzol ® reagent (Thermo Fisher Scientific, UK).
RNA was resuspended in DEPC-treated water and the concentration and purity [A(260 nm)/A(280 nm) >1.8] determined using a NanoDrop spectrophotometer (Thermo Fisher Scientific). One microgram of total RNA was electrophoresed on a nondenaturing 1.5% (w/v) agarose gel to check for degradation. The Qiagen oligotex ® mRNA extraction kit was used to isolate mRNA from one microgram of total RNA following the manufacturer's protocol. The mRNAs from each catch were combined into a single sample >10 μg and >250 ng/μl, which was delivered to BGI Tech Solutions (Hong Kong) Co Ltd (BGI) for generation of raw read data. The sample was subject to polyA enrichment using oligo-dT beads, then fragmented and used to construct a cDNA library with an average fragment size of 200 base pairs (bp). This was primed with random hexamers and sequenced using the Illumina HiSeq 2000 platform to generate 100-bp paired-end reads.

| De novo transcriptome assembly
The read data were subject to quality control by BGI, removing adapter sequences, contamination, and low-quality reads, and once retrieved Fast QC v0.11.2 was used for further quality assessment and Trimmomatic 0.32 (Bolger, Lohse, & Usadel, 2014) to remove low quality leading and trailing bases.
Assembly was performed using Trinity r20140717 (Haas et al., 2013), Bridger r2014-12-01 (Chang et al., 2015), Trans-ABySS 1.5.1 (Robertson et al., 2010), and SOAPdenovo-Trans 1.03 (Xie et al., 2014). Trinity restricts the user to a k-mer setting of 25. Bridger has a maximum permitted k-mer of 31, and six assemblies were generated using k-mers from 21 to 31 in two-step increments. Trans-ABySS and SOAPdenovo-Trans can use higher k-mer settings, and for each assembler, eight assemblies were created using k-mers from 21 to 91 in 10-step increments. A minimum contig length of 200 bp was specified for assembly.
For each assembler that permitted varying k-mer settings, an assembler-specific merged assembly was produced. For Trans-ABySS, the built-in function transabyss-merge was used, while for the others the contig headers in each assembly were amended to avoid crossassembly duplicate IDs and the assemblies concatenated into a single FASTA file. These were then subject to the removal of redundancy using the dedupe.sh function of BBMap (http://sourceforge.net/proj ects/bbmap/), set to remove duplicates and containments at 100% identity including reverse complement comparisons.
Each merged assembly was assessed using the read metrics function of TransRate 1.01 (Smith-Unna, Boursnell, Patro, Hibberd, & Kelly, 2016), which generates an output file of those contigs considered to be well assembled as supported by read-mapping evidence. The "good contigs" files for each assembler were combined into a single file that was processed a second time using BBMap and TransRate to select the absolute best unique contigs across all assemblers. Going forward, this selection of contigs is referred to as the total assembly.
Transdecoder (Haas et al., 2013) was used to identify likely coding contigs with an open reading frame coding for peptides of 100 amino acids (aa) or longer-only the longest identified ORF per contig was retained. The output was processed using CD-HIT (Fu, Niu, Zhu, Wu, & Li, 2012) at 100% identity to remove duplicates, resulting in the peptide assembly. The contigs from the total assembly that encode these peptides were extracted into a new database, hereafter referred to as the coding assembly.

| Transcriptome annotation and analysis
The peptide assembly was queried against the SwissProt protein database from the UniProt Knowledgebase (UniProtKB) using the blastp function of BLAST+ 2.5.0 (Camacho et al., 2009). The output was set to return the single best result (-max_target_seqs 1, -max_hsps 1) with an E-value of 1.0e −6 or lower. All contigs from the peptide assembly that failed to return a result were subsequently queried against the arthropod protein database of the UniProtKB with the same criteria.
Both protein databases were retrieved from ftp://ftp.ebi.ac.uk/pub/ databases/fastafiles/uniprot/ on 26th November 2016. The coding assembly was also subject to sequence homology annotation using the blastx function of BLAST+ against the two protein databases using the same method. The accessions returned by these searches were used to retrieve associated Enzyme Codes, GO terms, KEGG, and InterProScan IDs from the UniProt website (http://www.uniprot. org/uploadlists/) using the Retrieve/ID mapping function. The peptide assembly was further annotated using functions built into Trinotate (Haas et al., 2013) to identify Pfam protein domains.
To provide an overview of the information present in our assembly and compare it to the current most extensive krill molecular resource, KrillDB (Sales et al., 2017), we used blastx annotation results from the high quality, manually reviewed SwissProt database as a metric. The SuperbaSE coding assembly, KrillDB (retrieved from http://krilldb.bio. unipd.it/ on 14 March 2017) and the mRNA transcripts derived from the genome of another crustacean, Daphnia pulex (Colbourne et al., 2011; retrieved from http://wfleabase.org/genome/Daphnia_pulex/ current/mrna on 5th April 2017) were subject to this search using an E-value cutoff of 1.0e −6 or lower, the output again set to return the single best result. Results were used to retrieve GO annotation from UniProtKB as above and summarized using WEGO (Ye et al., 2006).
To identify the extent of the overlap between KrillDB and the coding assembly, the two resources were further subject to a reciprocal BLAST search using blastn, E-value 1.0e −20 or lower, returning the single best result.

| SuperbaSE: the online Euphausia superba transcriptome database
The transcript and annotation data of the peptide and coding assemblies were merged, processed and converted into an appropriate format for online viewing. Each entry was given sequential and consistently formatted gene, transcript and ORF IDs and assigned putative gene names based on BLAST annotation results. The site was created using a Catalyst framework to access annotated transcriptome data stored in a MySQL database, and the front-end design makes use of template toolkit and Twitter Bootstrap.

| Transcriptome mining
Coding sequences for E. superba bmal1 (Es-bmal1, GenBank accession KX238952), clock (Es-clk, KX238953), cryptochrome 1 (Es-cry1, KX238951), and cryptochrome 2 (Es-cry2, also reported as Escry by Mazzotta et al. (2010)) had already been successfully cloned and Sanger sequenced in full when the head transcriptome project was undertaken, while E. superba period (Es-per, KX238955) and timeless (Es-tim, KX238954) had both been partially cloned and sequenced (Table 1 and Supplementary Data). These sequences were used to query the total assembly to determine the extent and accuracy of their reconstruction in the transcriptome and, in the case of Es-per and Es-tim, to obtain the full coding sequences. The queries were performed by translation of the nucleotide sequences to putative peptides using ExPASy Translate (http://web.expasy.org/translate/), with these then used to mine the total assembly using the tblastn function of BLAST+ with an E-value cutoff of 1.0e −3 . For Es-per and Es-tim, the output was used to design primers for confirmation of the full coding sequences of both genes as outlined above (Supplementary Data, Table S2).
For regulatory and clock-controlled genes, the peptide sequences of D. melanogaster/Mus musculus genes known to contribute to or interact with circadian timekeeping were used to search the total assembly for putative E. superba orthologs using the tblastn function of BLAST+ with an E-value cutoff of 1.0e −3 . To add additional support to such identification, candidate contigs were extracted from the assembly database and translated, and the peptide sequences blasted against D. melanogaster proteins curated in Flybase (Attrill et al., 2016) and against the NCBI nonredundant proteins database (excluding uncultured/environmental sample sequences and all entries with a title containing the words "putative," "hypothetical," or "predicted"). The derived peptide sequences obtained from this process were used to search KrillDB with the tblastn function with an E-value cutoff of 1.0e −3 to compare the presence and extent of transcript reconstruction of circadian-related genes in the two resources. In each case, KrillDB was also queried with the D.melanogaster/M.musculus ortholog used in the initial search process to search this resource for superior candidates.

| Sequencing data and quality control
34,918,657 clean paired-end 100-bp reads (69,837,314 in total) were generated by the Illumina sequencing. After Trimmomatic processing for low-quality bases at each end of all reads, the read lengths ranged between 81 and 100 bases.

| Assembly and annotation metrics
The total assembly initially comprised 484,125 contigs. Submission to the NCBI Transcriptome Shotgun Assembly database required the assembly to be subject to a contamination screen, with 45 contigs removed as a result and 484,080 contigs submitted to the TSA under accession GFCS00000000. The total assembly achieved a TransRate score of 0.42, an improvement over all the individual multi-or single k-mer assemblies produced by any one assembly package (Table 2).
There were also improvements for the total assembly in the percentage of total read mappings, the percentage of good read mappings, the percentage of contigs with identified open reading frames (ORFs), and a reduction in the number of perceived bridges, which T A B L E 1 Core circadian genes cloned and Sanger sequenced for Euphausia superba. Es-cryptochrome 2 has been previously reported as Escry (Mazzotta et al., 2010) Gene name indicate transcript fragmentation. The contig mean length and assembly N50 of the total assembly were notably higher than the output of Trinity, SOAPdenovo-Trans, and Trans-ABySS, yet lower than the output of Bridger, signaling a proportionally large contribution of contigs from the latter; indeed, Bridger provided 50% of contigs in the total assembly, followed by Trans-ABySS, SOAPdenovo-Trans, and Trinity with 26%, 15%, and 9% respectively. From the total assembly, TransDecoder identified 147,450 contigs (the coding assembly) encoding putative peptides (the peptide assembly). Of these, 27,413 were deemed by TransDecoder to represent complete peptides; 60,826 were identified as internal fragments, 42,311 as missing the 5′ end and 16,900 as missing the 3′ end.
BLAST annotation of the peptide assembly by sequence homology returned 67,762 peptide contigs with hits from the UniprotKB SwissProt dataset, with an additional 20,972 contigs annotated using the UniprotKB arthropod dataset. A further 3,467 contigs were annotated by querying SwissProt and the arthropod dataset with the coding assembly nucleotide sequences for an overall total of 92,201 contigs annotated in either peptide or nucleotide form. Of these contigs, 80,149 were subsequently annotated with at least one GO term, 57,094 with at least one KEGG identifier, and 86,405 with at least one InterProScan accession. Further to this, 67,782 contigs were annotated with Pfam domain identifiers (whole sequence E-value of 1.0e −6 or lower), including 2,044 contigs that were not BLAST annotated.
The number of unique accessions returned by this BLAST annotation process was 23,931; of these, 10,762 were assigned to a single contig, while the rest were assigned to two contigs or more to reach the total number. The SwissProt entry Q9V7U0, for example, a pro-resilin precursor from Drosophila melanogaster, annotated 1,148 contigs, and another 32 unique accessions were each assigned to 100 contigs or more. By way of comparison, a single assembly generated using Bridger at k-mer = 25 (Bridger25) received 12,616 unique accessions from SwissProt (Supplementary Data, Table S3). Q9V7U0 was assigned to 274 contigs in this assembly, suggesting a duplication factor of 3.8 in the total assembly; of the SwissProt accessions assigned to 100 contigs or more in the peptide and coding assemblies, the duplication factor varied from 2.03 to 8.2. T A B L E 2 Comparison of de novo assembly quality using TransRate contig and read-mapping metrics. Bridger, SOAPdenovo-Trans, and Trans-ABySS assemblies are multi-k-mer assemblies, see Methods for details. See Smith-Unna et al. (2016)

| Transcriptome mining
Searching the total assembly using the complete coding sequences of Es-bmal1 (Figure 2), Es-clk (Figure 3), Es-cry1 (Figure 4), and Escry2 ( Figure 5) returned transcripts representing between 88 and 100% of each sequence; in the case of Es-cry2, this was a single, complete transcript, while the others returned 2-4 fragments. Searching the assembly using the fragments of Es-per and Es-tim returned a complete coding sequence for the former (Figure 6), and two large fragments, covering nearly the complete sequence, for the latter (Figure 7). In both cases, these full sequences were subsequently confirmed by PCR, cloned, and Sanger sequenced (Supplementary Data and Table 1). In some cases, minor variations were seen when comparing the peptide sequences derived from assembled transcripts to those confirmed by high fidelity PCR, and in the case of Es-CLK, there is a notable stretch of disagreement with particular contigs from residues 561-641, and a smaller stretch later on (Figure 3). F I G U R E 3 Es-CLOCK (accession KX238953) aligned with fragments mined from the total assembly. Yellow highlights represent sequence data not found in the total assembly. Red text highlights residues inconsistent between the confirmed PCR sequence and transcriptome sequences F I G U R E 4 Es-CRY1 (accession KX238951) aligned with fragments mined from the total assembly. Yellow highlights represent sequence data not found in the total assembly. Red text highlights residues inconsistent between the confirmed PCR sequence and transcriptome sequences Searching KrillDB for circadian-related transcripts (core, regulatory, and clock-controlled genes) generated largely similar results, albeit that the majority of transcripts returned from the total assembly were longer than their KrillDB counterparts (Table 6). In KrillDB, a 303 bp 3′ fragment was the only evidence found of Es-bmal1, and Es-per was represented by a 2,339 bp fragment, both less extensive than the results from the total assembly. Conversely, Es-tim was represented by 4,047 bp contig in KrillDB and Es-cry1 by a 1,819 bp contig, both covering the complete coding sequence in contrast to the shorter, fragmented transcripts in the total assembly. By visual inspection of the results, no equivalent was found in KrillDB for the total assembly contigs identified as representing the genes PP2A microtubule star, PP2A widerborst, and takeout-in the case of the first two genes the top result from KrillDB was a less compelling candidate compared to the output from the total assembly and each was also found to be present in the total assembly, but rejected in favor of the contigs listed in Table 3. In the case of takeout, no candidate contig was found in KrillDB.

| SuperbaSE: The online Euphausia superba transcriptome database
SuperbaSE is hosted at www.krill.le.ac.uk. The site features a Quick Search which scans transcript/contig and GO fields for the input search text and an Advanced Search function with a wider set of options.
Results can be exported as a raw FASTA file of nucleotide sequences, or individual results can be viewed on their own page further showing the predicted peptide sequence and hyperlinked annotation data that permits the retrieval of details from UniProt and other sources. Figure 8 shows the data for the circadian gene Es-period as it is found on the website, including functional information. As an extra resource, the peptide assembly, coding assembly, and total assembly are available as searchable BLAST databases using the SequenceServer front end (Priyam et al., 2015), permitting fast, simple identification of genes of interest.

| Assembly and assessment
SuperbaSE was created as a tool for gene discovery. As such, the pipeline adopted for assembly focused on generating a comprehensive set of correct and complete transcripts, and one of the major factors affecting the reconstruction of transcripts from short read data is the k-mer length employed in assembly. Typically, as k-mer length increases, so too does average contig length and the median coverage depth of contigs assembled, while contig number falls (Gibbons et al., 2009); low k-mers therefore favor transcript diversity, while F I G U R E 5 Es-CRY2 (accession CAQ86665, Mazzotta et al. (2010)) aligned with contig mined from the total assembly F I G U R E 6 Es-PERIOD (accession KX238955) aligned with contig mined from the total assembly. Red text highlights residues inconsistent between the confirmed PCR sequence and transcriptome sequence. Black arrows denote the point at which sequence data from 3′ RACE extension ends-beyond this the transcriptome was necessary to complete the sequence  Table S3), despite the fact that, of these, only the coding assembly had been subject to processing to remove short, noncoding and low-quality contigs which may otherwise have returned annotations, highlighting how information present in the read data can be missed if a comprehensive approach is not adopted.
Comparative methods of transcriptome assessment that use a related organism as a measure of completeness (O'Neil et al., 2013) focus on conserved sequence identification at the expense of novel and divergent transcripts and lose relevancy with increasing evolutionary distance between subject and reference. TransRate, in contrast, has the benefit of directly assessing contig quality based on read-mapping metrics, identifying chimeras, gene family collapses, insertions, fragmentation, local misassemblies, and redundancy. This latter issue was further addressed through the use of BBMap's deduphe.sh at the nucleotide level and CD-HIT at the peptide level to remove absolute duplicates. Despite these efforts, redundancy is still an issue. The total assembly contains nearly half a million contigs yet no individual assembly produced more than 160,000 contigs (data not shown). De novo assembly is a sensitive, error-prone process and small differences in contig output across assemblers or k-mers may produce many variants representing the same in vivo transcript. This is indicated in the F I G U R E 7 Es-TIMELESS (accession KX238954) aligned with contig mined from the total assembly. Yellow highlights represent sequence data not found in the total assembly. Red text highlights residues inconsistent between the confirmed PCR sequence and transcriptome sequence. Black arrows denote the point at which sequence data from 3′ RACE extension ends-beyond this the transcriptome was necessary to complete the sequence number of accessions assigned to more than one contig, with those assigned to multiple contigs representing 55% of the total number.
As illustrated by the example identified in the Results, that of the proresilin precursor of Drosophila melanogaster, this can stretch to over a thousand contigs, although this particular annotation also appeared 274 times in a single assembly. This problem could be assuaged by relaxing the criteria for duplicates when running dedupe.sh and CD-HIT -by removing duplicates at 95% identity rather than 100%, for example-but again, given the focus on gene discovery, we preferred to tolerate redundancy over the possibility that correct transcripts may be lost, and we therefore offer this resource with the caveat that searches are likely to return multiple results worthy of investigation and that do not necessarily represent isoforms.

| Functional and comparative analysis
The coding assembly of SuperbaSE compares favorably with the majority of existing molecular resources for E. superba in terms of T A B L E 5 Preprohormone query proteins with E. superba output contigs and subsequent blastp analysis against NCBI nonredundant protein database

Peptide family (subfamily)
Output from total assembly using Accession as query Output from NCBI NR using total assembly contig as query Candidates for these genes also reported by Toullec et al. (2013).
T A B L E 5 (Continued) T A B L E 6 Comparison of circadian-related contigs from the total assembly of SuperbaSE and equivalents found in KrillDB. Longest accepted candidate contig shown for each assembly. "Agreement" shows percentage of identical residues in the largest HSP in each search result alignment

| Circadian gene discovery
Using SuperbaSE, we have successfully identified the complete sequences for Es-per and Es-tim, two core circadian genes over 3.5 kb in length with low expression levels (Supplementary Data, Table S4).
Previously cloned circadian genes of similarly low expression levels were well represented by extensive fragments, and deviations from the sequences obtained by high fidelity PCR were minimal with the exception of Es-clk. We suggest that the causes of this particular exception are 1) a misassembly at the 3′ end of the contig ES.21_ comp24303_seq0, as contig ES.31_comp16417_seq0 is in agreement with the PCR sequence, and 2) the inherent difficulty in reassembling a region showing a high level of repeats. Much of the disagreement is seen in or near an extraordinary long poly-Q region, a common feature of Clock genes (Johnsen et al., 2007;O'Malley & Banks, 2008;Saleem, Anand, Jain, & Brahmachari, 2001); to our knowledge, the poly-Q of Es-CLK is the longest example yet found, surpassing that of the river prawn Macrobrachium rosenbergii (Yang, Dai, Yang, & Yang, 2006). Our findings show that E. superba possesses a full complement of core clock proteins in contrast to D. melanogaster, which lacks a CRY2, and M. musculus, which lacks TIMELESS and CRY1 (Reppert & Weaver, 2000). The krill clock shares this characteristic with a number of other arthropod species including the monarch butterfly Danaus plexippus (Zhu et al., 2008), and it has been suggested that this represents an ancestral form from which the fly and mouse clocks have been derived.
Further research is now needed to link the molecular clock to the rhythmic phenomena observed in E. superba. Diel vertical migration, in which krill rise to the surface at night to feed and return to deeper waters at dawn, is likely influenced by the biological clock (Gaten et al., 2008). Krill also exhibit annual rhythms with a molt-shrink regression to a juvenile state over winter, returning to maturity in spring (Kawaguchi, Yoshida, Finley, Cramp, & Nicol, 2007), and seasonal cycling of metabolic rate (Meyer et al., 2010). At the genetic level, this may be governed by photoperiod (Seear et al., 2009), and circadian clock genes have been F I G U R E 8 Sequence and annotation data found in SuperbaSE for circadian gene Es-period suggested to play a role in photoperiodic responses in D. melanogaster (Pegoraro, Gesto, Kyriacou, & Tauber, 2014). Es-cry2 has shown rhythmic expression in both light-dark cycling conditions and in constant darkness with a short circadian period of 18 hr (Teschke, Wendt, Kawaguchi, Kramer, & Meyer, 2011), a finding interpreted as evidence of a clock with a wide range of entrainment and consistent with the extreme variation in photoperiod that E. superba is subject to, from constant light in summer to constant darkness during winter. With a full set of core circadian genes now cloned and sequenced, this intriguing system can be further characterized, and our preliminary results from assays on the interactions of the krill clock proteins indicate that the krill clock retains the functionality of both the fly and mouse mechanisms (data will be presented elsewhere) to an extent not seen in other species so far studied.
We have furthermore identified 18 putative full coding sequences for regulatory and clock-controlled genes plus extensive fragments for three more such genes of interest and 21 preprohormone candidate contigs, the majority not previously reported. As an example of its usage outside the field of chronobiology, SuperbaSE has also been employed in the identification of an ancient conserved noncoding element in the 5′ region of the Not1 gene (Davies, Krusche, Tauber, & Ott, 2015).

| CONCLUSION
We have described the assembly and annotation of SuperbaSE, a new transcriptome resource for Euphausia superba. Using this resource, we have successfully identified an extensive suite of circadian and clock-related genes, including some that had proven difficult to clone using traditional methods. This has laid the groundwork for the molecular dissection of the circadian clock in this vital species in order to understand how these components interact to generate genetic, T A B L E 7 Comparison of published E. superba de novo assemblies. "No. of reads" shows raw read count of new sequencing based on the samples described in this