Exploring new genomic territories with emerging model insects
Genome-enabled model insect systems

Improvements in reference genome generation for insects and across the tree of life are extending the concept and utility of model organisms beyond traditional laboratory-tractable supermodels. Species or groups of species with comprehensive genome resources can be developed into model systems for studying a large variety of biological phenomena. Advances in sequencing and assembly technologies are supporting these emerging genome-enabled model systems by producing resources that are increasingly accurate and complete. Nevertheless, quality controls including assessing gene content completeness are required to ensure that these data can be included in expanding catalogues of high-quality references that will greatly advance understanding of insect biology and evolution.
Using a chromosome-level genome assembly of the aphid, Myzus persicae, and 40X short-read sequencing of 127 clones derived from 19 countries, this study shows how a high-quality reference enables comprehensive characterisations of genome-wide patterns of global genetic variation. This system helps understand genomic responses of insects to insecticide exposures and is particularly relevant for the control of agricultural pests.
Chromosome-level assemblies generated for five of the 17 bumblebee species in this study allowed tracing of the rearrangements that created the unusual 25-chromosome karyotype in social parasites. The high-quality genome resources from sampling species across the genus supported the quantification of genetic and genomic variation across the Bombus phylogeny, where high levels of gene tree discordance are likely driven by incomplete lineage sorting.


Introduction
Model organisms can be described as non-human species that are studied to advance the understanding of biological phenomena, with traditional model species being easily bred in the laboratory and amenable to experimental manipulation [1]. The common ancestry of living organisms means that insights from such models also inform knowledge of molecular and genetic mechanisms underlying common biological functions across the tree of life.
Among insects, the renowned model is the fruit fly Drosophila melanogaster, which has supported ground-breaking work in fields from genetics and heredity to behaviour, physiology, development, immunity, and countless others [2]. A major contributing factor to the success of Drosophila as a versatile model over the last two decades was the establishment of a reference genome assembly and its functional genomic element annotations [3]. Developing new models with reference genomes and experimental tools analogous to those available for Drosophila can be challenging, but is important for diversifying the systems we use to learn about organismal biology [4,5]. Currently, substantial advances in sequencing technologies mean that it can be more readily feasible to generate a high-quality genome for a new species than it is to rear that species in the laboratory. This genomics revolution is opening up a whole new set of possibilities, with a shift from the traditional model organism to the concept of species or groups of species that can be developed into new model systems for studying a large variety of biological phenomena at many different levels [6,7].

Conserved orthologues help gauge gene content completeness of accumulating genome resources
Recent surveys of the current status of available genome resources for insects focus on taxonomic representation, assembly quality metrics, gene content completeness, and sequencing technology usage [8-10]. These highlight the continued rapid accumulation since previous surveys, e.g. [11,12], and show current biases in species sampling, with several insect orders still lacking publicly available resources. Notably, long-read data, e.g. from approaches developed by Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), are helping to improve assembly contiguity and produce more complete and accurate representations of new and upgraded insect genomes. For these resources to support the development of emerging model systems, they need to be of the highest possible quality, not only in terms of assembly statistics but also with respect to gene content representation.
Journal Pre-proof
The need to assess quality in terms of expected gene content prompted the proposal of Benchmarking Universal Single-Copy Orthologues (BUSCOs) [13]. BUSCO relies on the expectation that single-copy orthologues present in most species within a taxonomic lineage should be identifiable in any new genome from a species in the same clade. The BUSCO lineage datasets are built by identifying near-universal single-copy orthologues from the OrthoDB orthology resource [13-15]. Using these to evaluate assemblies starts with BUSCO sequence searches to guide gene predictions, after which orthology classifications identify complete, duplicated, or fragmented BUSCOs. The numbers of identifiable BUSCOs provide an indication of gene content completeness based on expected subsets of evolutionarily conserved genes for a given lineage. High completeness scores thereby imply that a genome assembly confidently represents the complete gene repertoire.
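The way BUSCO classifications translate into a completeness score can be sketched in a few lines of Python. This is a minimal illustration and not the BUSCO implementation: the `BuscoCounts` container and both function names are hypothetical, the counts in the usage line are invented, and the "missing" category is assumed to be whatever remains once complete and fragmented BUSCOs are accounted for. The summary string follows the compact C/S/D/F/M notation used in BUSCO reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """Hypothetical container: number of BUSCO genes in each category."""
    single_copy: int   # complete and found exactly once
    duplicated: int    # complete but found more than once
    fragmented: int    # only partially recovered
    missing: int       # not recovered at all (assumed remainder category)

    @property
    def total(self) -> int:
        # Size of the lineage dataset searched against the assembly
        return self.single_copy + self.duplicated + self.fragmented + self.missing


def busco_percentages(counts: BuscoCounts) -> dict[str, float]:
    """Convert raw category counts into percentages of the lineage dataset."""
    total = counts.total
    if total <= 0:
        raise ValueError("at least one BUSCO gene is required")

    def pct(n: int) -> float:
        return 100.0 * n / total

    return {
        # 'complete' combines single-copy and duplicated BUSCOs
        "complete": pct(counts.single_copy + counts.duplicated),
        "single_copy": pct(counts.single_copy),
        "duplicated": pct(counts.duplicated),
        "fragmented": pct(counts.fragmented),
        "missing": pct(counts.missing),
    }


def busco_summary(counts: BuscoCounts) -> str:
    """Format the percentages in the compact C/S/D/F/M notation."""
    p = busco_percentages(counts)
    return (
        f"C:{p['complete']:.1f}%[S:{p['single_copy']:.1f}%,"
        f"D:{p['duplicated']:.1f}%],F:{p['fragmented']:.1f}%,"
        f"M:{p['missing']:.1f}%,n:{counts.total}"
    )


if __name__ == "__main__":
    # Invented counts, for illustration only
    print(busco_summary(BuscoCounts(single_copy=1300, duplicated=20,
                                    fragmented=25, missing=22)))
```

Keeping single-copy and duplicated counts separate matters because a high duplicated fraction can indicate uncollapsed haplotypes or contamination even when the overall complete score looks good.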
BUSCO completeness is also recognised as an important quality check of resources for new model systems and for cataloguing eukaryotic genomic biodiversity. For example, the Earth BioGenome Project (EBP) standards recommendations for genome generation include recovering more than 90% of single-copy conserved genes [22].
Using results from the Arthropoda Assembly Assessment Catalogue (A3Cat) [10,23] to survey BUSCO completeness of insect genome assemblies deposited at the United States National Center for Biotechnology Information (NCBI) shows that while many do meet the EBP's standards recommendations, quality in terms of gene content completeness still varies dramatically (Figure 1). Thus, while the NCBI may currently offer more than 2'500 assemblies for insects, fewer than half of these achieve a complete and single-copy BUSCO score >90% and most do not yet reach the EBP's standard of having the majority of sequences assigned to chromosomes. Notably, however, accuracy-enhanced long-read technologies together with scaffolding approaches such as high-throughput chromatin conformation capture (Hi-C) are more consistently producing high-quality new genome resources, which are greatly expanding the possibilities for developing new insect model systems.
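The kind of survey described above amounts to applying two thresholds across a catalogue of assemblies. The sketch below shows one way this could be done; it does not reflect how A3Cat is actually queried. The `Assembly` record, its field names, the accession labels, and all values are hypothetical, and the 50% chromosome-assignment cut-off is only an assumed reading of "the majority of sequences assigned to chromosomes".

```python
from typing import Iterable, NamedTuple


class Assembly(NamedTuple):
    """Hypothetical catalogue record for one genome assembly."""
    accession: str                  # placeholder label, not a real accession
    single_copy_pct: float          # complete and single-copy BUSCO score (%)
    chromosome_assigned_pct: float  # share of sequence placed on chromosomes (%)


def meets_gene_content(a: Assembly, min_single_copy_pct: float = 90.0) -> bool:
    """True if the complete and single-copy BUSCO score exceeds the threshold."""
    return a.single_copy_pct > min_single_copy_pct


def meets_chromosome_level(a: Assembly, min_chromosome_pct: float = 50.0) -> bool:
    """True if most sequence is assigned to chromosomes (assumed cut-off)."""
    return a.chromosome_assigned_pct > min_chromosome_pct


def fraction_passing(assemblies: Iterable[Assembly], check) -> float:
    """Fraction of assemblies for which the given check returns True."""
    records = list(assemblies)
    if not records:
        raise ValueError("catalogue is empty")
    return sum(1 for a in records if check(a)) / len(records)


# Invented values, for illustration only
example_catalogue = [
    Assembly("asm_A", 96.5, 97.8),
    Assembly("asm_B", 88.0, 0.0),
    Assembly("asm_C", 91.2, 0.0),
    Assembly("asm_D", 72.4, 0.0),
]

if __name__ == "__main__":
    print(fraction_passing(example_catalogue, meets_gene_content))
    print(fraction_passing(example_catalogue, meets_chromosome_level))
```

Treating the two criteria as separate checks mirrors the point made in the text: an assembly can pass on gene content completeness while still falling short on chromosome assignment.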
Advances in taxonomic sampling of insects for genome sequencing have been reviewed for ants and other Hymenoptera [24,25], hemipterans [26], beetles [27], flies and other Diptera [28,29], butterflies and other Lepidoptera [30], and many others [9,11,31]. Here we focus on a selection of recent examples of high-quality genomics resources (Table 1) that are supporting the use of new species or groups of species to develop and expand emerging model systems that help advance understanding of insect biology and evolution.
Mayflies have long been the focus of many ecological studies, and together with dragonflies and damselflies they form the sister group to all other winged insect lineages. The recent establishment of a continuous culture system for the mayfly Cloeon dipterum [32] allows for comprehensive life-stage and tissue sampling for detailed transcriptional profiling. Combining short reads with ONT sequencing data enabled the assembly of its relatively compact genome of 180 Megabasepairs (Mbp) in 1'395 scaffolds with 96%-97% complete BUSCOs (Table 1), which was annotated with 16'357 protein-coding genes. These resources lay the foundations for investigating genomic adaptations to aquatic and aerial life and the origin of insect wings in this emerging model system [32].
Combining long reads with Hi-C data is proving to be an effective approach for generating chromosome-level assemblies. This was recently demonstrated by Sun et al. [33] for five of 17 new high-quality bumblebee genomes (Table 1), where the chromosome-level assemblies allowed tracing of the rearrangements that created the unusual 25-chromosome karyotype of the social parasites. [...] [34,35]. These genomes also offer opportunities to explore genetic factors influencing the plastic and adaptive responses that affect insect resilience to climate change [36].
Rearrangements like those observed for the social parasite bumblebees appear to be infrequent in some well-studied groups such as Diptera and Lepidoptera where global genome architectures are generally conserved. Therefore, models from other diverse insect groups are needed to investigate different modes of genome structure evolution. Indeed, analyses of high-quality chromosome-level assemblies of aphids (Table 1) show that their autosomes have undergone dramatic reorganisations in contrast to their sex chromosomes where gene content of the X chromosome has remained highly stable [37,38]. As a model system to investigate the evolution of resistance to insecticides, reference-quality aphid genomes are also enabling comprehensive assessments of within-species variation to understand genomic responses to strong selective forces [39].

The pea aphid was one of the first insects to have its genome sequenced and has served as a valuable model for understanding the genomic consequences of host-symbiont interactions. However, genomic resources for new systems are needed to explore the many types of endosymbioses found across different insects. The genome of the rice weevil, Sitophilus oryzae, is not yet assembled to chromosome level but shows high BUSCO completeness (Table 1), thereby providing a confident basis from which to investigate how key metabolic processes might be partitioned between host and endosymbiont [40]. Quality and completeness are also particularly critical when tracing cases of horizontal gene transfer, e.g. duplicated bacterial-origin mannosidases in the 1150 Mbp genome assembly of the stink bug Halyomorpha halys [41], and bacterial cell wall hydrolase genes acquired by Coccinellinae ladybird beetles identified in the high-quality genome of Cryptolaemus montrouzieri [42].
Amongst the most well-known of the Coccinellinae, the harlequin ladybird Harmonia axyridis is widely considered to be one of the world's most invasive insects. Many insects are, or have the potential to become, invasives that can cause great damage to natural ecosystems or agricultural crops. Accumulating genomics resources from a variety of insect groups are helping to diversify the models used to study invasion biology and potentially develop new genetic control measures. Hi-C data helped to build a chromosome-level assembly for the two-spot harlequin morph, but with lower BUSCO completeness than prior to Hi-C scaffolding [43] (Table 1). These data, along with assemblies for other morphs e.g. [44], also offer new opportunities to develop the use of these ladybirds, which display more than 200 described colour forms, as an important model system for investigating the genetics of colour pattern polymorphisms [45,46].
Being laboratory-tractable is a key feature of the most versatile model species. For example, the painted lady butterfly, Vanessa cardui, can be easily reared in the laboratory and is amenable to CRISPR/Cas9 genome editing, making this widespread, generalist species with complex wing patterns an excellent model. The genome assembly, recently upgraded to chromosome level [47], with transcriptomics data from multiple tissues and developmental stages provides the framework to employ genetic manipulations and functional genomics data for studying migration, host-plant coevolution, and colour patterning [48]. CRISPR/Cas9 has also been established for the tea geometrid moth, Ectropis grisescens, which, along with its relevance as an agricultural pest, presents an interesting system for studying insect interactions with plant allelochemicals as well as shape and colour adaptations for effective camouflage. Hi-C scaffolding of PacBio data placed 97.8% of the assembly on 31 chromosomes with an assembly span of 785 Mbp (Table 1) and 18,746 annotated protein-coding genes. The genome maintains the ancestral lepidopteran karyotype (n=31), and separate resequencing of male (ZZ) and female (ZW) individuals allowed for the identification of the Z chromosome and several W candidate scaffolds [49].

While still often challenging, long reads are proving particularly useful for assembling such repeat-rich insect sex chromosomes. For example, the Pieris macdunnoughii assembly (Table 1) was built using ONT long reads, where polishing with additional short-read data increased complete lepidopteran BUSCOs by almost 3%. Comparing the resolved sex chromosomes in Pieris butterflies of European and North American lineages shows that the fusion event that created the neo-Z chromosome occurred prior to their divergence [50].
These genome resources support this emerging model system for studying maladaptation in [...] [51], the tiger mosquito [52], the brown planthopper [53], and the red flour beetle [54]. Species representing emerging model systems such as the examples outlined above are expected to similarly build genome-anchored knowledgebases that support and enrich the exploration of the diversity of insect biology and evolution.
[...] groups [55]. Nevertheless, challenges such as working with large repeat-rich genomes or very small specimens from which to extract high-molecular-weight DNA mean that achieving reference-quality standards can still be arduous [12]. The active participation of the arthropod genomics community in the development of standards and provision of guidelines and protocols through initiatives coordinating the scaling up of reference genome generation is helping to overcome many of these challenges [22,56,57]. Gene content completeness and other quality assessments during production and of the resulting chromosome-level assemblies will therefore continue to play a key role in establishing genome resources that best support the development of new model systems and advance understanding of insect biology and evolution.

Declaration of Competing Interest
The authors certify that they have no affiliations with or involvement in any organisation or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
Mathers TC, Wouters RHM, Mugford ST, Swarbreck D, van Oosterhout C, Hogenhout SA: [...] Rearranged Autosomes and Long-Term Conservation of the X Chromosome. Mol [...]
This study exemplifies the coming together of establishing a new laboratory-tractable system with the generation of genome resources and extensive functional genomics data to support novel biological investigations. The mayfly genome provided the framework to explore patterns of gene expression throughout its aquatic and aerial life cycle and across different organs, and to identify a core set of genes involved in insect wing development.
In this study, the generation of a chromosome-level genome assembly of this charismatic ladybird additionally identified the X chromosome and Y-linked scaffolds by separately resequencing males and females. These resources support the development of the harlequin as a model for studying invasion biology in insects, and, with more than 200 described colour forms, for investigating the genetics of colour pattern polymorphisms.