bradysia_coprophila.bcop_v1.0.tar.gz (96.8 MB)

Bradysia coprophila genome annotations Bcop_v1.0

dataset
posted on 2024-02-16, 15:39 authored by John Urban

This dataset presents the Bradysia coprophila genome annotations Bcop_v1.0. It will be used as a starting point to manually improve annotations.

The annotations were generated using Maker2. Detailed bioinformatic methods are given in the supplemental material of our preprint, "Single-molecule sequencing of long DNA molecules allows high contiguity de novo genome assembly for the fungus fly, Sciara coprophila" (doi: https://doi.org/10.1101/2020.02.24.963009); see the Table of Contents therein. A far briefer description is below. Note that Sciara coprophila is a synonym of Bradysia coprophila; the older name was used in the title of our publication for historical reasons.

Repeat library used for masking: Species-specific repeat libraries were built using RepeatModeler. A more comprehensive repeat library was then created by adding previously known repeat sequences from Bradysia coprophila and all arthropod repeats in the RepeatMasker Combined Database (Dfam_Consensus-20181026, RepBase-20181026). The comprehensive repeat library was used with RepeatMasker as part of the Maker2 pipeline.

Automated gene finding: To predict protein-coding genes, Maker2 was used to take advantage of three sources of evidence: RNA-seq expression evidence, homology, and gene prediction. RNA-seq data from both male and female embryos, larvae, pupae, and adults were combined to create transcriptome assemblies using Trinity (de novo) and HiSat2 followed by StringTie (genome-guided). The transcriptome assemblies were used as EST evidence in Maker2. Transcript and protein sequences from related species were used as homology evidence. Three gene predictors were used: Augustus, SNAP, and GeneMark-ES. See the supplemental materials in our preprint for more information on the iterative Maker2 rounds, the training of each gene predictor, RNA-seq methods, and transcriptome assembly generation. The Maker2 gene annotations from the final round were evaluated using annotation edit distances, BUSCO, RSEM-Eval, and TransRate.
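
As a rough illustration of the annotation edit distance (AED) evaluation mentioned above, the sketch below collects AED values from mRNA features in a Maker-style GFF3 and reports the fraction at or below a cutoff. It assumes the transcripts carry the `_AED=` attribute that Maker normally writes in column 9; whether the released gene-set file retains that attribute is not confirmed here. The function names, the 0.5 cutoff, and the sample records are hypothetical and are not taken from this dataset.

```python
import re

# Maker normally records AED on mRNA features as "_AED=<value>" in column 9.
# Assumption: the GFF being read still carries this attribute.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def collect_aed(gff_lines):
    """Return the AED value of every mRNA feature that has one."""
    values = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_PATTERN.search(cols[8])
        if match:
            values.append(float(match.group(1)))
    return values


def fraction_at_or_below(values, cutoff=0.5):
    """Fraction of AED values <= cutoff (0.0 when no values were found)."""
    if not values:
        return 0.0
    return sum(v <= cutoff for v in values) / len(values)


# Hypothetical records for demonstration only.
SAMPLE = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1;Name=g1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10;_eAED=0.10",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.75;_eAED=0.80",
]

aed_values = collect_aed(SAMPLE)
print(aed_values, fraction_at_or_below(aed_values))
```

Lower AED means closer agreement between an annotation and its supporting evidence, so the fraction of transcripts under a chosen cutoff is one simple summary of annotation quality.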

Functional information: InterProScan was used to identify Pfam domains and GO terms from predicted protein sequences, and BLASTp was used to find the best matches to curated proteins in the UniProtKB/Swiss-Prot database.


Resources in this dataset:

  • Resource Title: Bradysia coprophila genome annotations Bcop_v1.0.

    File Name: bradysia_coprophila.bcop_v1.0.tar.gz

    Resource Description:

    Primary file:
      - Bradysia_coprophila.Bcop_v1.0_gene_set.gff
        - Contains automated annotations from Maker2 (described in https://doi.org/10.1101/2020.02.24.963009).
        - This is the main file in this tar archive.
        - The reference genome fasta is available from GenBank: https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1
        - The Seqid in Column 1 of this gff3 file corresponds to the 'Sequence name' in the GenBank assembly report: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt

    Supplementary files:
      - Bradysia_coprophila.Bcop_v1.0_evidence.rnd3.gff
        - Contains aligned evidence Maker2 used.
      - Bradysia_coprophila.Bcop_v1.0_masked_genome.rnd3.gff
        - Contains coordinates for masked regions of the genome as seen by Maker2.
      - Bradysia_coprophila.Bcop_v1.0_proteins_with_putative_function.fasta
        - Contains predicted protein sequences.
      - Bradysia_coprophila.Bcop_v1.0_transcripts_with_putative_function.fasta
        - Contains predicted transcript sequences.
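
Because the Seqid in column 1 of the GFF3 matches the 'Sequence name' in the GenBank assembly report, the two files can be joined to translate annotation coordinates onto GenBank accessions. The following is a minimal sketch of that join, assuming the usual NCBI assembly-report layout (tab-delimited, `#`-prefixed header lines, Sequence-Name in column 1 and GenBank-Accn in column 5). The function names, sequence names, and accessions shown are hypothetical placeholders, not values from this dataset.

```python
from collections import Counter


def load_name_to_accession(report_lines):
    """Map 'Sequence-Name' (column 1) to 'GenBank-Accn' (column 5).

    Assumption: standard NCBI assembly-report column order.
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            mapping[fields[0]] = fields[4]
    return mapping


def count_features_by_accession(gff_lines, name_to_accession):
    """Count GFF3 features per (GenBank accession, feature type)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # Fall back to the original Seqid when the report has no entry for it.
        accession = name_to_accession.get(cols[0], cols[0])
        counts[(accession, cols[2])] += 1
    return counts


# Hypothetical stand-ins for the assembly report and the gene-set GFF3.
REPORT = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name",
    "example_contig_1\tunplaced-scaffold\tna\tna\tEXAMPLE000001.1\t<>\tna\t"
    "Primary Assembly\t5000\tna",
]
GFF = [
    "##gff-version 3",
    "example_contig_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "example_contig_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1",
    "example_contig_1\tmaker\texon\t100\t900\t.\t+\t.\tID=g1-RA:1;Parent=g1-RA",
]

name_to_accession = load_name_to_accession(REPORT)
feature_counts = count_features_by_accession(GFF, name_to_accession)
print(dict(feature_counts))
```

With the real files, open file handles can be passed in place of the sample lists, since both functions only iterate over lines.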

Funding

National Science Foundation: MCB-1607411

National Institutes of Health: GM121455

National Institutes of Health: T32-GM007601

National Science Foundation: EPSCoR #1004057

National Science Foundation: GRFP-DGE-1058262

Data contact name

Urban, John

Data contact email

dr.john.urban@gmail.com

Publisher

Ag Data Commons

Temporal Extent Start Date

2013-01-01

Temporal Extent End Date

2016-12-31

Theme

  • Not specified

Geographic Coverage

{"type":"FeatureCollection","features":[{"geometry":{"type":"Polygon","coordinates":[[[-76.625551823527,39.33137246469],[-76.624420434237,39.32985385411],[-76.623266665265,39.330517889657],[-76.623146468773,39.331253300693],[-76.624360503629,39.331666289305],[-76.625551823527,39.33137246469]]]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-76.624924354255,39.331016171513]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-73.469181396067,40.860168798284]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-71.401475393213,41.828731518494]},"type":"Feature","properties":{}}]}

Geographic location - description

USA: Northeast

ISO Topic Category

  • biota

Ag Data Commons Group

  • Insects - i5K

National Agricultural Library Thesaurus terms

genomics; data collection; bioinformatics; DNA; genome assembly; fungi; arthropods; databases; automation; prediction; males; females; larvae; pupae; adults; transcriptome; expressed sequence tags; amino acid sequences; proteins; Bradysia; Sciara coprophila

Pending citation

  • No

Public Access Level

  • Public

Preferred dataset citation

Urban, John (2021). Bradysia coprophila genome annotations Bcop_v1.0. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1522618
