avidaR: an R library to perform complex queries on an ontology-based database of digital organisms

Digital evolution is a branch of artificial life in which self-replicating computer programs—digital organisms—mutate and evolve within a user-defined computational environment. In spite of its value in biology, we still lack an up-to-date and comprehensive database on digital organisms resulting from evolution experiments. Therefore, we have developed an ontology-based semantic database—avidaDB—and an R package—avidaR—that provides users of the R programming language with an easy-to-use tool for performing complex queries without specific knowledge of SPARQL or RDF. avidaR can be used to do research on robustness, evolvability, complexity, phenotypic plasticity, gene regulatory networks, and genomic architecture by retrieving the genomes, phenotypes, and transcriptomes of more than a million digital organisms available on avidaDB. avidaR is already accepted on CRAN (i.e., a comprehensive collection of R packages contributed by the R community) and will make biologists better equipped to embrace the field of digital evolution.

Avida is the most widely-used software platform for the study of evolution (Ofria & Wilke, 2004).Within this computational framework of open-ended evolution, digital organisms, which are self-replicating computer programs, engage in competition for limited resources such as memory space and central processing unit (CPU) time.The genome of a digital organism consists of a circular sequence of code instructions that is continuously executed by a virtual CPU.Certain instructions within the genome facilitate the replication of the organism, acting as the sole mechanism for transmitting its genetic material to future generations.However, this replication process is susceptible to errors or mutations.When an instruction is inaccurately copied, it is replaced in the offspring genome by a randomly selected instruction from a set of 26 possibilities, distributed uniformly.Other instruction sets are available, for example when working with parasites (i.e., organisms that steal CPU cycles from other organisms-their hosts-to speed up their own replication process).The resulting variation generates competition among organisms as the population grows and uses up available resources.In order to metabolize these resources and thus replicate, organisms must compute Boolean logic operations-such as NOT and NAND-on binary numbers taken from the environment through input-output instructions encoded in their genomes.The organism that is fittest, in the sense of being the faster replicator while encoding the phenotype required to metabolize a resource in a specific environment, comes to dominate the population through natural selection.
In spite of the extended used of Avida, no effort has been made to build and provide access within the scientific community to an up-to-date and comprehensive database on digital organisms.We have developed the avidaR package to provide users of the R programming language (i.e., an open-source statistical programming language widely used for data analysis, statistical modeling, and graphical visualization) with easy-to-use tools for performing complex queries on avidaDB-an ontology-based database containing the genomes, phenotypes and transcriptomes of more than a million digital organisms obtained from evolution experiments carried out in Avida.

DIGITAL EVOLUTION avidaDB
We have built a comprehensive database that encompasses the genomes, phenotypes, and transcriptomes of over one million digital organisms resulting from evolution experiments carried out using Avida.A significant portion of the data, more than half, was obtained by randomly sampling the sequence space of genomes that were 100 instructions long (Fortuna et al., 2017).These genomes encoded the same phenotype within a single environment.Furthermore, in a subsequent study (Fortuna, 2022), the phenotype encoded by the genomes of these 512,000 organisms, as well as their transcriptomes, were computed in a set of 1000 distinct environments.The database includes the following key data from these two experiments: • genome instruction sequence: a linear string of letters representing the instruction codes that make up the genome of a digital organism.
• phenotype: a unique combination of logic operations that a digital organism performs on 32-bit one-and two-binary input numbers.
• transcriptome instruction sequence: a linear string of letters representing the instruction codes that make up the transcriptome executed by a digital organism.
In a third study (Rico and Fortuna, unpublished), we extended the analysis by computing the phenotype and transcriptome of all single-point mutants for 200 digital organisms from the previous dataset.These single-point mutants were generated by introducing a single mutation at different positions in the genome.The computations were performed in the same 1,000 environments as in the previous study.As a result, we obtained an additional dataset comprising 500,000 digital organisms (2500 single-point mutants for each of the 200 organisms).In this expanded dataset, we have stored the same type of data as in the previous studies, including genomes, phenotypes, and transcriptomes.Moreover, we have also included information about the specific genomic locations of all single-point mutations: • mutant of: a relation linking two digital organisms if their genomes differ in a single instruction code.
• position at genome: the position of an instruction code along the digital genome of a digital organism.
In addition to the genomic, phenotypic, and transcriptomic data, we have also computed and stored information on the viability and generation time of the 1,012,000 digital organisms in the same 1,000 environments: • viable: the ability of a digital organism to produce an offspring able to replicate by executing its genome.
• generation time: number of instruction codes executed by a digital organism to produce a viable offspring (i.e., the inverse of fitness in the absence of any selective pressure other than reducing the number of instructions that must be executed for successful replication).

Tracing the provenance and reproducibility of data
Clarifying the provenance of data is crucial for traceability and reproducibility in scientific research.We have added functions into avidaR to trace the context and provenance of the data obtained from Avida experiments: • get_experiment_id_from_organism_id() # establishing a connection between a digital organism and the experiment it was derived from enables researchers to easily retrieve the relevant experiment information associated with that specific organism.
• get_doi_from_experiment_id() # establishing a connection between an experiment and the identifier of the scientific publication it was reported in enables researchers to easily access the publication, which provides additional details on the experimental design.
• get_docker_image_from_experiment_id() # establishing a connection between and experiment and a docker image that contains the configuration files and options used in the experiment enables researchers to reproduce the experimental conditions and obtain repeatable results.

Providing semantics to avidaDB
avidaDB has been implemented as an RDF store (also known as graph database or triple store) to express relationships between resources using subject-predicate-object triples.
The most relevant advantage of RDF stores is that they represent, store and query data as a graph.The second characteristic is that they are semantic, which means that they can store not only data but also explicit descriptions of the meaning of that data.These explicit descriptions are known by the RDF and linked data community as ontologies.An ontology is a machine-readable description of a domain that typically includes a vocabulary of terms and some specification of how these terms relate to each other, imposing a structure on the data for such domain.This is also known as a schema.
We have developed the Ontology for Avida (OntoAvida) to provide semantics to avidaDB.The semantic relationships between the terms commonly used in Avida are expressed in the W3C standard ontology language OWL-DL.OntoAvida is part of the Open Biological and Biomedical Ontologies (https://obofoundry.org/ontology/ontoavida.html) and is already available to the scientific community.We have also customized pyLODE to obtain a readable version of OntoAvida, which is also available at https://owl.fortunalab.org/ontoavida.

Connecting to avidaDB
Semantic databases are databases that store RDF data and allow the querying of RDF data via the SPARQL query language (i.e., a programming language used to retrieve and manipulate data stored as subject-predicate-object triples).The library avidaR can connect to triplestores that support the RDF4J server REST API such as GraphDB.The SPARQL endpoint of avidaDB (https://graphdb.fortunalab.org/)allows both basic connection (requiring no password or requiring basic HTTP user-password authentication) or connection secured with an API access token.The way avidaR connects avidaDB requires user-password authentication (i.e., public_avida as both username and password).
avidaR retrieves data from avidaDB without specific knowledge of SPARQL or RDF.The functions in avidaR can be grouped by the type of data requested: (i) genomes as sequences of letters representing code instructions (e.g., get genomes encoding specific phenotypes in particular environments), (ii) phenotypes as a combination of the logic operations computed (e.g., get the phenotype of organisms that computed specific logic operations), (iii) transcriptomes executed during the replication cycle as both sequences of letters representing code instructions and chord diagrams (e.g., get the transcriptomes of organisms whose genomes encoded specific phenotypes in specific environments), and (iv) single-point mutants at specific positions on the genome (e.g., get the genome of the single-point mutant that substituted the instruction coded by letter i by j at a particular position on the genome of a wild-type organism).Users can build programming workflows by piping multiple functions together so that the output of one becomes the input to another (e.g., get a genome encoding different phenotypes in distinct environments (i.e., plastic organism), and retrieve the generation time as a surrogate of fitness in the absence of any selective pressure other than reducing the number of instructions executed).

R FUNCTIONS
Below are some of the functionalities that can be found in avidaDB: avidaDB access (required to establish a connection to the database) • avidaDB <-triplestore_access$new() # create triplestore object.

avidaDB summary
• get_db_summary(triplestore = avidaDB) # get a summary of the content of the database.

Logic operations computed by digital organisms
• logic_operation() # get the list of logic operations that a digital organism can compute (i.e., the default logic-9 environment).
• get_logic_operation_from_phenotype_id() # get the list of logic operations computed by a digital organism whose genome encodes a specific phenotype out of the 512 distinct phenotypes comprising the entire phenotype space (i.e., ranging from phenotype 0 to phenotype 511).

Genomes harbored by digital organisms
• get_genome_id_from_logic_operation() # get genome identifiers of organisms that computed specific logic operations.
• get_genome_id_from_transcriptome_id() # get genome identifiers of organisms that executed specific transcriptomes in particular environments.

Phenotypes encoded by genomes of digital organisms
• get_phenotype_id_from_genome_id() # get the phenotype identifier encoded by a genome identifier.
• get_phenotype_id_from_genome_seq() # get the phenotype identifier encoded by a genome sequence.
• get_phenotype_id_from_logic_operation() # get the phenotype identifier of organisms that computed specific logic operations.
• get_phenotype_id_from_transcriptome_id() # get the phenotype identifier of organisms that executed specific transcriptomes in specific environments.

Transcriptomes executed by digital organisms
• get_transcriptome_id_from_genome_id() # get the transcriptome identifier encoded by a genome identifier.
• get_transcriptome_id_from_genome_seq() # get the transcriptome identifier encoded by a genome sequence.
• get_transcriptome_id_from_logic_operation() # get the transcriptome identifier of organisms that computed specific logic operations.
• get_transcriptome_id_from_phenotype_id() # get the transcriptome identifier of organisms whose genomes encoded specific phenotypes in specific environments.

Single-point mutants
• get_genome_id_of_wild_type_organisms() # get the genome identifier of the list of organisms for which genomic, transcriptomic, and phenotypic information of all their single-point mutants is available (i.e., wild-type organisms).
• get_mutant_at_pos() # get the genome sequence of specific single-point mutants at a particular position on the genome of a wild-type organism.

CASE STUDIES
We illustrate the use of avidaR by addressing two case studies: (i) quantifying the fitness cost of phenotypic plasticity, and (ii) identifying mutations with positive effects on replication time as a surrogate of fitness.The analysis was performed using avidaR version 1.2.0.

The fitness cost of phenotypic plasticity
In the first case study, we investigate the impact of phenotypic plasticity on the fitness of digital organisms.Phenotypic plasticity is not only a pervasive feature of biological organisms but also of digital ones (Fortuna, 2022).Specifically, the environment can modify the sequence of instructions executed from a digital organism's genome (i.e., its transcriptome), which results in changes in its phenotype (i.e., the ability of the digital organism to perform Boolean logic operations).This epigenetic pathway for plasticity comes at a fitness cost to an organism's viability and replication time: the longer the replication time (higher fitness cost), the more chances for the environment to modify the genetic execution flow control, and the higher the likelihood for the genome to encode novel phenotypes.
We obtain a set of genomes that exhibit phenotypic plasticity, where organisms can switch between different phenotypes in response to changes in their environment.We also collect a set of genomes that do not exhibit plasticity and maintain a fixed phenotype across all environments.By comparing the replication times of organisms with and without plasticity in different environments, we quantify the fitness cost associated with phenotypic plasticity.Here is the code to retrieve the data and perform the analysis.This preliminary finding suggests that plasticity may incur a fitness cost for simple phenotypes, but not for complex phenotypes.This is evidenced by an increased time for digital organisms to replicate, indicating a decrease in fitness (Fig. 1).

The fitness effects of single-point mutations
In the second case study, we retrieve data from avidaDB to identify the single-point mutation that has the largest positive effect on replication rate.We consider the replication rate as a surrogate for fitness, where a higher replication rate corresponds to higher fitness.Replication in digital organisms involves the copying of the genome instruction by instruction into a new memory region, a process that can occasionally result in errors known as mutations.Mutations occur when an instruction is inaccurately copied, leading to its replacement in the offspring genome with a randomly selected instruction.Mutations can have different effects on fitness, including being lethal (resulting in nonviable offspring), neutral (no change in phenotype or fitness), deleterious (decreasing fitness), or beneficial (increasing fitness).
The process of identifying the most significant single-point mutation on replication rate involves multiple steps: • randomly sampling the genome sequence of a single organism (wild-type organism).
• retrieving the transcriptome of the wild-type organism and its single-point mutants in a specific environment.
• selecting the mutant with the highest fitness based on generation time as a surrogate measure.Figure 1 Simple analysis to illustrate the use of avidaR to quantify the fitness costs of phenotypic plasticity.We suggest that plastic organisms whose genomes encode a simple phenotype (i.e., digital organisms able to compute the Boolean operation NOT) have longer generation times (i.e., lower fitness) than non-plastic ones.In contrast, organisms whose genomes encode complex phenotypes (i.e., digital organisms able to compute the Boolean operation EQUALS) show the opposite.Generation time was measured as the number of instructions that a digital organism must execute to produce an offspring.
Full-size DOI: 10.7717/peerjcs.1568/fig- 1 • comparing the genome sequences of the wild-type organism and the fittest mutant to identify the specific mutated instruction and its position in the genome.
• finally, plotting the transcriptomes of the wild-type organism and the fittest mutant for comparison (Fig. 2).
Here is the code to retrieve the data and perform the analysis.A single-point mutation has the potential to significantly impact fitness by altering the execution flow of the instructions within a digital organism's genome.In our analysis, we observed a notable fitness improvement resulting from a specific mutation.Specifically, the substitution of the math instruction n (dec) at position 54 in the genome with the control-flow instruction h (jump-head) led to a 29% increase in the replication rate of the digital organism (Fig. 2).This finding highlights the important influence that individual mutations can have on the performance and fitness of digital organisms.

FUTURE DIRECTIONS
Digital evolution presents a highly promising avenue within the field of ecology and evolution, offering an intermediary level of complexity between real-life systems and traditional mathematical models.By studying digital organisms, researchers gain unprecedented insights into evolutionary processes that are otherwise challenging to explore using natural systems alone.Consequently, digital organisms offer a complementary approach to studying natural or experimental evolution.It is important to note, however, that while predictions from evolving digital organisms can provide valuable insights, there may exist substantial differences in the underlying mechanisms between digital and biological organisms.Nonetheless, the fundamental operational processes, such as Darwinian evolution, remain equivalent.Thus, studies using digital organisms can help us elucidate the principles governing evolutionary dynamics across various scales.
The avidaR library was developed as part of a larger project aimed at providing semantics to the data generated from experiments carried out on the Avida software platform.It represents the first R package that enables users to perform advanced queries on an ontology-based semantic database, which stores the genomes, phenotypes, and The transcriptomes of a wild-type digital organism (left) and its single-point mutant with the largest positive effect on fitness (right) are represented as chord diagrams.The instructions of the 100-length genome that were actually executed are represented as segments around the circle.The size of each segment is proportional to the number of times that the instruction was executed, and its color depicts the type of instruction.Gaps between segments indicate parts of the genome that were not executed.The first instruction of the genome that is executed is placed at the top center of the chord diagram.Some numbers indicating genome positions were placed along the genome to illustrate the changes in the execution flow that took place when the math-instruction labelled as n at position 54 was substituted by the control-flow instruction labelled as h.This single-point mutation, depicted in bold and marked with an asterisk, increases 29% the replication time of the mutant organism.Full-size DOI: 10.7717/peerjcs.1568/fig- 2 transcriptomes of over a million digital organisms resulting from evolution experiments.This integrated framework empowers biologists by equipping them with the necessary tools to explore and analyze the field of digital evolution more effectively.By leveraging the functionalities of avidaR, researchers can gain deeper insights into the evolutionary processes and dynamics of digital organisms.The power of avidaR as a tool for evolutionary analysis is exemplified by two case studies that highlight its capabilities in addressing complex questions related to fitness, plasticity, and genetic variation.The Jupyter Notebook available at the following URL https://gitlab.com/fortunalab/avidaR/-/blob/main/avidaR.ipynb,provides additional examples of the versatile applications of avidaR.In this notebook, you will find detailed examples and code snippets showcasing various features and functionalities of avidaR.It serves as a comprehensive resource for users to explore and understand the capabilities of the package.By working through the provided examples, researchers can gain practical insights into how to effectively utilize avidaR for their own evolutionary analyses and experiments.We will expand avidaDB by including the genomes of parasite organisms from our previous and ongoing research.Parasitic digital organisms are almost identical to any other organism, and as such they compute Boolean logic operations and self-replicate by copying their genome instruction-by-instruction into a new memory space.But they operate inside hosts, stealing CPU cycles from them to execute their own genome's instructions and, hence, reduce their host fitness.
avidaR is a valuable tool to perform studies that lie at the intersection of evolutionary biology and computer science.Its potential transcends traditional academic boundaries and will definitely provide fertile ground for new collaborations bridging these disciplines.Our R package will appeal not only to students of evolutionary biology and computer science, but also to synthetic biologists, systems engineers, students of the origin of life, and philosophers.

Figure 2
Figure2Changes in transcription as a consequence of a single-point mutation.The transcriptomes of a wild-type digital organism (left) and its single-point mutant with the largest positive effect on fitness (right) are represented as chord diagrams.The instructions of the 100-length genome that were actually executed are represented as segments around the circle.The size of each segment is proportional to the number of times that the instruction was executed, and its color depicts the type of instruction.Gaps between segments indicate parts of the genome that were not executed.The first instruction of the genome that is executed is placed at the top center of the chord diagram.Some numbers indicating genome positions were placed along the genome to illustrate the changes in the execution flow that took place when the math-instruction labelled as n at position 54 was substituted by the control-flow instruction labelled as h.This single-point mutation, depicted in bold and marked with an asterisk, increases 29% the replication time of the mutant organism.Full-size DOI: 10.7717/peerjcs.1568/fig-2