Categorizing Sequences of Concern by Function To Better Assess Mechanisms of Microbial Pathogenesis

ABSTRACT To identify sequences with a role in microbial pathogenesis, we assessed the adequacy of their annotation by existing controlled vocabularies and sequence databases. Our goal was to regularize descriptions of microbial pathogenesis for improved integration with bioinformatic applications. Here, we review the challenges of annotating sequences for pathogenic activity. We relate the categorization of more than 2,750 sequences of pathogenic microbes through a controlled vocabulary called Functions of Sequences of Concern (FunSoCs). These allow for an ease of description by both humans and machines. We provide a subset of 220 fully annotated sequences in the supplemental material as examples. The use of this compact (∼30 terms), controlled vocabulary has potential benefits for research in microbial genomics, public health, biosecurity, biosurveillance, and the characterization of new and emerging pathogens.

There are few terms for describing parasitism of hosts as practiced at the molecular level by bacterial, fungal, and protozoal pathogens. What terms there are have few annotations associated with them. Often, the only hint in UniProt that a sequence might be involved in deleterious host-affecting activities was through the tag "GO:0009405 (pathogenesis)." As of June 2021, this term was associated with over 277,000 UniProt accession numbers. Interestingly, the GO:0009405 pathogenesis term has been scheduled for obsolescence, with the final notice given in March 2021 (https://github.com/geneontology/go-annotation/issues/3452).
SoCs are not limited to organisms and toxins on the select agent lists (1). Simply listing the genes of those microbes and toxins would include tens of thousands of innocuous sequences that these parasites share with their close, but nonpathogenic and even nonsymbiotic, relatives (i.e., false positives). This also neglects sequences that cause damage or enable infection from human-disease-causing microbes not deemed serious enough for inclusion on select agent lists (i.e., false negatives). This minireview offers criteria to identify SoCs based on an analysis of more than 2,750 sequences culled from the professional literature for more than 105 bacterial species, 85 viruses, and 25 eukaryotic pathogens. We describe an approach to better characterize these sequences for bioinformatic applications.

WHAT ORGANISMS ENCODE SEQUENCES OF CONCERN?
Of the hundreds of thousands of species of bacteria, fungi, protozoa, worms, and viruses on the planet, only a small percentage have been documented to cause disease in the primate Homo sapiens. It was estimated in 2007 that ∼1,400 microbes and parasites can produce disease in humans. Of these, 541 were bacterial, 325 fungal, 285 helminthic, 189 viral, and 57 protozoal (4). Further studies indicated that ∼600 fungi can cause disease in humans (5), and well over 200 RNA viruses can infect humans (6), so the total number of human-disease-causing entities is greater than 1,750 and is probably closer to 2,000.
Parasites are distinguished from closely related symbionts by their expression of specific molecules that, when deployed appropriately, can cause a loss of homeostasis (i.e., disease) in a susceptible host. Particular environmental conditions can dispose a host toward greater susceptibility and a parasite toward greater disease-generating ability (7). While many sequences from human-disease-causing microbes have been examined empirically, "the majority ... from the microorganisms responsible for the world's most prevalent diseases remain poorly defined and uncharacterized" (8).

MICROBIAL PATHOGENESIS AND VIRULENCE FACTORS
Practitioners of the biological subspecialty of microbial pathogenesis, a hybrid of cellular biology, molecular biology, and microbiology, investigate the sequences by which microbes exploit host organisms. Perhaps the earliest exploration occurred 50 years ago in swine by Williams Smith and Margaret Linggood. They showed that nonpathogenic Escherichia coli could become an enterotoxigenic pathogen with the introduction of plasmids encoding F4 fimbriae and enterotoxin (9).
Testing a mechanism that directly contributes to pathogenesis makes for the most satisfying investigations. In 2007, experiments were conducted using mice of the same genetic background, while the Citrobacter rodentium bacteria used to infect the mice were varied in which set of up to seven effectors they expressed. The authors showed how the set of sequences expressed rendered the pathogen capable, less capable, or incapable of transmission to a new host and more or less proficient at causing lethal damage (10). Unfortunately, there are more than a few papers declaring a gene product a "virulence factor" after experiments show a "decrease in virulence" following deletion of the gene, though no mechanism can be inferred. In the absence of adequate controls, the gene product in question may simply be necessary to the normal functioning of the organism without necessarily affecting the host.
(i) When "virulence factors" are not sequences of concern. The "virulence factor" appellation is rife in the literature. "Factor" covers carbohydrates, lipids, proteins, and combinations thereof, as well as small RNAs. Encoded virulence factors are prima facie candidates for SoCs. However, molecules called virulence factors are not always a threat to a host. Bacterial siderophores are called virulence factors, but most are scavenging molecules without which the bacterium would perish in any environment where metal cofactors are rare. It makes more sense to designate these "virulence lifestyle" sequences (11), or perhaps "proliferative factors." The less-than-discriminating use of "virulence factor" makes it difficult for investigators to discern what sequences actually harm a host (12). Not all virulence factors are SoCs.
Researcher designations of virulence factors are critical for curators to recognize them, but the less-than-thoughtful use of the nomenclature can create problems for bioinformaticians. An analysis of 2,000 purported virulence factors from over 50 bacterial pathogens found that just 620 were specific to pathogens while 1,368 were common to both pathogens and nonpathogens. The 620 pathogen-specific virulence factors were more likely to reside in pathogenicity islands and be secreted via a secretion system (13). In contrast, the 1,368 "common" virulence factors are probably not SoCs. If put into a reference database of "virulence factors," they would be false positives. An adequate system for categorizing SoCs should recognize these differences.
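The distinction between pathogen-specific and "common" virulence factors can be illustrated with a minimal Python sketch. It is not the method used in the cited analysis (13); the function name, the genome labels, and the toy data are all hypothetical, and the only assumption is that each purported virulence factor has already been mapped to the set of genomes in which a homolog was found.

```python
from typing import Dict, Set, Tuple


def partition_virulence_factors(
    vf_to_genomes: Dict[str, Set[str]],
    nonpathogen_genomes: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split purported virulence factors into pathogen-specific and common sets.

    A factor is "common" if any genome carrying it is a known nonpathogen;
    otherwise it is treated as pathogen specific.
    """
    specific: Set[str] = set()
    common: Set[str] = set()
    for factor, genomes in vf_to_genomes.items():
        if genomes & nonpathogen_genomes:
            common.add(factor)
        else:
            specific.add(factor)
    return specific, common


# Toy, made-up data for illustration only (hypothetical names).
toy_factors = {
    "toxin_A": {"pathogen_1"},
    "siderophore_B": {"pathogen_1", "commensal_1"},
    "adhesin_C": {"pathogen_2", "commensal_2"},
}
specific, common = partition_virulence_factors(
    toy_factors, nonpathogen_genomes={"commensal_1", "commensal_2"}
)
print(sorted(specific))  # factors found only in pathogens
print(sorted(common))    # factors shared with nonpathogens
```

In a reference database built this way, only the first set would be candidate SoCs; entries in the second set are the likely false positives described above.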
(ii) Existing virulence factor data sets and the importance of manual curation of function. Many databases of virulence factors do not curate their sequences according to an established rubric that allows for the extraction of function. The Virulence Factor Database (VFDB) is limited to bacteria pathogenic for humans. The developers eschew manual curation (14). The data set associated with VFDB includes ∼3,400 sequences from ∼21 bacterial species. No justification is given for the presence of constituent sequences. No curations keyed to individual sequences are provided. The Pathogen-Host Interaction Database (PHI-Base) captures the genetics of pathogen-host interactions from the primary literature along with some functional details, but it principally notes changes in virulence that accompany genetic variants. The effects that these parasite sequences have on the host are of secondary importance (15). The same is true of the Victors database (16). A comparison of bacterium-related databases suggests that functional annotation of SoCs is not a significant concern (17). We think that manual curation is required to adequately annotate the consequences that SoCs have on host processes and enable further advances in computational biology.

IDENTIFYING AND ASSESSING SEQUENCES OF CONCERN
There is a chicken-and-egg aspect to identifying SoCs. One must have some idea of what microbial features might be threatening to know what to examine, but it is not until "enough" sequences are perused that the important aspects can be recognized categorically. By reviewing the literature, we discovered sequences that appear important to pathogenesis for parasites of humans, as well as those of animals and plants necessary to human well-being. We have documented over 2,750 of these, which we hope is a fair sample to develop a conceptualization for understanding biothreats. Assessing sequences of concern for their danger in a bioengineering, gain-of-function (GoF) scenario required us to consider two parameters: (i) the effect on the host, including which host processes are manipulated, and (ii) how directly the sequence exerts its effects. For this minireview, we limit ourselves to reviewing functions of SoCs (FunSoCs) from microbes targeting mammals. The FunSoCs are summarized in Fig. 1 and discussed below. Included as supplemental material is a table of short definitions for the FunSoCs (Data Set S1) and a spreadsheet (Data Set S2) with 220 sequence types from 60 pathogenic species (bacterial, fungal, protozoal, viral) annotated with UniProt accession numbers, FunSoCs, and PubMed identifiers to illustrate our curation.
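As a rough illustration of how an annotation row of the kind described for Data Set S2 might be held in software, the Python sketch below pairs an accession number with FunSoC labels and PubMed identifiers and builds an index from each label to its sequences. The class name, field names, accession strings, species names, FunSoC label spellings, and PubMed numbers are placeholders invented for this example; they are not taken from the data set.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SocRecord:
    """One curated sequence of concern (all values below are placeholders)."""
    accession: str                # UniProt accession number
    species: str                  # source organism
    funsocs: Tuple[str, ...]      # FunSoC labels assigned by a curator
    pubmed_ids: Tuple[int, ...]   # supporting literature


def index_by_funsoc(records: List[SocRecord]) -> Dict[str, List[str]]:
    """Map each FunSoC label to the accessions annotated with it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for label in record.funsocs:
            index[label].append(record.accession)
    return dict(index)


records = [
    SocRecord("ACC_0001", "Example bacterium", ("adherence", "invasion"), (1,)),
    SocRecord("ACC_0002", "Example virus", ("immune_subversion",), (2, 3)),
    SocRecord("ACC_0003", "Example bacterium", ("adherence",), (4,)),
]
index = index_by_funsoc(records)
print(index["adherence"])
```

A sequence can carry several labels at once, which is why the index is built per label rather than per record.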
(i) What is the effect of the SoC on the host? (a) Host damage as the sine qua non of pathogenicity. It is generally true that lethal infections are deadly because one or more organs become disabled from cumulative damage. Ascertaining the proximal cause of damage can be problematic. Host damage can be the direct result of the parasite's action on the host, the host's reaction to the parasite, or both. While infectious disease theorists of the 20th century once credited the pathogen with unique disease-causing ability, this is no longer tenable (18-20).
Damage is the hallmark of pathogenicity (21). Since this is the case, "toxins" might be said to occupy the preeminent place among virulence factors since they are among the most damaging of molecules deployed by pathogens. In bacteria, toxins are distinguished from damaging effectors in that the former are capable of mediating their own attachment and invasion into a cell, while effectors must be secreted (22).
The term "toxin" is notably nonspecific and amounts to little more than a verbal tag that a molecule is inimical to the life of one or more taxa. But the taxa susceptible to the toxin need to be understood. Alpha-amanitin, bicuculline, carbon monoxide, chlorine gas, ciguatoxin, cyanide, MARTX from Vibrios, ricin, and sarin have disparate modes of action and are all deadly to mammals if administered appropriately. In contrast, the toxins of toxin/antitoxin (TA) systems are not hazardous for mammals, however they might be administered (23). Of course, toxins do not exhaust the range of damaging biological sequences. The following paragraphs attempt to categorize host damage caused by SoCs.
(94). This class of effectors may require splitting into microbially induced versus microbially provoked host inflammation.
(b) Immune subversion as an essential condition for pathogenicity. Stanley Falkow observed that the avoidance of host defense mechanisms was a feature of disease-causing bacteria (95). Sequences that subvert innate immune pathways are also found in fungal and protozoan parasites and are a universal feature of viruses. Immune systems embody the "wisdom" of hundreds of millions of years of adaptation over which they have had to detect, deflect, and defeat micro- and macroparasites (96-98). More than 6% of all human genes have a role in immunity (99). Immune systems impose layers of molecular and cellular obstacles to thwart invaders that breach epidermal barriers. Parasites survive these host stratagems by employing molecules that mask their presence, mimic and/or misdirect host responses, or simply eliminate immune effectors. Of the SoCs that we documented, ∼60% of the viral sequences and ∼20% of the bacterial and eukaryotic sequences subvert host immune responses.
Deficits in immune detectors and effectors of a host can render commensal symbionts pathogenic and infections with "nuisance" organisms lethal. Subtle changes in the sequence of a single host immune effector molecule can mean the difference between life and death during challenge with a parasite (100). The study of human immune deficiencies shows the critical importance of these components of innate immunity for defense against the specific, usually narrow, set of parasites against which they defend (101-103). Many infections run their nonlethal course according to the life cycle of the parasite when facing an average host immune response. These are sometimes called "self-limiting" infections, but a defect or deficit in a host immune component can abolish the limitation and produce a life-threatening disease.
Of the ∼2,000 parasites that can cause disease in humans, the majority are opportunistic: limited to infecting immunocompromised persons (4,5). The "opportunity" occurs when a proto-parasite encounters an individual whose immune defenses are diminished from (i) loss of barrier function, (ii) congenital immune defects, (iii) infection with HIV, (iv) immune-suppressing pharmacotherapy, or (v) other disease states that alter the homeostasis of the host. These render the host susceptible to microbial parasites that could not successfully establish themselves otherwise. SoCs mediating immune subversion essentially make a host susceptible in the absence of a compromised immune system. Some immune-evading SoCs from Streptococcus are shown in Fig. 2.
Fig. 2 (caption, opening text missing): ... (294,295), and secreted phospholipase A2 (Sla) (296). Neutrophil extracellular traps (NETs) are countered by the Sda1 and SpnA nucleases (264,265). Antimicrobial peptides are inactivated by the secreted streptococcal inhibitor of complement (Sic) and SpeB proteases (200,201). M-like proteins bind host factor H and plasminogen/plasmin, which inactivate host complement components to protect the bacterium (297). Sic protects streptococci from phagocytosis by neutrophils, resists the host complement membrane attack complex (MAC) (70), and counters the antibacterial actions of the host secretory leukocyte proteinase inhibitor (SLPI) (200,201). Host antibodies are destroyed by membrane-associated ZmpC (226) and the secreted IdeS proteases (222) and inactivated by sugar-cleaving EndoS (223). The group B Streptococcus C5a peptidase ScpB is a serine protease and surface invasin (298) that reduces the neutrophil response and bacterial clearance by cutting the chemoattractant C5a (299). The streptococcal complement protector ScpA helps the bacterium resist phagocytosis (183) and also inactivates C5a (300). SpyCEP eliminates the neutrophil chemoattractant IL-8 (230) and other chemokines (225). Note that this figure depicts SoCs found in both group A and group B streptococci for illustrative purposes, but they would not naturally occur together.
(111) (191), or inactivated indirectly, as by CipA from Acinetobacter baumannii, which recruits host plasminogen to the bacterial surface (192). BclA of B. anthracis mediates serum resistance by recruiting factor H, a host complement control protein, to the bacterial surface (193).
4. Resistance to antimicrobial peptides. Host antimicrobial proteins are cationic peptides that interact with the negatively charged bacterial membrane. They can be destroyed by bacterial proteases, including OmpA from Klebsiella (194), ClpX from B. anthracis (195), CPAF from Chlamydia (196), staphylokinase from S. aureus (179), SepA from Staphylococcus epidermidis (197), DRS (198), SspA, SspB (199), SpeB, and Sic from Streptococcus (200,201), and OmpU from V. cholerae (202).
(c) Adherence to the host cell. To affect the host, symbionts need to either secrete toxins that act while the microbe is at a distance from the host cell or contact host cells or tissues directly. This requires specific adhesin molecules that anchor them, however durably, to the host. Toxins also require adhesins to recognize target cells. Adherence can be to specific host protein receptors, to carbohydrate moieties of glycoproteins or glycolipids, to membrane cholesterol, and/or to components of the host extracellular matrix. Such proteins are abundant, and host attachment is often just one of their functions (272,273).

110), vaginolysin from Gardnerella vaginalis
(d) Dissemination in the host. Dissemination factors enable the breaching of host barriers. A breach can happen by proteolytic digestion of tissues or the release of junctional adhesins to allow parasite passage. SoCs that degrade tissue can also be dissemination factors. Examples include ExoS and ExoU from P. aeruginosa (274), InhA from B. anthracis (39, 275), and staphylococcal exfoliative toxins (50, 55, 276).
(e) Host cell invasion. A microsymbiont can "enter" a host cell easily when the host cell is a professional phagocyte, but this happens under conditions unfavorable for symbiont survival. Invasins mediate microbial entry into a range of host cells, including nonphagocytic ones, in ways that allow the parasite a greater probability of reproductive success. Bacterial toxins also possess invasive subunits that enable their entry into host cells; this distinguishes them from effectors, which require a secretion system (22).
(f) Movement in host cell. Movement within a host cell allows a parasite to circumvent host barriers and avoid programmed defenses. Some intracellular bacteria, as well as vaccinia virions, hijack host actin polymerization to propel themselves into adjacent cells. They thus avoid exposure to the hazards of the extracellular milieu (277).
(g) Niche creation in host cells. Some cellular microbial symbionts manipulate host cell processes to create intracellular niches, where they are protected from host destruction and in which they replicate. This has been investigated most thoroughly in Brucella, Chlamydia, Coxiella, Ehrlichia, Legionella, Listeria, Mycobacteria, and Salmonella. SoCs from these bacteria are generally secreted and subvert the normal endosomal and cytoskeletal dynamics of the host cell. Sorting out the mechanisms for these effectors (there are hundreds just in Legionella) is exceedingly complicated, as many are redundant (278).
(ii) How directly does the sequence exert its effect? When considering the ease with which the disease-causing capacity of a pathogen might be enhanced by sequence addition/gain-of-function (GoF), it is important to consider how directly the SoC acts on the host. SoCs that act independently without the need for extra (i.e., secondary or tertiary) sequences would affect virulence more parsimoniously. There are at least four levels of SoC involvement in pathogenesis.
1. Type 1 sequences that directly interact with host molecules to contribute to disease are the most concerning. The SoCs described above (i.e., damage, immune evasion, adherence, invasion, movement, dissemination, niche creation) act directly to produce a specific deleterious effect.
2. Type 2 sequences make or modify molecules that affect the host. These include toxin synthases, enzymes that make capsules rendering bacteria resistant to phagocytosis, and "passive immune evasion" enzymes which alter microbial molecules to protect the possessor from host recognition and/or immune effectors. Examples of the latter include AlmG, a peripheral membrane aminoacyl transferase from V. cholerae that modifies lipopolysaccharide to resist host cationic antimicrobial peptides (279), and Cbu0678 from C. burnetii, which changes the O antigen of lipopolysaccharide (LPS) to decrease immune recognition (280).
3. Type 3 sequences are secretion system components that transport directly acting SoCs to the correct location for function. These include chaperones for the effector proteins.
4. Type 4 sequences are transcription factors regulating the expression of sequences that produce effects directly. While they can be very important for the virulence of a microbe and greatly influence how pathogenic a specific microorganism can be, they might be replaced in a GoF scenario by similar factors.
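The four levels of directness listed above lend themselves to an ordered enumeration, so that annotated sequences can be sorted from most to least direct. The Python sketch below is one possible encoding, not part of the FunSoC schema itself. The class and member names are invented here; AlmG is placed at Type 2 as in the text, SpyCEP is placed at Type 1 as an assumption (it degrades a host chemokine directly), and the chaperone and regulator names are hypothetical.

```python
from enum import IntEnum
from typing import List, Tuple


class Directness(IntEnum):
    """Lower values act more directly on the host (assumed encoding)."""
    DIRECT_EFFECTOR = 1          # Type 1: interacts with host molecules
    MAKES_OR_MODIFIES = 2        # Type 2: synthesizes or alters an effector
    SECRETION_OR_CHAPERONE = 3   # Type 3: delivers directly acting SoCs
    TRANSCRIPTION_FACTOR = 4     # Type 4: regulates expression of SoCs


def most_direct_first(
    annotated: List[Tuple[str, Directness]],
) -> List[Tuple[str, Directness]]:
    """Order (name, type) pairs so the most directly acting come first."""
    return sorted(annotated, key=lambda pair: pair[1])


examples = [
    ("hypothetical_regulator", Directness.TRANSCRIPTION_FACTOR),
    ("AlmG", Directness.MAKES_OR_MODIFIES),
    ("hypothetical_chaperone", Directness.SECRETION_OR_CHAPERONE),
    ("SpyCEP", Directness.DIRECT_EFFECTOR),  # assumed Type 1 for illustration
]
ranked = most_direct_first(examples)
for name, level in ranked:
    print(f"Type {int(level)}: {name}")
```

Using an integer-backed enumeration keeps the ordering explicit while still letting a record store a readable label.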
(iii) What host cellular process is affected? We found it helpful to annotate SoCs with the host processes that they modulate, as these can often be discerned before the biochemical mechanisms are discovered. No fewer than nine aspects of eukaryotic host cell biology are targeted by parasite proteins for manipulation: (1) transcription, (2) translation, (3) the cell cycle, (4) apoptosis, (5) ubiquitination, (6) small GTPase dynamics, (7) cytoskeleton dynamics, (8) endomembrane dynamics, and (9) autophagy/xenophagy. Viruses tend to manipulate the first five processes, while bacteria, particularly intracellular parasites, affect the final six, with overlap at apoptosis and ubiquitination.
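The overlap stated above follows directly from the ordering of the nine processes, as the short Python check below shows. The variable names are ours, and the "first five" and "final six" groupings are tendencies reported in the text, not strict rules.

```python
# The nine host processes, in the order given in the text.
HOST_PROCESSES = [
    "transcription",
    "translation",
    "cell cycle",
    "apoptosis",
    "ubiquitination",
    "small GTPase dynamics",
    "cytoskeleton dynamics",
    "endomembrane dynamics",
    "autophagy/xenophagy",
]

viral_targets = set(HOST_PROCESSES[:5])       # first five processes
bacterial_targets = set(HOST_PROCESSES[-6:])  # final six processes

# Preserve the original ordering when reporting the shared processes.
overlap = [
    process for process in HOST_PROCESSES
    if process in viral_targets and process in bacterial_targets
]
print(overlap)  # ['apoptosis', 'ubiquitination']
```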

DISCUSSION
Gauging the risks of an emerging pathogen strain or one created through microbial engineering (accidental or otherwise) requires a good comprehension of the pathogenic possibilities of SoCs from natural parasites of humans and livestock. An assessment of existing controlled vocabularies revealed a gap for sequences from nonviral parasites. We documented the role played in disease of over 2,750 parasite proteins from thousands of papers. These were annotated with the FunSoC schema, which categorizes their host-affecting features. The 220 sequences mentioned in this text are provided with full annotations in Data Set S2 in the supplemental material, with definitions provided in Data Set S1.
FunSoCs are tidy enough for human comprehension. For a given SoC, they provide a quick assessment for ∼30 host-affecting functions. However, they are insufficiently granular for capturing the molecular details necessary for a comprehensive appreciation of function. We think that these details are better understood with a new adjunct to GO, Pathogen Gene Ontology (PathGO). This resource is being developed by a group at the Johns Hopkins University Applied Physics Laboratory and consists of ∼180 terms (https://github.com/jhuapl-bio/pathogenesis-gene-ontology). These are being rooted in biological process and molecular function terms of the Gene Ontology resource (281,282). We have been suggesting terms and contributing annotations during development. Data Set S2 features a preview of PathGO terms in column F, along with the relevant PubMed ID accession numbers as citations. PathGO will be described in a future publication.
(i) The utility of gain-of-function experiments in microbial pathogenesis. Sometimes eliminating a bacterial sequence suspected of involvement in pathogenicity has no effect. Legionella pneumophila exhibits so much functional redundancy in its effectors that the loss of one or two sequences of a certain type may not affect the phenotype (283). Investigators of bacterial adhesion face a similar situation when the suspected adhesin originates in a microbe with multiple ways of associating with a target cell. Researchers circumvent this by studying the adhesin in the background of a specially selected "nonadherent" bacterium (284-289). Experiments in which a sequence "adds" virulence to commensals or avirulent microbes are more interpretable than attempts to ascertain virulence by subtraction from a pathogenic background. The former involves a GoF for the avirulent microbe.
Only a few efforts to make bad bugs worse intentionally have been described (290). However, there are hundreds of publications relating the expression of one or more sequences from an infectious parasite in a heterologous organism. Two dozen of these are noted in column E of Data Set S2. Altered organisms typically display a new property consistent with the suspected pathogenic function of the sequence in the original organism. These GoF experiments are illuminating but can also be problematic (291,292). The role that a sequence plays in the pathogenicity of a microbe can depend on other proteins and/or the timing of its expression. Simply expressing the sequence in another microbe, even a similar one, is no guarantee that it will perform similarly. The question can be settled only empirically within the limits of the model. The most dramatic example of a GoF experiment with biothreat implications is the notorious mouse interleukin-4 (IL-4) expression in Ectromelia virus that was astoundingly lethal in even vaccinated animals (293). An intriguing bacterial example involves the secreted protease SpyCEP of group A Streptococcus. When the nontoxic SpyCEP was expressed in the nonpathogenic bacterium Lactococcus lactis, it rendered the cheese-making firmicute capable of infection in a mouse leg wound model. The SpyCEP protease degrades the chemokine interleukin-8, which host neutrophils use to coordinate their defense, "sniffing out" bacteria within infected tissues. Interruption of this coordination produced a systemic disease that had lethal consequences for the host within 24 h of inoculation (230).
(ii) Recognized criteria for sequences of concern improve biosecurity. For those worried about either the accidental engineering of pathogens via synthetic biology or the production of bioweapons with enhanced efficacy, a concerning sequence is one that, when transferred to a different microbe, increases the ability of that microbe to damage a susceptible host, increasing the pathological consequences of infection. But, as the cases of SpyCEP and murine IL-4 demonstrate, the disease-causing properties of microbes have interesting dependencies that cannot be understood in the absence of experiments. We think that the criterion of enhanced pathogenicity upon expression in a heterologous nonpathogen is a good starting place for identifying SoCs, but most will not be discovered through such GoF experiments. Our annotation project has demonstrated that there are thousands of microbial sequences that can reasonably be assumed to enhance the pathogenic ability of a heterologous microbe if transferred. In such cases, the disease-causing properties of these sequences are described in the context of the original pathogenic organism and not in a heterologous, nonpathogenic microbe. We assume that these sequences may retain their properties if transferred to a similar microbe. At the very least, it does not seem responsible to assume that they would be innocuous. Documenting these sequences enables them to be recognized via bioinformatics and thus improves biosecurity for those involved in the manufacture of synthetic nucleic acids (2).
Toxins and microbial effectors that damage the human host are of greatest concern. Among these, SoCs that provoke organ failure have the most severe consequences. Next in importance are sequences that subvert host immunity. Noting the host cellular process(es) with which a SoC interacts and how directly it affects host molecules allows a better understanding of its role in microbial pathogenesis. Formalizing these criteria improves recognition of SoCs from the literature, provides the means for distinguishing them by function, and permits the reporting of these functions in bioinformatic applications. We think that the FunSoC vocabulary and data sets annotated with it can be a resource for computational epidemiology, microbial genomics and forensics, DNA synthesis screening, human disease modeling, and biosecurity assessment.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.