A Gene-Based Algorithm for Identifying Factors That May Affect a Speaker’s Voice

Over the past decades, many machine-learning- and artificial-intelligence-based technologies have been created to deduce biometric or bio-relevant parameters of speakers from their voice. These voice profiling technologies have targeted a wide range of parameters, from diseases to environmental factors, based largely on the fact that they are known to influence voice. Recently, some have also explored the prediction of parameters whose influence on voice is not easily observable through data-opportunistic biomarker discovery techniques. However, given the enormous range of factors that can possibly influence voice, more informed methods for selecting those that may be potentially deducible from voice are needed. To this end, this paper proposes a simple path-finding algorithm that attempts to find links between vocal characteristics and perturbing factors using cytogenetic and genomic data. The links represent reasonable selection criteria for use by computational by profiling technologies only, and are not intended to establish any unknown biological facts. The proposed algorithm is validated using a simple example from medical literature—that of the clinically observed effects of specific chromosomal microdeletion syndromes on the vocal characteristics of affected people. In this example, the algorithm attempts to link the genes involved in these syndromes to a single example gene (FOXP2) that is known to play a broad role in voice production. We show that in cases where strong links are exposed, vocal characteristics of the patients are indeed reported to be correspondingly affected. Validation experiments and subsequent analyses confirm that the methodology could be potentially useful in predicting the existence of vocal signatures in naïve cases where their existence has not been otherwise observed.


Introduction
Aside from diseases that affect the biological structures and processes involved in voice production, myriad other factors are known to influence voice. Some simple examples include age, exhaustion, smoking, which often makes the voice sound hoarse, or alcohol, which makes the voice sound slurred. The ensuing changes in the voice signal can be thought to be the "biomarkers" that give us information about the corresponding causative factors and allow us to infer their nature through voice analysis. Such relationships form the basis for artificial-intelligence (AI)-based voice profiling techniques that attempt to deduce a speaker's bio-relevant and environmentally related parameters from voice. However, virtually all research on voice profiling, diagnostics, and biometrics is currently predicated on clinically observed or statistically inferred relationships between changes in voice and the corresponding factors that are thought to cause them. The relationships that are chanced upon in this manner provide the basis for building predictive AI (machine-learning-or rule-based) mechanisms that can deduce the underlying factors that potentially influence voice through voice analysis.
For example, it is known that smoking affects voice. To establish this, a humanobservation-based approach would be: (a) an audiological one based on hearing the voices of smokers to determine if they deviate from those of non-smokers in an acoustic sense, and/or (b) a visual one where an analyst studies the spectrogram (or some other visual representation) of the speech signal to find patterns that distinguish one class of recordings from another. The spectrogram in this case is a "feature representation". A statistical approach, on the other hand, would gather examples of speech recordings from people who smoke and those who do not, extract feature representations from the recordings, and find significant differences in the statistics of these features obtained from the two sets of recordings. Alternatively, a classifier model may be trained to discriminate between voice samples from smokers and non-smokers. If high test accuracies are achieved in this task, the existence of a biomarker for smoking in voice is indicated. This is a purely data-driven approach for establishing the existence of biomarkers in voice.
The problem with these approaches is that neither is scalable. The number of factors that can influence the human persona is virtually infinite. Human observations are limited to the effects that are perceptually discernible in voice, and data-based discovery is confined and limited by the availability of representative data. This paper provides a more formal methodology for establishing the existence of biomarkers and for identifying which factors are likely to affect voice and which are not. The methodology is based on genomic considerations, as explained below.
Before we proceed, however, it must be noted that this is not an algorithm for computational biology research. Its use to aid biological discovery or establish biological facts in its current form has not been tested. It is only meant to establish the tentative links that are needed to justify computational profiling efforts.

A genomic-Based Approach to Detect the Existence of Biomarkers
The working hypothesis for this paper, and one that has also recently been proposed in the context of voice profiling [1], is that if a given factor exerts an influence on the speaker, and if pathways of biological effects can be traced from that influence to the speaker's voice production system, then voice must be affected (and must carry biomarkers for the factor). The methodology proposed herein is a literal test of this hypothesis in that it traces biological pathways from cause to effect to establish the existence of biomarkers.
For this, we begin with the genetic underpinnings of human vocal capabilities. In this context, it is important to differentiate between voice production and speech production. The former refers to the production of acoustic energy in any form within the vocal tract, and the latter refers to the modulation of the acoustic signals thus produced to form words and sentences in a language used for interpersonal communication. We will use the term "vocal production" to refer to both.
Vocal production in humans is a complex and multifaceted process and involves interactions between multiple genes and environmental factors. The genetic basis of vocal production is not fully understood. Nevertheless, a number of genes have been found to be involved in the process. Some that are now known to influence vocal production include:

1.
FOXP2: This gene codes for a protein called "Forkhead Box P2", which is involved in the development and function of the brain, including the areas responsible for language and speech.

2.
TAFT: This gene codes for a protein called TAFT1, which is involved in the development and function of the larynx, a structure in the throat that is involved in vocal production.

3.
OTOF: This gene codes for a protein called Otoferlin, which is involved in the development and function of the auditory system, including the inner ear. Feedback from this system greatly influences vocal production.

4.
MYO15A: This gene codes for a protein called Myosin XVa, which is involved in the development and function of the auditory system, including the hair cells in the inner ear.

5.
SEMA3A: This gene codes for a protein called Semaphorin 3A, which is involved in the development and function of the auditory system, including the auditory nerve.
Recent efforts to identify and delineate the genes responsible for functional speech in humans have especially highlighted the importance of FOXP2, one of the protein-coding genes mentioned above. It is involved in a variety of biological pathways and cascades that are thought to regulate language development. It is autosomal dominant, and mutations in it cause speech and language disorders (OMIM: SPCH1). In this paper, we choose FOXP2 as an example gene to work with. This choice was only made for illustrative purposes. The methodology presented in its context is itself a generic one and can be applied to any of the genes listed above (and possibly others that exist) whose functions are relevant to the analysis at hand.
For illustrative purposes, we proceeded with the broad and simplifying assumption that any influence on speech and language is ultimately the phenotypic expression of FOXP2. The objective was thus to: • Formalize the methodology to find a link between an influencing factor and this gene; • Validate the methodology; • Demonstrate its predictive potential.
To accomplish these goals, for simplicity, we chose the example of a category of medical conditions for which the underlying genetic causes are known; the effects of the conditions are observed and reported in medical literature, and the effects can involve problems with vocal production.
The medical conditions we choose are chromosomal microdeletion syndromes, which result from the deletion of specific genes in specific cytogenetic locations on human chromosomes. We propose an algorithm to find a link between the genes in the regions of microdeletions, to FOXP2. The connections in these links are derived using a path-search algorithm applied to a graph composed from known biological pathways that involve these (and other) genes. The "strength" of these connections is then defined in terms of the characteristics of the linkage discovered. In the validation stage, our goal is to show that there is a direct correlation between the strength of linkages found and the extent of vocal symptoms experienced by the affected individuals.
Before we proceed with the algorithm, in the paragraphs below, we first provide a working categorization of speech disorders, as reported in various medical literature. This is necessary for the clarity of the results presented later in this paper.

Anomalies in Speech Production
From a bio-mechanical perspective, human speech is the result of two complex processes that happen simultaneously: one that produces sound-the pressure wave that we sense as the voice signal-and another that modulates this signal (through articulator movements) to produce speech, thus altering the voice signal's frequency characteristics and shaping it into sounds with unique identities that are uttered sequentially to form words and sentences in a language. The overall process of voice and speech production is driven and controlled by neuromuscular and cognitive factors to different degrees. It is also moderated to different degrees by feedback obtained through auditory pathways. Generally, diseases that affect these functions, naturally also influence speech and alter the characteristics of the voice signal to proportional degrees. In most cases, when reporting such changes, references to "speech" implicitly include voice, as we see below.
Changes in speech are categorically described in terms of six major aspects of speech production: respiration, phonation, articulation, resonance, evolution, and prosody. In addition, terminology that relates to voice quality is often used to describe speech. Voice quality is, however, a subjective term and comprises many constituents, or sub-qualities (e.g., nasal, breathy, rough, twangy, etc.), that refer to the perceptual flavor of speech (or how a speaker's voice sounds to the listener). Physical anomalies that affect the shape and tissue structure of the vocal tract cause changes in all of these aspects. Speech delays and language difficulties result from cognitive and learning disabilities. These and other intellectual disabilities affect articulation, evolution, and prosody. Their effect on voice also manifests as changes in voice quality. Craniofacial anomalies affect the physical dimensions of the vocal tract structures, often restricting the movement of the articulators as a result and causing speaking impairments. Motor problems affect articulation, phonation, and respiration. These cause speech aberrations and also affect voice quality. Hearing problems disturb the feedback mechanisms involved in controlling speech production and often lead to difficulties in prosody and articulation of speech.
In this paper, we do not focus explicitly on voice acoustic or quality characteristics, focusing instead on problems with speech (that subsume voice characteristics to some extent) as described under the OMIM (referring to the catalog Online Mendelian Inheritance in Man: https://www.omim.org/ (accessed on 21 September 2022)) category "Speech and Language disorders (SPCH1)". Even this category, however, is too broad and encompasses a wide range of speech problems, such as delays in acquisition of speech abilities, retardation in speech development with age, speech anomalies resulting from language delay, expression, and articulation.
For the purpose of this paper, it is necessary to make finer distinctions between these categories. The problem in doing so is that the language used in the literature to describe voice-and speech-related problems in the context of genetic syndromes is not standardized. For example, the terms "speech disorder", "speech disturbance", "speech anomalies", "speech aberrations", and "speech impairment" may each refer to a range of symptoms that may be overlapping to various degrees. For the purpose of this paper, it is therefore useful to map the broad range of speech problems into the following categories that are sufficiently discriminatory in terms of the different aspects of speech production mentioned above while still being limited in number:

1.
Absence of speech: Phrases (in clinical/scientific literature) referring to (a) no development of speech capabilities, (b) no expressive speech, which is mostly limited to vocalizations, or (c) almost absent speech with a severely limited vocabulary (0-4 words).

2.
Apraxia: Phrases referring to difficulty using language correctly while speaking, leading to speaking and communication difficulties.

3.
Delayed speech: Phrases referring to developmental delay, the retarded development of the ability to speak, or the retarded acquisition of language skills and communication skills (ability to use a vocabulary correctly to communicate in a cogent manner).

4.
Dysarthric speech or dysarthria: Phrases referring to speaking problems resultant from damaged, paralyzed, or weakened muscles of the articulators caused by motor problems. Dysarthria results in slurred words, poor phonation, etc. The speaker uses vocabulary as in normal speech but finds it difficult to move the articulators (tongue, lips, jaw, etc.) correctly to form the proper sounds to utter the words.

5.
Idiosyncratic speech: Phrases referring to poor conformance to cogent language or incoherent language with articulation abnormalities. 6.
Impaired speech: Phrases referring to poor articulation and phonation, as well as difficulties that result in sparse and disfluent speech.

Methodology
As mentioned earlier, we consider the example of syndromes resulting from chromosomal microdeletions and focus on their symptoms relating to speech abilities.
Chromosomal microdeletions are structural anomalies of chromosomes in which small sections of a chromosome are deleted or missing. The loss of the specific set of genes from the deleted section often results in phenotypic changes. An implicated gene is a gene in the deleted region of a chromosome that is known to cause much of the observed effects of the syndrome in affected individuals. These are identified through microarray and other studies. For most microdeletion syndromes, some such genes have been identified and reported in the medical literature. We use this information from the medical literature as-is in the sections below.
Although the human genome is very large in comparison to a typical chromosomal microdeletion region, microdeletions often cause serious problems. In fact, only a small set of deletions are compatible with life or fetal survival. This set continues to expand with the addition of newly discovered deletions in surviving individuals who have the means to reach genetic testing facilities. However, it is still a very small set and can be exhaustively studied. Most known deletions are well documented in the literature, both from the genetic and medical perspectives. Information about the genes associated with them is readily available through well-curated publicly accessible repositories. Thus, they are good example cases for this paper.
The methodology proposed herein analyzes ensembles of biological pathway chains, each of which connects a specific gene in the cytogenetic region of chromosomal microdeletion to the FOXP2 gene. A "biological pathway" here is defined as in standard terminology, referring to a physiological process at the cellular level that is enabled by the action of multiple genes that perform specific functions within the process.
We define a pathway chain as the sequential linkage of pathways where links between pathways are shared genes (implicitly meaning that the molecules resultant from genes are shared-we will use the term "gene" with this implicit meaning in the context of pathways for generality going forward). For example, consider a pathway that signals for a cell to stop dividing when an injury to the nuclear DNA strand is being repaired. It may involve the coordinated chemical action of molecules that are formed by the transcription of multiple genes that perform different functions. It would also be connected to a repair pathway by necessity. Thus, the two pathways can be considered to be links in a single pathway chain (they must share some genes in a functional sense) that perform the function of relaying messages from one pathway to another. Such genes may also perform other functions that are essential to both pathways.
The hypothesis we make here is that for a gene, if a chain from its pathway(s) (i.e., from the biological pathways it contributes to directly) extends to pathways that influence voice production, then the phenotype resultant from the absence or aberrant functioning of the gene can be expected to include anomalies in speech production and voice characteristics.

Voice Chains
Our definition of a voice chain extends our definition of a pathway chain in that the head of the chain must now necessarily be a pathway that includes a gene that influences voice or speech production, while the termination of the chain is not necessarily a biological pathway but could include any given set of genes with a common characterization (such as a common cytogenetic location or function).
In this paper, the voice-related gene chosen is FOXP2, but in other analyses, voice chains could involve other genes (e.g., as in [2]) without loss of generality. The terminal link in the chain is taken to be a genetic microdeletion syndrome. A voice chain thus establishes a relationship between an influencing factor-a genetic microdeletion syndrome in this case-and a corresponding effect on voice/speech production. We refer to a voice chain that includes a sequence of α pathways from the microdeletion region to the FOXP2 gene as a level-α voice chain ( Figure 1). The specific genes on the microdeletion that link it to the voice chain are referred to as "chainlink" genes. We represent the set of chainlink genes that connect a microdeletion to level-α voice chains as the chainlink set V Nα . Since we aim to analyze the genetic basis of the effect of microdeletions on voice, these chainlink sets are the focus of our analysis. Note that the subscript N in V Nα denotes the manner in which overlaps or linkages between biological pathways are defined. This subscript is fixed for the purpose of this paper, for which the exact manner in which pathway overlaps are defined is described in Figures 1 and 2 below. We however leave the N in place to facilitate future differentiations and variations of the proposed algorithm based on how the pathway intersections (or unions) may be defined. In this reversed perspective ideogram, a link depicts a pathway, the black dots on it are genes, and the chromosomes that they belong to are shown as rods where relevant. The lines connecting genes are only meant for visual clarity. Chains are formed with respect to genes on a chromosome (a microdeletion region in this case, shown shaded in yellow). In this ideogram, Gene-1 and FOXP2 lie on the same pathway, contributing to a level-1 voice chain. In a level-2 chain, FOXP2 and the microdeletion region are on different pathways, but the pathways share a set of genes. Gene 2 and Gene 3 have level-2 voice chains.
In order to trace the genetic links between a microdeletion syndrome and voice, we first attempt to identify voice chains of different lengths that link to the genes in the microdeletion region. For this, we must find voice chains that link the FOXP2 gene to the syndrome, and identify the specific genes from the syndrome through which they are linked. We do so using the graph-search algorithm described below.
From our perspective, a biological pathway B is represented by the set of genes it involves: B := {g : g is a gene in the specified pathway}. Two pathways, B 1 and B 2 , are linked if there are genes that are common to both pathways, i.e., B 1 ∩ B 2 = ∅. Thus, the set of all pathways can be represented as a graph where the nodes are biological pathways, and two nodes are linked only if the corresponding pathways have common genes, as illustrated in Figure 2a.
A pathway chain is any non-repeating sequence of pathways B 1 B 2 B 3 · · · B N such that B i ∩ B i+1 = ∅ and B i = B j for i = j, i.e., where every pair of adjacent pathways has common genes, and there are no closed loops in the chain. In terms of the graph (see Figure 2a), a pathway chain is any path between any two nodes in the graph. A voice chain V is any chain V = B V B 2 B 3 · · · B N S where the head node B V (and the head node alone) is a pathway that includes the FOXP2 gene, i.e., FOXP2 ∈ B V , and the terminal node S is a set of genes with common characterization, as mentioned earlier. The length of the chain |V | is the number of nodes α in the chain, not counting the terminal node S. For the purpose of this paper, we will assume S to be the set of genes in a microdeletion region associated with a syndrome. Thus, S = {g : g is a gene in the microdeletion region}. All voice chains of length α form the set of level-α voice chains, and the chainlink genes that connect S to the level-α voice chains form the chainlink set V Nα . Explains what a chainlink gene means in the context of a microdeletion region and the target voice gene (FOXP2 in the example used in this paper). To compose voice chains, the set of genes of interest (e.g., from a microdeletion region) is added to the a graph as a node (shaded pink). The "chainlink" genes that link the set to the graph are also shown. The topmost (left) node represents a biological pathway that contains the FOXP2 gene (g 1 , highlighted in yellow). (c) A Level-1 voice chain is the edge shown in red. The "chainlink" gene g 2 that links the set (in pink) to the voice chain is also shown.
To find voice chains of the form B V , B 1 , · · · S arising from the microdeletion region S (which we will refer to as the "syndrome" for brevity), we introduce the microdeletion region in the pathway graph ( Figure 2b). Voice chains are now the paths from B V to S (Figure 2c,d). A breadth-first algorithm, described in Algorithm 1, is used to extract the chainlink sets V N α for voice chains of multiple levels. The outcome of the algorithm is the set of chainlink genes V Nα [S] that connect each syndrome S to voice chains of level α, for 1 ≤ α ≤ 2. We restrict ourselves to chains of lengths of up to 2 since, at greater lengths, the chained influences cannot be disambiguated, as indicated by prior studies in the (highly related) context of protein-protein interactomes, e.g., [3]. Another reason for restricting ourselves to level 2 chains is that for the specific example chosen in this paper, there are not enough data that allow us to build deeper chains meaningfully (without resorting to self-loops, which may lead to incorrect conclusions).

Algorithm 1:
Pseudocode for a breadth-first algorithm for computing the set of chainlink genes that form level 1 and level 2 voice chains for FOXP2.

Ensemble Analysis
In the methodology we propose, for any microdeletion region S, we derive the set of chainlink genes within it for which α-level chains exist. The size and composition of this set can then be used in conjunction with the level of the voice chain to indicate the effect on voice (in a later analysis). In general, we can work with any level-α voice chains in such an analysis; however, we restrict ourselves to α = 1 and α = 2.

Analysis
V Nα , where α = 1, 2, were computed for a total of 82 microdeletion syndromes of chromosomes 1-20/22/X,Y. Genomic information, including gene names, was obtained from the HUGO Gene Nomenclature Committee's (HGNC) human genome database, comprising 42,764 gene symbols and names and 3245 gene families and sets as of the time of conducting this analysis. Information about the phenotypes and the specific genes implicated in a syndrome was obtained from a survey of the current literature on medical genetics and genomics and from the Online Mendelian Inheritance in Man (OMIM) repository for authoritative information about human genes and genetic phenotypes.
The FOXP2 gene chosen for this analysis has been strongly implicated in speech and language disorders [4,5], including monogenic speech disorders. The cytogenetic location (chromosome locus) of this gene is 7q31.1. Mutations in this gene are known to cause speech and language disorder Type 1, also called "Autosomal dominant speech and language disorder with orofacial dyspraxia". The phenotype description and known molecular basis for this disorder can be found under OMIM entry SPCH1:602081. The FOXP2 gene encodes for the protein "Forkhead Box Protein P2" [6]. This protein is a transcription factor; it controls the activity of other genes. It binds to the DNA of the genes that it controls through a region known as a Forkhead Domain. It thus plays a critical role in several protein-coding and other biological pathways and has been well studied [7]. A more detailed summary of this gene can be obtained from the Human Protein Atlas [8].
The ensemble of pathways used for this analysis was obtained from the Carcinogenic Potency Database (CPDB), described on its website as "a single standardized resource of the results of 45 years of chronic, long-term carcinogenesis bioassays". Its current database of human biological pathways contains 4319 pathways and their gene compositions. This database has been used extensively in the medical literature and was chosen in this case for illustrative purposes since there is (importantly) no inherent bias towards the speech phenotype in it. In this database, there is only one pathway that contains the gene FOXP2. This is the Adenoid Cystic Carcinoma (ACC) pathway, which contains 63 genes, listed below for reference: Gene membership of the ACC pathway: Table A1, given in Appendix A, documents the voice chains found for a set of 75 documented microdeletion syndromes. This range excludes chromosome 21, for which sufficient documentation was not found in the literature. Only voice chains up to level 2 are shown in this table and used in the analysis presented in this paper. This is sufficient to demonstrate the viability of the methodology for the discovery of voice chains proposed in this paper. The entries in the rows and columns of this table are explained in detail in Appendix A. Table 1 summarizes some of the information in Table A1 to help understand the analysis given in the next section. The information given in Table 1 includes, for each syndrome listed in it, the corresponding implicated genes that are also discovered to be chainlink genes by the algorithm proposed in this paper; the overall counts of level-1 and level-2 chainlink genes for each syndrome, along with the number of additional pathways they collectively connect to (in parentheses); and the corresponding phenotypic effects on speech that have been reported in the scientific literature. Table 1. Information about level-1 and level-2 voice chain ensembles for 76 chromosomal microdeletion syndromes. For each syndrome, the corresponding implicated gene (culled from medical literature, with details given in the Appendix A) that was also discovered to be a chainlink gene by the proposed algorithm in either level-1 or level-2 chains is listed in the second column. The number of chainlink genes (i.e., genes that have chains that connect to the ACC biological pathway of FOXP2) is shown in the third and fourth columns for level-1 and level-2 chains, respectively. The total number of pathways that they collectively influence is indicated next to each count, in parentheses. The observed phenotypic effects on speech are given in the last column. Del: delayed speech; Imp: impaired speech; Norm: normal speech; Abs: absent speech; Apr: apraxia; Dys: dysarthria; Idio: idiosyncratic.

Inferences
A wealth of conclusions can be drawn from Table 1 (and from its more detailed version, Table A1 in Appendix A). However, we focus only on those that help validate the usefulness of the proposed algorithm.

Voice Chains as Predictors of Speech Characteristics
Of the 76 syndromes in Table A1, voice chains were found to exist for all. By our hypothesis, this would imply that in all cases, there is a potential for voice to be affected. The syndromes 15q11-q13 and Xq28 have two versions each, divided in the medical literature based on symptoms, rather than gene composition of the microdeletion region. We can therefore combine them for analysis, leaving us with 74 syndromes to be analyzed. For the syndrome 16q22, no information about the speech issues was found in the medical literature. Only the remaining 73 syndromes are considered in the analysis below.
The incidence of speech pathologies (including all forms of pathologies) among the general population is reported to be about 5% [65], and between 2.3% and 24.6% among children [66]. Of the 73 syndromes, 17 syndromes had both level-1 and level-2 voice chains, while 56 had only level-2 chains. The occurrence of speech aberrations was reported for all 17 syndromes with level-1 chains and for all but 6 of the 56 syndromes with only level-2 voice chains. Thus, voice chains correlate highly with the existence of speech anomalies.

Voice Chains as Information-Carrying Entities
Let us study how voice chains correlate with the presence or absence of specific voice problems. Such correlations would show that voice chains carry information about how the voice may be affected. This information is expected to be coarse-grained since we only take the presence or absence of any gene into consideration and consider no other cytogenetic information related to it.
From Table A1, we observe the following.

Level-1 chains:
The number of level-1 chainlink genes is limited to 1 or 2 in all cases and is not amenable to statistical analysis. However, we make the following observations: 1.
Level-1 voice chains co-occur with speech problems 100% of the time.

2.
For all instances where level-1 voice chains are present, severe symptoms occur 100% of the time (impaired, delayed or absent speech).

3.
For all instances of syndromes with no effect on speech (i.e., where normal is not just one of a range of other speech symptoms), level-1 chains are absent 100% of the the time.

Level-2 Chains:
Level-2 voice chains are present in all cases and co-occur with speech disorders in all but 6 cases; thus, in only 6 cases has speech been reported to be normal. Therefore, level-2 voice chains co-occur with speech problems 91.8% of the time.
We note that a syndrome may have level-2 voice chains through many chainlink genes, which could number in the tens or even hundreds. Each of the chainlink genes could, in turn, also be associated with multiple other pathways, in addition to the one connecting it to FOXP2. We refer to the total number of pathways that include the chainlink genes of a syndrome as its "chainlink connectivity". Table 2 presents some statistics of syndromes, chainlink genes, and chainlink connectivity associated with speech disorders of different severity.The problems considered are: absent speech (the most severe symptom), impaired speech (a symptom that is less severe than absent), delayed speech (a cognitive symptom that is also less severe than absent and comparable in severity to impaired speech-a physical symptom), dysarthric speech (a symptom related to physical issues), and apraxic speech (due to CNS disorders; this is less severe compared to absent speech and often subsumes idiosyncratic speech). Each row of the table represents one type of speech problem and shows the number of syndromes associated with it, the mean and median of the counts of chainlink genes for the syndromes, and the mean and median of the chainlink connectivities of the syndromes. From an inspection of Table 2, a distinct pattern emerges. Rank ordering the symptoms by ascending order of the means of the counts of chainlink genes, we see that their connectivities also fall in almost the same order: normal(32, 404) < apraxia(43, 675) < dysarthria(43, 725) < impaired(46, 734) (1) < delayed(47, 682) < absent (60,902) This rank ordering is consistent with the rank ordering of symptom severity based on the descriptions in the medical literature. In general, statistically speaking, the number of chainlink genes and the chainlink connectivity both appear to relate monotonically to the severity of the speech disorder.  Figure 3 shows scatter plots for counts of chainlink genes, chainlink connectivity, and a scatter of chainlink gene counts vs. normalized (per-chainlink-gene) chainlink connectivity for different severities of voice problems. Once again, it is apparent from the figures that the distributions of chainlink counts and chainlink connectivity is predictive of the type of speech problem. In particular, as is evident from Figure 3c, the distribution for normal speech stands out distinctly, as does that for absent speech, although the latter is not as distinctive as the former. Among the other levels, the distributions for apraxic and dysarthric speech appear similar, and so also do those for impaired and delayed speech appear similar.
In order to quantify these differences, we modeled the distributions of chainlink counts and chainlink connectivities for the different severity levels. These distributions have the general characteristics of over-dispersed Poisson distributions and can be modelled as Conway-Maxwell-Poisson (CMP) distributions [67]. The CMP distribution is a twoparameter exponential-family PMF over non-negative integers that takes the form where λ, ν > 0 are the parameters of the distribution. Given a set of integers, λ and ν can be obtained through a maximum likelihood estimator [68]. Figure 4 shows the maximum likelihood estimates of the probability distributions of chainlink counts ( Figure 4a) and connectivities (Figure 4b) for syndromes associated with speech problems of different severity levels. These, too, follow the visible trends of Figure 3, where the distributions of both chainlink counts and chainlink connectivities for syndromes associated with the two extreme conditions, normal and absent speech, are distinct from those for other types of problems.
To quantify the differences in the distributions, we define the code distance between two sets of integers C i = {n 1 , · · · , n i } and C j = {m 1 , · · · , m j } as the excess number of bits required to encode them if each set is encoded using the optimal code for the other set rather than itself.
where P i () and P j () are the estimates of the PMFs for C i and C j , respectively. In our case, we choose the maximum likelihood estimates of the CMP distributions for the sets to compute this metric. Table 3a shows the code distances between the chainlink counts for different types of speech problems. Table 3b shows the same for their chainlink connectivities. In both cases, we observe that the distributions for normal speech stand clearly apart from those for the other types of speech problems. The distributions for fully absent speech, too, are distinctive from those for other problem types. Among apraxic, dysarthric, impaired, and delayed speech, the differences between the distributions of adjacent degrees of severity is minimal; however, the distances show a distinct increasing trend with increases in the degree of impairment.   Overall, from the above analysis, the properties of the level-2 chains of a syndrome appear predictive of the degree of the speech problems associated with it. Our analysis has considered the chainlink counts and connectivities indpendently, and each of them shows this behavior. A joint analysis of both may show stronger dependencies.
Most importantly, note that in all of the analysis above, we have ignored the secondary effects of other issues, such as intellectual disability and craniofacial anomalies, a highly simplifying assumption. A more correct information measure that takes these into account is expected to show even stronger relationships between the level and degree of connectivity of a syndrome to the FOXP2 pathway and its effect on speech.

Why Are There No Instances of Missing Voice Chains?
Are voice chains redundant? The fact that there are no missing voice chains is easily explainable. The reason is linked to the size of the syndromic regions. To understand this, consider the following facts.
Our database comprises 4319 unique pathways. A total of 1205 of these pathways are linked to the pathway that carries the FOXP2 gene, and collectively, these include 11,746 genes. Thus, a randomly chosen gene from the entire human genome of 42.7k genes (as in the HGNC Human Genome database) has a 27.5% chance of being on a pathway that links to the FOXP2 containing pathway, i.e., of being a level-2 chainlink gene.
The shortest microdeletion considered (2q23.1) includes 9 genes, each of which has a 27% chance of being a level-2 chainlink gene. The syndrome itself then has a 94.44% probability of having a level-2 voice chain purely by chance. The second shortest pathway includes 26 genes and has a 99.98% probability of having a level-2 voice chain by chance. The remaining pathways are larger (in terms of the number of genes), and it is virtually impossible for them not to have a level-2 chain.
As a result, it is realistic to expect that, as a consequence of the density with which the FOXP2 containing pathway is linked to other pathways, any syndrome arising from genetic aberrations that includes even a moderately sized set of genes will have an effect on voice. It remains a plausible hypothesis that any factor that influences gene function has at least some chance to ultimately affect voice-for example, at least a 27% chance within the boundaries of the example presented in this paper.
The above argument assumes that the genes in a microdeletion region are randomly chosen. The mean of the fraction of genes in a microdeletion that appears in any voice chain is observed to be 28.79% with a variance of 0.014, indicating concordance with the assumption of randomness. A secondary implication is that the likelihood of adjacent genes in the same cytogenetic region to be chainlink genes is independent of one another.

Ancillary Observations
Some important ancillary observations emerge from this study, which may be important to note. These are mentioned briefly below.

1.
For each syndrome, some genes have been identified as largely important-i.e., these are implicated largely for the syndrome's effect on the individual. Of the syndromes for which there is information about implicated genes, we see that in only 8 syndromes (2p16.1-p15, 2q23.1, 9p24.3, 11q23, 13q12.3, 17q23.1-q23.2, 19p13.13, and Yq11), none of the implicated genes appear in the two levels of voice chains shown. In all other cases, the implicated genes impact FOXP2 pathways and are likely to have a bearing on speech anomalies. We have noted earlier that FOXP2 is not the only gene known to be related to voice production. If we had chosen some other gene as an example in this paper (instead of FOXP2), it is likely that the implicated genes for the 8 exceptions mentioned above would appear as chainlink genes (while some others may not). This hypothesis can be easily tested in corresponding experiments.

2.
Identifying candidate genes for further investigation: Using only chainlink genes that appear on level-1 chains as illustrative examples (see Table A1 in the Appendix A for reference), we see that voice chains can be useful in identifying candidate genes for further investigation in the context of speech issues. Some examples are given below. The likely candidates are written in parentheses, while the already implicated genes are indicated in bold: • 1p36 (ARID1A): Although not implicated for this syndrome in studies so far, ARID1A is located in 1p36.11, a region frequently deleted in human cancers [69]. Disruption in its function may lead to the co-occurrence of oncological and speech issues. This hypothesis is verifiable. • 5q35.3 (NSD1): The gene NSD1 appears in a level-1 chain and is also an implicated gene. Ideally, this should not be a candidate for further investigation. However, paradoxically, while effects on speech are expected, the literature reports normal speech for some subjects for this case. This may be a result of biased sampling (the more severe cases may not be conducive to life due to other concurrent severe symptoms, which is a common occurrence in microdeletion syndromes; in some cases, only mosaic individuals survive). This warrants some investigation. • 11p15.5 (HRAS): Although not implicated, and although two studies cited under OMIM: 130650 for this syndrome explicitly mention HRAS as not significant, HRAS has nevertheless been independently found to be extremely significant in RASopathy and cancer studies, e.g., [70]. Its role in this syndrome needs to be re-evaluated given its influence on 347 biological pathways and its strong influence on speech. • 16p11.2 and 16p12.2-16p11.2 (SRCAP): Although not implicated, it connects to only one other pathway in the ensemble, and that is the ACC pathway of FOXP2. The effects on speech are expected to be strong if this gene is aberrant. This gene may be implicated in further investigations. • 17p13.1 (KDM6B): Speech is absent in this syndrome. The gene TP53 is implicated, which also appears at level-1 and is associated with 206 pathways. KDM6B is the only other gene in the level-1 voice chains and connects to only 8 other pathways. It is likely that this gene also plays a strong role in influencing speech and merits investigation. • 17q12 (ERBB2): The gene ERBB2 is associated with 124 pathways. It is a wellknown oncogene [71], in that perturbations in its function have been observed to have deleterious effects. If it is also connected to FOXP2, then its appearance in the voice chain allows a surprising hypothesis-that biomarkers of some oncological conditions may also be present in voice. • 19p13.3 (MAP2K2,UHRF1): MAP2K2 and URHF1 are not implicated. However their appearance as level-1 chainlink genes warrants investigation, especially for MAP2K2, which influences 257 pathways. Prompted by this, a literature search did reveal that MAP2K2 has been implicated in this syndrome recently [72], although this is not on the OMIM records, which were largely consulted for this study. • 22q13.3 (BRD1): The gene BRD1 is not implicated and appears in 9 pathways only, but the effect on speech is severe in this syndrome. This warrants the investigation of BRD1 independently in relation to speech characteristics. A literature search reveals that BRD1 is indeed strongly associated with brain development and susceptibility to both schizophrenia and bipolar affective disorder [73], and consequent effects on speech are highly likely. • Xp11.22 (SMC1A): Although SMC1A is not implicated, it appears in 33 pathways. The speech issues are severe and the gene warrants investigation for this effect. A recent report in the literature has implicated it in severe intellectual disability and therapy-resistant epilepsy in females [74]. The former is known to be associated with severe speech anomalies. • Xp11.3 (KDM6A): Although not implicated, KDM6A warrants investigation. In the literature, it is independently known to be associated with delayed speech and psychomotor development [75].

3.
Expression of speech characteristics: The observation that deletions of genes on all chromosomes ultimately results in the expression of speech anomalies carries significance. From a much broader perspective, this suggests that the effect on speech may be supported by the action of multiple concurrent biological pathways. There may be no single gene or genes (on select chromosomes) that may code for speech capabilities per se, and FOXP2 may be one of a few genes that may consolidate and regulate the speech-and language-related emergent effects. It may be that genes directly code for structural elements in the range of phenotypes, while other properties, such as speech and language abilities, are emergent from the coordination of these (and epigenetic) factors. A more prosaic argument for this can also be presented. Within the ensemble of syndromes analyzed, there are three kinds of of cause-and-effect relationships: (a) syndromes with physical structures of the vocal tract (e.g., craniofacial anomalies that include cleft palate, changes in lip shape, etc.), which adversely affect the biomechanical aspects of voice and speech production, (b) syndromes in which auditory and motor functions are compromised, and (c) syndromes that affect the normal functions of the brain, causing cognitive, learning, memory, and other issues that are, in turn, likely to lead to speech problems. In no case do we see only speech aberrations in isolation of these factors. The associations between speech and other expressed factors have, in fact, been ubiquitously observed, e.g., [48]. This may support the hypothesis that speech abilities are likely to be emergent from an ensemble of factors (including other phenotypes), rather than expressed directly by any "speech" gene.

Conclusions
The hypothesis that the existence of voice chains is correlated with speech characteristics is adequately validated by the statistical analysis presented in this paper. The analysis presented in this paper, in fact, also shows that the level of voice chains is correlated positively with the severity of speech problems. Based on this, a simple information measure has been suggested to rank-order the effects of specific sets of voice chains on speech. We also see how the methodology presented can potentially provide leads to specific genes that might be candidates for further investigation in the context of speech issues and microdeletion syndromes. While the example of chromosomal microdeletion syndromes used for this paper is very specific, the methodology itself may be easily generalized and extended to reveal the potential effects of other diseases with a genetic basis and of other factors that influence gene function in some manner on speech, voice and (in further refinements of the analysis) their specific qualities and characterisitcs. As a specific suggestion, one exercise that would allow for a more comprehensive analysis would be to explore the entire human genome database to identify which genes are connected via voice chains (and to what level), as well as whether or not there have been corresponding effects on voice reported in the biomedical literature. In cases where large amounts of data are available, one could also explore such connections in an entirely data-driven manner, using AI-based biomarker discovery mechanisms. Institutional Review Board Statement: Ethical review and approval were waived for this study since the data used for this research was obtained from public sources. No human subjects research was explicitly carried out for this work.

Data Availability Statement:
The code required to reproduce the results in this paper is archived for public use at https://datadryad.org/ (accessed on 21 September 2022) under the title "Data for Connecting voice profiling to genomics". For each voice chain listed, the first entry denotes the level of the voice chain. Following this is the list of chainlink genes that belong to the corresponding microdeletion region, which are linked to FOXP2 through the corresponding voice chain level.

Conflicts of
• For each gene, the number of pathways that a gene connects to (in general, inclusive of connections to the ACC pathway of FOXP2) is written as its subscript. • All chainlink genes are not named. For brevity, the first 10 genes with the greatest number of links are listed, and the total number of the rest of the genes (total chainlink gene count) is indicated. At the end, the genes with a single pathway link are explicitly listed. They are included in the total chainlink gene count mentioned above. Thus, for example, the level-2 voicechain for the syndrome 2p16.1-p15 has a total chainlink gene count of 17, including all genes that are explicitly listed in the table. • In each row, the genes in the voice chains that have been implicated for the syndrome's phenotypic expression in prior studies are shown in parentheses on the top right.
The genes that are also present as chainlink genes in the voicechains found for the syndrome are shown in bold font.
Second column: • The second column in this table lists the corresponding speech characteristics, with references. Where no reference is cited, the information is found in the OMIM record for the syndrome (where possible, OMIM references are used for brevity). Table A1. Chainlink genes for level-1 and level-2 voice chain ensembles for 76 chromosomal microdeletion syndromes. Each chain forms a link from the gene shown to the ACC biological pathway of FOXP2 and has been automatically derived. The total number of pathways that each gene influences independently is shown as a subscript to its name. Observed phenotypic effects on speech are given in the last column. When references are not cited, the information reflects that in the OMIM records for the syndrome.    Delayed speech [27], Apraxia [28] and Absent speech Delayed speech [34]; Impaired speech [35]  Dysarthric speech, absent speech [36]; impaired speech [37] 11q23. 3