Introduction

Linguistic scholars have traditionally differentiated between two typical properties of human language: combinatoriality and compositionality. Although these terms are not used consistently, combinatoriality generally refers to the ability to combine meaningless sounds into meaningful morphemes and words, whereas compositionality usually refers to the ability to further recombine already meaningful elements (i.e., morphemes and words) into new elements with novel meanings (de Boer et al., 2012; Hockett, 1960). Therefore, whereas combinatoriality is characterized by nonmeaningful components, compositionality is assumed to imply the use of meaningful elements. Both combinatoriality and compositionality confer productivity and flexibility on human languages, allowing them to become potentially open-ended despite being based on a limited number of initial components (Chomsky, 1981; Fitch, 2010; Jackendoff, 2011; Werning et al., 2012). Combinatoriality and compositionality are generally considered universal properties of natural human languages that characterize not only spoken language, but also sign language and different forms of human nonverbal communication, including the expression of emotions via facial and bodily displays (Cavicchio et al., 2018; Sandler & Lillo-Martin, 2006).

In this paper, we review the current state of the art on compositionality in nonhuman animals, with a special focus on gestural communication and multicomponent signal displays in nonhuman primates (hereafter, primates). We trace the evolutionary origins of human compositionality in the multicomponent communication systems of other species to identify possible lines of future research based on the main limitations of current research on this topic. We especially focus on compositionality, rather than combinatoriality, because evidence of compositionality is still scant in primates, likely because compositionality implies the creation of novel meanings (through the recombination of different components) and therefore is not easy to prove. We start our review by discussing the importance of a comparative approach to the study of language evolution, providing evidence of compositionality in the vocal communication systems of nonhuman species. We then describe the current state of the art on compositionality in the vocal, facial, and gestural communication systems of primates and other species and review alternative approaches to study compositionality in primates, which may help overcome some of the current methodological limitations in this research area. In particular, we first focus on the importance of using a multicomponent and multimodal approach to the study of compositionality. We then introduce different approaches that are used to infer the meaning of signals and signal combinations, with a special focus on the use of contextual cues and meta-communication. We further discuss temporal and intentional aspects of compositionality in primates to conclude with possible lines for future research on compositionality, especially stressing the importance of large data-sets and large-scale collaborations to overcome current limitations in the study of compositionality.

Comparative Approach to the Evolution of Human Language

According to some authors, language evolved from scratch in humans and fundamentally differs from other animals’ communication systems, so that a comparative approach would not be informative in searching for the evolutionary origins of human language (Chomsky, 1966; Scott-Phillips, 2015). However, a trait as complex as language is unlikely to have evolved in such a short time only in the human lineage (Pinker & Bloom, 1990), and it is probable that language rather built on traits already present in the common ancestors of other species, including neurobiological substrates (Arbib, 2005, 2017), anatomical structures, such as the vocal apparatus (Fitch et al., 2016; Riede et al., 2005) and cognitive skills (Seyfarth et al., 2005). Therefore, comparisons with other species, and in particular with nonhuman primates, may be very useful to identify the potential precursors to human language (Tomasello, 2008; Zuberbühler, 2005).

To identify the possible precursors to human language, however, researchers focus on different aspects of primate communication, largely depending on which communicative modality is assumed to be the origin of human language (e.g., gestural, vocal or orofacial origin; Slocombe et al., 2011). As a consequence, the majority of existing research on primate communication is unimodal, and researchers usually study one specific modality in isolation, using different theoretical and methodological approaches (Liebal et al., 2022; Slocombe et al., 2011). The majority of studies have been conducted on vocalizations, which are mostly studied in wild and captive monkeys, often with a focus on their functionally referential usage in very specific contexts. Gestures are mainly studied in great apes, mostly in captive settings, with a focus on the signaler and the intentional and flexible use of gestural repertoires across different contexts. Facial expressions are frequently studied in monkeys, also mostly in captive settings, with a focus on the adaptive function of these signals (Liebal et al., 2013; Slocombe et al., 2011; Waller et al., 2016). This means that compositionality in primates is approached differently across modalities, as researchers working on gestures, vocalizations, or facial expressions focus on different aspects of primate communication. Furthermore, our current knowledge of primate communication is still limited, because the majority of studies on this topic have been conducted on vocalizations, and very few studies have investigated the multicomponent or multimodal combination of signals (see 3.1). Moreover, studies on compositionality are largely limited to very few species. This problem affects the whole research area of communication, with 24.5% of the published unimodal studies having been conducted in chimpanzees (Pan troglodytes), 14.6% in rhesus macaques (Macaca mulatta) and 8.8% in common marmosets (Callithrix jacchus), so that most species are severely underrepresented (Liebal et al., 2022).

Current State of the Art in Research on Compositionality in Primates and Other Species

The purpose of the next two sections is to summarize the current state of the art on research into compositionality, separately for the vocal communication system and the gestural and facial communication systems of nonhuman species and, especially, primates. First, we discuss current evidence of compositionality in primate vocal communication, also referring to previous work on nonprimate species, such as birds, whose vocal systems (including aspects of compositionality) have been the focus of abundant research. Second, we review studies on compositionality in gestural and facial communication. As these studies, to our knowledge, have been conducted exclusively on primates, we will also briefly discuss research on humans for comparative purposes.

Compositionality and Other Properties of the Vocal Communication Systems of Nonhuman Species

For several decades, compositionality has been considered one of the hallmarks of human language, setting human communication apart from that of other animals (Hurford, 2011). In recent years, however, experimental evidence has shown that several species not only use complex repertoires to communicate with conspecifics, but that their communication systems also often show properties long considered to be uniquely human. Research on vocal communication in nonhuman species, in particular, has revealed similarities with humans in several important aspects of vocal communication, from vocal learning to phonology and syntax (Fishbein et al., 2019).

Several taxa, for instance, have complex sequential calls with regularly recurring patterns, hierarchical structures and/or specific ordering rules, suggesting that these properties of communication are phylogenetically widespread and likely to have emerged through convergent evolution (e.g., primates: Clarke et al., 2006; Hedwig et al., 2015; Girard-Buttoz et al., 2021; cetaceans: Shapiro et al., 2010; Allen et al., 2019; bats: Bohn et al., 2009; rock hyraxes, Procavia capensis: Kershenbaum et al., 2012; birds: Sasahara et al., 2012; Deslandes et al., 2014; Weiss et al., 2014; Cody et al., 2016; Hedley, 2016). Moreover, the communication systems of several nonhuman species share statistical patterns with human language. In line with Zipf’s law of brevity (Zipf, 1949), for example, more common vocalizations tend to be shorter in several species (e.g., Formosan macaques, Macaca cyclopis: Semple et al., 2010; bat species: Luo et al., 2013; male rock hyraxes: Demartsev et al., 2019; penguins: Favaro et al., 2020). In line with Menzerath’s law (Menzerath, 1954), longer sequences of vocalizations are composed of shorter calls in geladas (Theropithecus gelada) (Gustison et al., 2016) and gibbons (Nomascus nasutus and N. concolor) (Huang et al., 2020). Regularities in the structure of communication systems do not necessarily provide information about the meaning of sequences or about the compositional aspects of communication, but they suggest that some properties that constrain human language evolution are largely shared across species, and that the divide between communication in humans and other animals may be narrower than previously thought.
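
As an illustration of how a brevity pattern of this kind can be checked, the following minimal sketch computes a Spearman rank correlation between how often each call type occurs and its mean duration. It is not the analysis used in any of the cited studies; the function names and the toy call labels and durations are hypothetical and serve only to show the logic. A negative coefficient is what Zipf’s law of brevity would predict.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either variable is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    if den == 0:
        raise ValueError("correlation undefined for constant input")
    return num / den


def brevity_correlation(calls):
    """Spearman correlation between call-type frequency and mean duration.

    `calls` is an iterable of (call_type, duration_in_seconds) pairs.
    """
    durations = defaultdict(list)
    for call_type, duration in calls:
        durations[call_type].append(duration)
    types = sorted(durations)
    frequency = [len(durations[t]) for t in types]
    mean_duration = [mean(durations[t]) for t in types]
    return pearson(average_ranks(frequency), average_ranks(mean_duration))


# Hypothetical toy data: frequent calls are short, rare calls are long.
toy_calls = [("grunt", 0.2)] * 6 + [("bark", 0.5)] * 3 + [("scream", 1.1)]
rho = brevity_correlation(toy_calls)
print(f"Spearman rho = {rho:.2f}")
```

With real recordings, such a coefficient would need to be accompanied by an appropriate significance test and by controls for caller identity and context.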

Despite abundant evidence of combinatoriality in the vocal communication systems of several nonhuman species (Townsend et al., 2018), evidence of compositionality is scarcer. Among birds, compositionality has been shown in pied babblers (Turdoides bicolor; Engesser et al., 2016), and in Japanese great tits (Parus minor), who react differently to single notes and to their combinations (Suzuki et al., 2016). Among primates, evidence of compositionality has been found in putty-nosed monkeys (Cercopithecus nictitans), who can combine specific alarm calls into new vocalizations that have a novel meaning (Arnold & Zuberbühler, 2006, 2008). Furthermore, Campbell’s monkeys (Cercopithecus campbelli) can combine meaningful vocalizations into context-specific sequences (Ouattara et al., 2009a) and modify the meaning of their alarm calls by adding a suffix to the call (Ouattara et al., 2009b). For instance, adult males combine loud calls into specific sequences that are more likely used in certain contexts (e.g., in the presence of specific predators, or other nonpredatory animals) and which may vary depending on how the information transmitted was acquired (Ouattara et al., 2009a). Moreover, chimpanzees can combine pant hoots and food calls in a context-specific way, suggesting that the new sequences may acquire a novel meaning (Leroux et al., 2021).

Compositionality in Primate Gestural and Facial Communication

Although the previous section suggested that some primate species appear to compositionally recombine vocalizations, there is little evidence that they can also combine nonvocal signals (i.e., gestures and facial expressions) into sequences with a novel meaning. To date, there is virtually no research on the sequential combination of facial expressions in nonhuman species, although such combinations are sometimes referred to as “blended displays” (i.e., displays that share characteristics with at least two other prototypical facial expressions, but have no different meaning from them; Parr et al., 2005). In chimpanzees, for instance, blended displays are often used in the same context in which one of their component elements is typically produced, suggesting that combinations of facial expressions may mirror conflicting emotional states, rather than convey novel meanings (Parr et al., 2005).

With regard to gestural communication, primates have relatively large repertoires (Hobaiter & Byrne, 2011; Liebal et al., 2013). However, the few studies that have investigated sequences of gestures in primate communication systems have found no evidence that sequences have a novel meaning as compared to their elements (e.g., western gorillas, Gorilla gorilla: Genty & Byrne, 2010; Tanner & Perlman, 2017; Sumatran orangutans, Pongo abelii: Tempelmann & Liebal, 2012; chimpanzees: Liebal et al., 2004; Hobaiter & Byrne, 2011). In particular, although these studies have offered different conclusions about the emergence of gesture combinations, none of them has found evidence that primates combine gestures into longer sequences to specifically create novel meanings. Captive chimpanzees, for example, frequently produce sequences of gestures, but the majority of these sequences consist of simple repetitions of the same gestures, mostly tactile ones, which likely serve to increase recipients’ responsiveness (Liebal et al., 2004). Wild chimpanzees also use long and largely redundant sequences of rapid-fire variable gestures (Hobaiter & Byrne, 2011). However, older individuals are more likely to use single gestures, rather than sequences, suggesting that sequences are mostly used by inexperienced individuals to increase their chances of obtaining a response and to learn how to communicate more effectively with conspecifics (Hobaiter & Byrne, 2011). As in chimpanzees, sequences of gestures in Sumatran orangutans mostly consist of repetitions of the same gestures (Tempelmann & Liebal, 2012). Moreover, orangutans often continue to gesture regardless of whether the recipient responds, suggesting that sequences of gestures in this species may be mainly triggered by emotional arousal (Tempelmann & Liebal, 2012). Therefore, across primate species, there seem to be no significant differences between the meaning of a gesture when produced as part of a sequence and when used alone, casting doubt on the existence of compositionality in primate gestural communication (Liebal et al., 2004).

New Approaches to Study Compositionality in Nonhuman Species

Why should primates be able to compositionally combine vocalizations, but not other signals? Much of primate social and technological behaviour contains hierarchically structured sequences of actions (Estienne et al., 2017), and primates are surprisingly good at understanding hierarchically structured information (Watson et al., 2020). It is therefore unlikely that compositionality is limited to the vocal communication system because of cognitive constraints; methodological issues may better explain why there is not yet clear evidence of compositionality in primate gestural and facial communication. Accordingly, the following sections focus on some blind spots in comparative approaches to compositionality in nonhuman species, specifically primates, and make some suggestions on how to address them.

(Another) Call for Multicomponent and Multimodal Approaches to Study Compositionality in Primate Communication

One reason why primates may not show evidence of compositionality in their gestures or facial expressions is that primate communication is multicomponent and multimodal (Slocombe et al., 2011; Liebal et al., accepted), and compositionality might occur through the recombination of different elements within and across modalities. In particular, whereas multimodal combinations of signals include signals produced in different sensory modalities (e.g., visual, auditory, tactile) (Rowe, 1999), multicomponent combinations include multiple signals regardless of their modality (e.g., gestures and facial expressions) (Micheletta et al., 2013; Waller et al., 2013, 2022). However, very few studies have investigated whether primates compositionally recombine elements to create new meaningful elements with a multicomponent and/or multimodal approach.

A recent study on red-capped mangabeys (Cercocebus torquatus), for instance, investigated combinations of signals across different modalities (Aychet et al., 2021) and identified 424 combinations. The majority involved multiple components, often in more than one modality. Combinations differed in their complexity depending on the social context in which they were used, being more complex in playful and partially aggressive contexts compared with others (Aychet et al., 2021). Other studies have used such a multicomponent or multimodal approach but have not specifically assessed compositionality in these combinations. Some researchers have investigated how primates combine gestures with vocalizations but did not specifically assess whether these combinations elicit a different response compared with their single components (Hobaiter et al., 2017; Wilke et al., 2017). In wild chimpanzees, in particular, gesture-vocal combinations were relatively rare, but they were found in all age groups (Hobaiter et al., 2017; Wilke et al., 2017). Moreover, primates often used multimodal combinations after vocal signals that failed to elicit recipients’ response (Hobaiter et al., 2017), suggesting that primates might resort to combinations of signals to increase the probability of eliciting a response, rather than to convey novel meanings. Multimodal combinations were indeed more likely to elicit a response as compared to the single vocal elements they included (Wilke et al., 2017), although these studies unfortunately did not assess the kind of responses elicited by combinations and their elements.

Other researchers investigated multicomponent combinations in chimpanzees, focusing on the combination of gestures and facial expressions in a semi-wild setting (Oña et al., 2019). In this study, different combinations elicited different responses: adding a specific facial expression to a gesture modified the likelihood of a specific response to occur (Oña et al., 2019). In particular, the bared-teeth face was linked to an increase in recipients’ affiliative behaviour when it was combined with one gesture type (i.e., stretched arm gesture), but not with a different gesture type (i.e., bent arm gesture). Thus, this study did not conclude that the different facial and gestural components have specific meanings and that new meanings are consistently created through their combination (i.e., compositionality), but it suggested that facial expressions may fulfil the important function of modifying following or co-occurring gestures—a property that the authors referred to as “componentiality” (Oña et al., 2019).

Defining the Meaning of Signals: Challenges and New Approaches

A second reason why there is evidence that primates compositionally combine vocalizations, but no clear evidence that they compositionally combine other signals, might be that, at least in gestural studies, the “meaning” of signals has traditionally been inferred from the context in which signals are used (Liebal & Oña, 2018). Inferring the meaning of a gesture from the context in which it takes place, however, might be misleading, not only because some gestures are used in a variety of contexts, but also because, even within the same context, gestures may convey very different information (Hobaiter & Byrne, 2014). Therefore, a more promising approach may be to infer the meaning of gestures by monitoring the response they elicit from their recipients (Genty et al., 2009; Hobaiter & Byrne, 2014). When recipients react to a signal by producing nonaversive behaviour and signallers stop gesturing, the recipients’ behaviour may be considered to approximate the signallers’ goal and thus the meaning of the signal itself (Hobaiter & Byrne, 2014). Studying compositionality in gestural communication with this more precise operational definition of “meaning” might lead to different results.
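
To make the response-based definition of meaning concrete, the following minimal sketch tabulates the distribution of recipient responses following each signal and measures how far the distribution for a combination departs from those of its elements. The signal labels ("A", "B", "A+B"), the response categories, the toy counts, and the choice of total variation distance as a dissimilarity measure are all assumptions made for illustration; none of them is taken from the cited studies.

```python
from collections import Counter


def response_distribution(observations, signal):
    """Proportion of each recipient response that followed `signal`.

    `observations` is a list of (signal, response) pairs.
    """
    counts = Counter(resp for sig, resp in observations if sig == signal)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"no observations for signal {signal!r}")
    return {resp: n / total for resp, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0 means identical response profiles, 1 means no shared responses.
    """
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical toy data: two single gestures and their combination.
observations = (
    [("A", "groom")] * 8 + [("A", "play")] * 2
    + [("B", "groom")] * 7 + [("B", "play")] * 3
    + [("A+B", "groom")] * 1 + [("A+B", "play")] * 9
)

dist_a = response_distribution(observations, "A")
dist_b = response_distribution(observations, "B")
dist_ab = response_distribution(observations, "A+B")

print("A vs B:  ", round(total_variation(dist_a, dist_b), 2))
print("A+B vs A:", round(total_variation(dist_ab, dist_a), 2))
print("A+B vs B:", round(total_variation(dist_ab, dist_b), 2))
```

A combination whose response profile differs markedly from those of both elements would be a candidate for further testing, but a descriptive distance of this kind is not evidence of compositionality on its own: with real data, the comparison would require inferential statistics and controls for context and dyad.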

Gestures can be combined in many different ways, and used in a variety of contexts, so that vast datasets are required to obtain a comprehensive perspective on the possible meaning of gestures and their multicomponent combinations. Moreover, many combinations of signals are likely to contain elements of persistence and emphasis (Liebal et al., 2004), simply changing the intensity of the information conveyed. Therefore, in many cases, combinations of signals may trigger the same response as the single signals, despite having a different meaning (at least in terms of intensity). Large datasets including further measures of recipients’ responses may help detect these subtler changes in the meaning of signals. When the combination of signals alters the meaning of the single signals only in terms of intensity, for example, this change may be captured by measuring recipients’ reaction time.

Even the identification of single gestures is problematic. In chimpanzees, for instance, the number of different gesture types reported varies from more than one hundred (Roberts et al., 2014) to fewer than 30 (Pika et al., 2005), depending on how gestures are operationally defined. These top-down classifications are largely based on researchers’ intuition, but finer-grained distinctions based on bottom-up approaches that focus on the different forms of single gestures may reveal a more complex picture (Bard et al., 2019). For instance, have we really identified the meaningful units of gestural communication, or are current gestures themselves combinations of independent actions that convey meaning to the recipient? All our findings, so far, are conditional on the assumption that we have properly identified primate signals, something we unfortunately cannot be sure of.

Context: From Gestural Meaning to Meaning Modulator

With regard to meaning, the context also might play a crucial role, as one signal may convey different meanings depending on the context in which it is used. In chimpanzees, for instance, the bared teeth face modulated recipients’ response to different arm gestures, but only during affiliative events (Oña et al., 2019). Therefore, it is possible that individuals may attribute a different meaning to the same signals also depending on the context in which they are used. In bonobos (Pan paniscus), for instance, gestures can have different meanings depending on the specific context in which they are produced, whereas they do not appear to acquire a novel meaning when combined with other gestures (Graham et al., 2020).

However, context has often been used as a proxy for meaning, but rarely treated as a possible modifier of signal meaning. Moreover, the study of compositionality is often limited to very few contexts. Several studies, for instance, have investigated how certain signals acquire a novel meaning in the playful context, where individuals exchange unpredictable behavioural patterns that may easily escalate into overt aggression, so that clearly conveying the playful meaning through specific signals may be highly adaptive (Bateson, 1955; Burghardt, 2005; Pellis & Pellis, 2009; Spinka et al., 2016). However, signals may be compositionally combined in other contexts (e.g., aggression or feeding), and they may alter the meaning of behavioural elements that would otherwise trigger different responses, well beyond the play context. Unfortunately, data from contexts other than play are extremely scarce in the literature, likely because it is difficult to obtain high-quality information about multicomponent signals across a sufficient number of contexts, especially in the wild. Therefore, future studies on compositionality should systematically account for the various contexts in which single elements and their combinations are produced.

Meta-Communication as a Complementary Approach to Study Compositionality in Primate Communication

Meta-communication is typically used to refer to those signals (“secondary communication”) that alter the meaning of other behavioural elements (“primary communication”; Bateson, 1955; Mitchell, 1991, for a detailed discussion about the use of this term). Meta-communicative signals, for instance, are those that convey a playful meaning to behaviours that otherwise belong to the aggressive repertoire of a species, such as hitting or wrestling, or that are generally used in other functional contexts (Bateson, 1956; Bekoff & Allen, 1998; Fagen, 1981; Pellis & Pellis, 2009). In contrast to compositional combinations, meta-communicative signals add an extra layer to the communicative interaction, by communicating the true intention of an action or signal, which is produced in parallel (e.g., a play face signalling the playful intention before producing a potentially agonistic hit gesture). However, as in compositional combinations, signals and behavioural elements acquire a different meaning when they are combined compared with when they are independently produced. Therefore, meta-communication may prove an important complementary concept for studying compositionality and, more specifically, for investigating how a signal’s meaning changes when it is combined with a signal of a different type.

In primates, a variety of signals might serve a meta-communicatory function, including facial expressions, body gestures and vocalizations (Bekoff, 1972, 1995; Yanagi & Berman, 2014). Rhesus monkeys, for instance, use specific body movements (e.g., crouch and stare, gambolling, roll-onto-back-and-stare) to signal the onset of social play, and/or facilitate and moderate its occurrence (Sade, 1973; Yanagi & Berman, 2014; also see Petrů et al., 2008). Similarly, several primate species are thought to initiate social play by slapping, touching, or pawing at conspecifics, whereas looking between the legs or arm extensions may be used to maintain social play (Tomasello & Call, 1997). However, most research has focused on the meta-communicatory function of the play face (or relaxed open mouth-face), a facial expression that primates often use in the context of play (van Hooff, 1972; Palagi, 2008; Spinka et al., 2016). Play faces, in particular, appear to clarify the playful meaning of behaviours that may otherwise appear agonistic (Bekoff & Allen, 1998; Palagi, 2008, 2009; Pellis & Pellis, 1996), likely preventing aggressive escalations among players (Waller & Dunbar, 2005). Indeed, play faces more frequently occur during rough or contact play, such as wrestling (Chevalier-Skolnikoff, 1974; Demuru et al., 2015; Fedigan, 1972; Palagi, 2007; Palagi & Paoli, 2007), and they are often associated with longer play bouts (Palagi, 2007; Spinka et al., 2016; Waller & Cherry, 2012) and a higher number of players (Palagi, 2008). Therefore, play faces appear to serve a crucial communicative role during social play in primates, because they seem to convey the sender’s intention before producing a second behaviour (such as hitting or chasing), which could potentially also be used in agonistic interactions. Although such meta-communicative signals may disambiguate, rather than change, the meaning of a signal or behaviour, they may be a crucial element in compositional combinations of multicomponent elements.

Temporal Aspects of Compositionality

There is also no consensus in the literature on how temporally close two elements should be to be considered part of the same combination. Some authors consider signals to be elements of a combination when they are produced simultaneously, within or across modalities (Liebal et al., 2010; Partan, 2002). However, there is no agreement with regard to whether simultaneously produced signals have to overlap completely, or whether the second signal might follow with some delay. Similarly, for sequential combinations, intervals between two consecutive signals vary from one (Genty & Byrne, 2010) to several seconds (Liebal et al., 2004). With regard to meta-communicative signals, most authors suggest that they do not need to be produced right before the behavioural elements whose meaning they should modify, but they may be produced simultaneously or immediately afterwards (Beresin & Farley-Rambo, 2018; Fedigan, 1972; Palagi, 2008; Pellis & Pellis, 1996; Schwartzmann, 1979; Spinka et al., 2016). Therefore, play faces may repeatedly occur during the play bout to continuously assert the playful intent of behavioural patterns that happen before, during or after the play face, maintaining and prolonging social play rather than initiating it (Fagen, 1981; Palagi, 2007, 2008; Pellis & Pellis, 1996; Waller & Cherry, 2012; Wright et al., 2018; Yanagi & Berman, 2014). In vocal communication, in contrast, the elements of a combination cannot be produced simultaneously, but they must be in close temporal proximity to compositionally acquire a new meaning, and they also must be ordered in specific ways, because only certain sequences of elements trigger a novel response (Ouattara et al., 2009a; Suzuki et al., 2016).

For gestural sequences, researchers have usually established a priori time windows within which gestures have to occur to be considered part of the same sequence. In chimpanzees, for instance, gestures were considered to be part of the same sequence if they were directed in the same context to the same recipient, and they were not separated by more than 5 seconds from each other (Liebal et al., 2004). In gorillas, gestures were similarly considered to belong to the same sequence if they were separated from each other by no more than 1 second (Genty & Byrne, 2010). This approach provides a clear time window in which to analyse gestural sequences, but the choice of window may be arbitrary. Other researchers have therefore used a different approach, considering a sequence to be formed by all the gestures performed by an individual toward a recipient until the signaller stops gesturing and/or the recipient responds (Hobaiter & Byrne, 2011). However, this approach may be problematic, because it fails to operationally differentiate between gestural sequences that result from a failure to elicit a response, and those that may have been compositionally recombined to convey a novel meaning.
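
The window-based approach can be sketched as follows. This is a minimal illustration, not the coding scheme of the cited studies: the `Gesture` record, the `split_into_sequences` function, and the toy events are hypothetical, the gap is assumed to be measured from the offset of one gesture to the onset of the next, and the additional "same context" criterion is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    signaller: str
    recipient: str
    gesture_type: str
    onset_s: float   # start time in seconds
    offset_s: float  # end time in seconds


def split_into_sequences(gestures, max_gap_s):
    """Group gestures into sequences per signaller-recipient dyad.

    A gesture joins the dyad's current sequence when the pause since the
    previous gesture's offset is at most `max_gap_s`; otherwise it starts
    a new sequence.
    """
    sequences = []
    current_by_dyad = {}
    for g in sorted(gestures, key=lambda g: g.onset_s):
        dyad = (g.signaller, g.recipient)
        current = current_by_dyad.get(dyad)
        if current is not None and g.onset_s - current[-1].offset_s <= max_gap_s:
            current.append(g)
        else:
            current = [g]
            sequences.append(current)
            current_by_dyad[dyad] = current
    return sequences


# Hypothetical toy events for one dyad.
events = [
    Gesture("A", "B", "arm raise", 0.0, 1.0),
    Gesture("A", "B", "poke", 1.5, 2.0),          # 0.5 s after previous
    Gesture("A", "B", "poke", 5.0, 5.5),          # 3.0 s after previous
    Gesture("A", "B", "slap ground", 12.0, 12.5), # 6.5 s after previous
]

lengths_5s = [len(s) for s in split_into_sequences(events, max_gap_s=5.0)]
lengths_1s = [len(s) for s in split_into_sequences(events, max_gap_s=1.0)]
print(lengths_5s)  # [3, 1]
print(lengths_1s)  # [2, 1, 1]
```

The same four events yield two sequences under a 5-second window and three under a 1-second window, which illustrates how strongly the resulting sequences depend on the chosen threshold.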

A better approach may therefore be to statistically identify combinations of signals by extracting clusters from gestural sequences, based on the strength of association between the single elements (Kershenbaum et al., 2016). To do so, it may be possible to focus on the signals produced by the same individual toward the same recipient, which are not mere repetitions of the same signal (Liebal et al., 2004). Then, clustering algorithms may be used to detect temporal dependencies between specific signals, selecting clusters of elements that are more likely to occur together within the communication flow. For example, cluster sequence mining may account for the order of signal occurrences and the distribution of time intervals, generating and evaluating candidate clusters with hierarchical clustering without setting a priori temporal constraints (Fukui et al., 2019). By categorizing signals into clusters, researchers might identify regularities in primate communication systems and then test whether such clusters are compositional combinations of signals that have acquired a novel meaning. As for single gestures, the meaning of signal combinations may be inferred from the response given by the recipient (Hobaiter & Byrne, 2014) and then compared with that elicited by its single elements. In this way, it may be possible to assess whether recipients provide different responses to combinations compared with their elements and thus whether combinations are produced compositionally.

Similar approaches have been recently used to investigate combinations of signals in mangabeys and chimpanzees. In mangabeys, for instance, the authors used sequence analyses to identify possible multicomponent combinations of signals based on dissimilarity measures and then used network analyses to identify the typical combinations produced by mangabeys (Aychet et al., 2021). In this way, they could identify eight major kinds of combinations, which included signals across multiple modalities. In chimpanzees, the authors used collocation analyses to calculate the probability of signal co-occurrence (i.e., whether pant hoots and food calls co-occurred with a higher frequency as compared to combinations of either signal with other calls; Leroux et al., 2021). The authors showed that pant hoot and food calls, which are meaningful elements in the vocal communication systems of chimpanzees, were more likely to be combined together. These combinations were more likely to happen in specific contexts (i.e., when higher-ranking individuals were present and the food patches were larger; Leroux et al., 2021).
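
The general logic of asking whether two signals co-occur more often than expected can be illustrated with a simple association measure. The sketch below computes pointwise mutual information (PMI) for ordered pairs of adjacent call types; it is a simplified stand-in and not the collocation analysis used by Leroux et al. (2021). The function name, the call labels, and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sequences):
    """Pointwise mutual information for each ordered pair of adjacent calls.

    PMI > 0: the pair occurs more often than expected if the first and
    second positions were independent; PMI < 0: less often than expected.
    """
    pairs = Counter()
    first = Counter()
    second = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            first[a] += 1
            second[b] += 1
    total = sum(pairs.values())
    return {
        (a, b): math.log2((n / total) / ((first[a] / total) * (second[b] / total)))
        for (a, b), n in pairs.items()
    }


# Hypothetical toy sequences of call types.
toy_sequences = (
    [["pant_hoot", "food_call"]] * 6
    + [["pant_hoot", "grunt"]] * 1
    + [["grunt", "food_call"]] * 1
    + [["grunt", "grunt"]] * 4
)

pmi = bigram_pmi(toy_sequences)
for pair, score in sorted(pmi.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

PMI is known to be unstable for rare pairs, so with real data an analysis of this kind would need minimum-count thresholds and a significance test against an appropriate null model.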

Intentional Production of Compositional Structures/Combinations

There is currently no research on whether combinations, including those involving gestures, are planned, voluntarily produced means of communication, in which signallers intentionally combine elements to convey new meanings. Intentionality is a key feature of human language, and intentional production is an inherent part of the gesture definition in comparative communication research, although this aspect has been considered to a much lesser extent in facial and vocal research (Liebal et al., 2013). As a result, gestures are usually coded by default as intentionally produced means of communication that are flexibly adjusted to the recipient’s behaviour, even though they may not always be produced intentionally, whereas vocalizations and facial expressions are often presented as merely involuntary, spontaneous expressions of emotional states (Chevalier-Skolnikoff, 1974; Scopa & Palagi, 2016; Tomasello, 2008). Alternative approaches, in which all signals can include both an emotional and an intentional component independently of signal modality, may therefore be important (Graham et al., 2019; Heesen et al., 2021).

Importantly, although researchers have proposed different criteria to identify intentional communication, there is as yet no agreement on which and how many of these criteria need to be met for a signal to be considered intentional, nor is there consistency in how these criteria are applied across modalities. These criteria include, for instance, the social use of signals (i.e., signals are always produced in the presence of a recipient), sensitivity to the recipient’s attentional state (in the case of visual signals), and persistence and/or elaboration (i.e., signals are repeated until the recipient produces a response and might even be elaborated to elicit the recipient’s response in case of initial failure). However, these criteria have not yet been used to study compositionality, and to our knowledge intentionality is not usually considered a necessary prerequisite for compositionality, either in the human literature or in the literature on other species (Arnold & Zuberbühler, 2006, 2008; Engesser et al., 2016; Ouattara et al., 2009a, 2009b; Suzuki et al., 2016). Therefore, we still do not know whether primates combine different signals across modalities to create novel meanings, or whether these combinations are characterized by a set of markers indicating their intentional use.

Future Directions

Very few studies have addressed the compositional aspects of communication in species other than humans. These few studies have mostly focused on vocal communication, so we know very little about compositionality in primate gestural communication systems. Future research should consider several aspects to systematically address this issue. First, there is no consensus on how compositionality should be operationalized in nonhuman species, and there are often important differences from the approach used in humans. Terms, for instance, are not consistently applied across studies, and their meaning may differ considerably from the one traditionally used in human research. Addressing these issues will be necessary to lay the basis for more systematic comparative work on compositionality.

Second, the study of compositionality and meta-communication in primate gestural communication systems may benefit from the use of other methodological approaches. Candidate combinations of gestures, for instance, should ideally be identified with statistical approaches that allow a more objective clustering of signals. Because primate communication is multicomponent and multimodal (Slocombe et al., 2011), studies of compositionality should assess combinations of several gestures, of gestures with facial expressions, and of gestures with vocalizations. Moreover, the meaning of single gestures, facial expressions, vocalizations, and their combinations could be inferred from the response they elicit, rather than from the context in which they are produced (Hobaiter & Byrne, 2014). Combinations should also be studied across different contexts to assess specifically whether context modulates recipients’ response to gestural combinations (Oña et al., 2019). Furthermore, it would be interesting to study the specific characteristics of the combinations produced by primates, focusing especially on the presence of markers that suggest their intentional use, on the temporal distance at which the single elements are produced, and on the order in which the single elements are produced, as different orders of elements may affect the meaning of combinations (Townsend et al., 2018). This requires very large datasets covering a variety of primate signals and multicomponent combinations in many contexts and calls for international long-term collaborations with standardized methodological approaches.

Finally, another aspect to consider is how compositionality in gestural communication might emerge through development. By observing primates at different ages (i.e., from infancy to adulthood) and by following some individuals longitudinally, it would be possible to analyse how compositionality and its different properties emerge through development and whether they follow the same developmental trajectories as in humans. Moreover, it would be interesting to explore the social factors that may facilitate the emergence of compositionality in primate communication systems. For instance, it is possible that gestures (and their combinations), being more frequently used in affiliative social contexts (Pika et al., 2005), are more likely to be produced by individuals and groups that have better access to resources and need to invest less time in non-affiliative activities (e.g., higher-ranking individuals, captive groups). However, it is also possible that gestures (and their combinations) are more common among individuals living in socially more complex groups (e.g., in the wild), because they may need to use a wider variety of communicative signals to fulfil their social needs (Freeberg et al., 2012). Future systematic studies on the gestural communication systems of primates may not only provide the first clear evidence of compositionality in their gestural communication but also shed light on the evolutionary origins of one of the most crucial properties of human communication.