A fascinating but risky case of reverse inference: From measures to emotions!

Inferring emotions based on accessible signals is a tendency that we have both as social individuals and as scientists. Academia and industry have developed methods and devices aimed at detecting specific emotions (e.g., joy, anger or fear) based on physiological or behavioral signals. The current opinion paper argues that this is currently a risky path to be taken in terms of scientific validity. We argue that using measures to test hypotheses concerning emotions is efficient, but that going backward (using measures to infer emotions) is risky. We also argue that ways to circumvent this reverse inference issue include making use of converging evidence across the five components of emotion (cognitive appraisal, action tendency, expression, physiological reaction and feelings), and investing even more in methodological developments.

Corresponding author at: University of Geneva, Swiss Center for Affective Sciences, Chemin des Mines, 9, Geneva CH-1202, Switzerland. E-mail address: sylvain.delplanque@unige.ch (S. Delplanque).


What do we measure when we "measure emotions"?
Emotions do not happen by chance, they do not occur erratically, and they are not useless. In fact, understanding the determinants, nature, and functions of emotions has been the object of research programs in most of the academic disciplines interested in the mind, the brain, or behavior. Disciplines such as philosophy, linguistics, literature, history, psychology, anthropology, sociology, economics, neurosciences, and computer sciences all share an increasing interest in understanding emotions. This multidisciplinary call for understanding emotions is a key reason why the interdisciplinary emerging field of affective sciences was born a few decades ago (see e.g., Sander & Scherer, 2009). For some disciplines, but not for others, "understanding" goes hand in hand with "measuring." For instance, in psychology, neuroscience and affective computing, most research traditions involve measuring emotions in one way or another.
The number and quality of tools developed to measure emotions have increased dramatically in the past twenty years, and affective scientists nowadays have access to advanced toolboxes that allow specific hypothesis-driven measurements of emotional processes. This certainly means that we can increase our understanding of emotion by measuring some key variables in laboratory settings or in everyday life. But once measures are associated with emotions, scientists, private companies, and laypersons alike may be tempted to think that one can go backwards, from measures to emotions. Based on some measures, can we automatically detect specific ongoing emotions, even without asking the person what she feels? This "decoding" question is neither new nor specific to emotion, and has fascinated society through the ages: based on some objective measures, can we have direct and specific access to a psychological process, and even to a mental content? For instance, can we know whether someone is dreaming by measuring his/her electroencephalographic activity? And what about the content of the dream? Can we know whether someone is lying by measuring his/her facial expression? Consider the aim of the so-called "lie detector": using the physiological reactions of an individual to detect whether he/she lies, against his/her verbal report. Even in our daily mundane social interactions, we use the behaviors of others to infer what they feel. Do we not infer that two persons are in love if we see them kissing each other in the street? We may make wrong inferences, but we spontaneously make such inferences about the emotions of others based on social and affective cues (see Mumenthaler & Sander, 2019). Inferring emotions based on accessible signals (be they smiles or intracranial local field potentials) is a tendency that we have both as social individuals and as scientists.
Academia and industry have developed methods and devices aimed at detecting specific emotions (e.g., joy, anger or fear) based on physiological or behavioral signals. The current opinion paper argues that, even leaving ethical issues aside, this is currently a risky path to be taken in terms of scientific validity. More specifically, in what follows, we argue that using measures to test hypotheses concerning emotions is efficient, but that going backward (using measures to infer emotions) is risky. We also argue that ways to circumvent this reverse inference issue include making use of converging evidence across the five components of emotion, and investing even more in methodological developments (e.g., advances in pattern recognition methods). Reliable measures of emotions would be particularly useful for scientific and clinical research, but also in order to help patients suffering from emotional dysfunctions such as depression, anxiety, autism, or schizophrenia.

What measuring emotions means
Let us start with the question raised above: what do researchers measure when they "measure emotions"? Answering this question requires both a conceptual definition of what an emotion is, and an operationalization of how some aspects of an emotion may be measured.
With respect to defining emotion, it is beyond the scope of this opinion piece to review and analyze the varieties of definitions and models of emotion that have been suggested (see e.g., Sander, 2013 for such an analysis). We adopt here a view that we believe to be both consensual and of particular interest with respect to the measurement issue. As described in Fig. 1, this view considers that the construct "emotion" is typically used to describe the processes taking place in a short period of time during which an individual is subject to a number of changes in five components: the elicitation component, and four components of the emotional response, namely expression, feeling, autonomic psychophysiology, and action tendencies. What typically triggers changes in these components is the individual's appraisal of the importance of a (real or imagined) stimulus with respect to his/her well-being, goals, or concerns.
What differentiates emotion from other affective phenomena such as moods is that its onset is associated with a particular situation and that it is of short duration. A key function of emotions is to quickly prepare the individual to react to a situation in an adapted manner, orchestrating coordinated changes in behavioral and cognitive mechanisms. Each component of emotion participates in this adapted response. From this perspective, it is the way in which these five components of the organism are synchronously modified in response to a stimulus that defines which specific emotion is elicited (see Sander, Grandjean, & Scherer, 2018). The coherence among the emotion components is therefore an interesting signature of emotions, with possible links to well-being (see Brown et al., 2020; Mauss, Levenson, McCarter, Wilhelm, & Gross, 2005).
Establishing typologies and taxonomies of emotions (i.e., how many emotions or categories of emotions exist, and what the relations are between emotions and/or categories) has been crucial in the history of affective sciences. It has accompanied the debates on their genesis, their differentiation, and the ways they can be measured. This leads us to the issue mentioned above concerning the operationalization of how some aspects of an emotion may be measured. At a general level, it consists in determining at least two aspects of the emotion: its quality (i.e., which emotion is experienced), and its intensity. At a more specific level, it may consist in measuring simultaneously all the components of emotion; indeed, it seems illusory to think that a single measure could reveal both the quality and the intensity of an emotion. For instance, if we measure activity in a particular brain region when an individual feels an emotion, does it mean that we can infer that anyone who shows an activation in that brain region will feel this emotion? A famous example is the study of fear and the amygdala: evidence clearly shows that the amygdala is activated during fear. Does it mean that we can infer that someone feels fear because his/her amygdala is activated? Certainly not! Indeed, the amygdala is involved in many emotions other than fear, including positive emotions. Methodological advances in the rapidly evolving field of brain decoding (or "neural decoding") may, one day, allow the detection of mental processes based on brain activity (see Taschereau-Dumouchel & Roy, 2020). Currently, we are far from being able to infer the quality and intensity of an emotion based only on brain activity, although there are significant advances in this area (see Knutson, Katovich, & Suri, 2014; Kragel & LaBar, 2016; Putkinen et al., 2020). Obviously, brain imaging is just one way (and not the most common one) to measure emotions.
In what follows, we discuss various measures of the different components of emotion. Then, before concluding, we discuss the fact that any measure of emotion can only provide a probabilistic account of emotion inference, and that converging evidence across the components of emotions may be a way to increase this probability. Measuring all the components of emotion simultaneously during an emotional episode would increase the reliability with which the emotion experienced can be qualified and quantified.

Expression
Facial, vocal and postural emotional expressions represent powerful social signals. There is therefore a battery of measures of this expressive component. It would be far beyond the scope of this paper to describe all methods used to measure this component in a laboratory setting or in the field. Given the aim of this article, let us only mention that a common way to measure facial expressions in the laboratory is to measure the electrical activity of the facial muscles involved in emotional expression. The advantage of this technique is that contractions may be measured even when the person shows no visible expression, for instance because he/she regulates his/her emotion. But this technique can hardly be used for consumer research outside the laboratory. More and more, private actors are offering solutions on the market to decode emotions from facial expressions on the basis of video recordings or images. This makes facial expression analyses compatible, for example, with large-scale consumer or product development studies. Some solutions make it possible to differentiate between prototypical expressions of anger, sadness or disgust that are sufficiently intense. Their discriminating power is much reduced, though, when the objective is to measure fine emotional differences between globally positive and fairly similar products. In addition, there are many facial movements that are not related to emotional expression and can interfere. For example, it is difficult to properly measure emotional facial expression while a person is eating or even drinking. The facial movements caused by such activities considerably interfere with the emotion classification algorithms that are used by these automatic expression recognition systems.
Fig. 1. The emotion process is described in terms of a temporal dynamic involving antecedent motivations such as current goals, concerns, and values that the individual uses to appraise a given real or imagined event. The elicitation continues in the elicitation component that shapes an emotional response that can be measured on multiple components: autonomic physiology, action tendencies, expression, and feelings. The emotional processes modulate several cognitive processes such as attention, memory, learning, and decision-making (Reprinted from Pool and Sander (2021), with permission from Elsevier).

Autonomic reaction
During an emotional episode, physiological changes allow the body to react to the demands of the situation. The specificity of physiological reactions during a specific emotion is a very old debate in the field, at least since William James suggested that strong emotions show specific bodily responses. A large number of physiological measures have been used to qualify and quantify this bodily response (e.g., cardiovascular, electrodermal, and respiratory responses). These measures, which were originally used in the laboratory, have become widely accessible, so that their use in non-laboratory contexts can now be considered. It is true that variations in these measures are observed during emotional reactions. However, these bodily adaptations are of low intensity compared to those that naturally occur under other conditions. Cardiovascular measurements, for example, are extremely sensitive to the individual's effort, and the part of the variation that is due to an emotion is largely drowned out. Another crucial point is that autonomic responses are very different when emotional situations require very different adaptations to the environment (e.g., in fear versus happiness). It is therefore, again, very difficult to observe emotional differences when using these measures to investigate reactions to similar situations or products that do not require markedly different adaptations. Psychophysiological measures reflect the activation of the sympathetic and parasympathetic nervous systems, which support, respectively, ergotropic (i.e., propensity for energy expenditure) and trophotropic (i.e., propensity for energy renewal) functions. The physiological variations observed are therefore overwhelmingly related to these two functions. This considerably limits the possibility of qualifying numerous emotional states. 
Recent meta-analyses confirm that these measures cannot give clear qualitative indications about the identity of the emotion (see Siegel et al., 2018), and they are therefore unlikely, on their own, to reveal the specificity of an emotion.

Action tendencies
Action tendencies correspond to states of the individual that prepare the action for achieving a particular relation with the object that the emotion is about (e.g., approach/avoid in anger/fear, rejection for disgust, submission in shame). Methods to measure action tendencies are relatively scarce as compared to the other components of emotion. Many aspects remain to be explored in order to efficiently measure action tendencies in response to situations or products. They can be investigated using reaction time tasks, posture measurements or verbal self-reports. However, to allow correct interpretations, the measurement of action tendencies requires very specific protocols that must generate these tendencies. A typical paradigm is the use of a joystick that is pushed or pulled in order to measure approach or avoidance. However, there is much debate about whether pushing the joystick corresponds to naturally moving the arm towards the stimulation (i.e., tendency to approach) or pushing away the same stimulation (i.e., tendency to avoid). This debate also exists when the joystick is pulled. The implementation of rigorous protocols is a prerequisite for drawing sound conclusions, and it still prevents easy large-scale use of action tendency measures, not only in the lab but also for consumer testing or product development.

Elicitation: appraisal processes
The way in which the situation is evaluated and gives rise to an emotional response is a great subject of debate in affective sciences. For instance, appraisal theories suggest that the situation is evaluated according to specific criteria such as how relevant the situation is for the individual given his/her current concerns (e.g., primary appraisal), and how he/she can cope with the situation and its consequences (e.g., secondary appraisal). Appraisal theorists have suggested the existence of numerous other criteria that shape a large number of specific emotions (see Scherer & Moors, 2019). The advantage of conceiving the elicitation and differentiation of emotions as a causal succession of appraisals according to defined criteria is that it is possible to manipulate these criteria to design more targeted emotional products (e.g., manipulating the degree of novelty of the product, its correspondence to the needs of the individual). The measurement of evaluation processes requires protocols that trigger such processes (e.g., new stimuli are needed to investigate novelty). Measures from these appraisal processes can be self-reported, as the individual is then engaged in an introspection of the antecedents of emotions. However, not all of these processes are conscious, and available through introspection. It is then possible to use measures of reaction time or accuracy in tasks that target appraisals at an implicit level. Central psychophysiological measures (e.g., EEG and MRI) have also been used for uncovering implicit or explicit processing differences that explain the elicitation and differentiation of emotions.

Feeling
A feeling corresponds to the (typically conscious) subjective experience of an emotion by the individual. The measurement of this experience is mostly carried out using dedicated questionnaires with open or closed questions. It is by far the most widely used measure of emotion because it is easy to set up and it characterizes very finely the various conscious emotional states of the individual. Despite its numerous limitations (e.g., difficulty of introspection or social desirability), it is an essential measure of the conscious part of the emotional response because ultimately only the individual who feels the emotion can characterize what she is experiencing from the first-person perspective. Many different questionnaires exist, and they are generally constructed according to the theory of emotions that their authors adopt (e.g., basic emotions, dimensional, or appraisal approaches). To be efficient, these questionnaires should take into account as much as possible the specificity of the field studied; for example, the set of emotions generated by music is not the same as the one generated by odors. The researcher who prepares a questionnaire to measure feelings is very often on the razor's edge. On the one hand, proposing a long list of emotional terms so as not to miss out on an emotion poses logistical problems, especially for consumer studies where the questionnaires should not take too long to complete. On the other hand, if we use overly broad concepts such as liking, then we risk missing qualitative differences in emotions that have similar degrees of liking (which is often the case in consumer research). There is, for instance, an increasing number of publications dealing with emotion-related questionnaires (e.g., scales) in response to different products. All these tools provide valuable information on emotional experience in different cultures or according to inter-individual variables.

Emotional concomitant, correlate, marker or invariant?
We have just discussed the fact that emotion is a multi-component construct and that it can be measured from several approaches. An important question when trying to qualify and quantify emotions is whether some of these measures are more informative than others. There are at least two ways in which we can consider the status of a measure with respect to emotions: a concomitant/correlate or a marker/invariant. These terms correspond to degrees of association and specificity, from least to most, between any given measure and the construct "emotion" [see Cacioppo, Tassinary, & Berntson, 2007 for an introduction on the different relationships between different domains]. In a one-to-one specific relation, a given emotion at a given intensity (e.g., intense fear) would be associated with one (and only one) measurement pattern, and vice versa. Such a measure is called a marker (or an invariant when the association is context independent), and this is what any researcher would ideally aim at finding. But, to the best of our knowledge, there is no marker (in this strong sense) for any given emotion. An example of what one could wrongly think of as a marker of happiness would be a smile. It is indeed the case that a smile is particularly associated with happiness (e.g., people smile more when they are happy than when they feel other emotions). However, some people under some circumstances may smile without being happy (e.g., an affiliative smile) or may be happy without smiling (e.g., when using a suppression strategy). It seems to us that one should be skeptical when told about the discovery of a single (bio)marker of a given emotion (e.g., a given muscular activity or the activity in a given brain region). The majority of relationships between emotions and measures are of the "many-to-one" type. This means that several cognitive processes, mental phenomena or even physiological states other than emotions will affect the measurement. 
A measure of this kind is referred to as a concomitant or correlate. We can therefore say that each central or peripheral psychophysiological measure, and each behavioral measure such as reaction times, accuracy, facial action units, or approach and avoidance movements, constitutes a correlate of the emotion. There are other relationships between emotions and measures that are more informative. These relationships appear when one considers not one but several measures and/or components at the same time. We speak, for example, of a one-to-many relationship when a particular emotion is associated with a subset of measures of the different components. It is easy to understand that adding measures of the different components will make it easier to characterize existing differences between emotions. Consider for example the distinction between fear and anger. The physiological measures/correlates of these two emotions could be very similar because they are both unpleasant and arousing. Adding a measure of action tendencies can theoretically differentiate between the two because fear is rather associated with avoidance of the eliciting situation while anger is rather associated with approaching the eliciting situation. Other components, such as the appraisal, the expressive, and the feeling components may certainly be useful too: in anger the coping potential is typically higher than in fear, a typical angry face shows more corrugator activity than a typical fearful face, and, obviously, people may report feeling either anger or fear. The ideal measurement situation is therefore one where the researcher can bring converging evidence across the components. Such a setting increases the probability of detecting the emotion and its intensity. Ideally, one would use measures that each add non-redundant qualitative and/or quantitative information to the others. 
Taken together, a given combination of measures could come closer to the notion of "marker" discussed above for a given emotion at a given intensity. Such an "emotional marker" pattern should not be thought of as a static snapshot of different measures but as a set of dynamic and coordinated variations of the different measures. This kind of approach is very underdeveloped as it requires both advanced analytical multivariate methods and theoretical assumptions to integrate measures from very different domains and timescales.
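The fear/anger contrast described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the profile labels and the helper `emotions_matching` are hypothetical, the values are coarse paraphrases of the contrast described in the text rather than empirical data, and real measures are graded and dynamic rather than categorical.

```python
# Hypothetical, coarse profiles for illustration only; the labels paraphrase
# the fear/anger contrast in the text and are not empirical data.
PROFILES = {
    "fear": {
        "physiology": "high arousal",
        "action_tendency": "avoid",
        "appraisal_coping": "low",
        "expression_corrugator": "lower",
    },
    "anger": {
        "physiology": "high arousal",
        "action_tendency": "approach",
        "appraisal_coping": "high",
        "expression_corrugator": "higher",
    },
}


def emotions_matching(observed, profiles=PROFILES):
    """Return the emotions whose profile agrees with every observed component."""
    return sorted(
        emotion
        for emotion, profile in profiles.items()
        if all(profile.get(component) == value for component, value in observed.items())
    )


# One component alone: the pattern is shared, so the measure is only a correlate.
print(emotions_matching({"physiology": "high arousal"}))  # ['anger', 'fear']

# Adding a second component narrows the set of candidate emotions.
print(emotions_matching({"physiology": "high arousal", "action_tendency": "avoid"}))  # ['fear']
```

In this toy setting, the physiological measure alone is compatible with both emotions, whereas the combination of two components singles out one of them, which is the sense in which a combination of measures may come closer to a marker.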

The issue of reverse inferences
As discussed in the first section of this paper, the temptation may be strong to make qualitative and quantitative inferences about the emotion based on one or more measures taken. Using only one type of measurement and interpreting it solely through the prism of emotion can lead to misinterpretation, related to what some researchers have called the "reverse inference fallacy". This fallacy was particularly influential in the context of neuroimaging studies (Poldrack, 2006). We can use the example proposed in the introduction to illustrate this notion. Consider using functional magnetic resonance imaging (fMRI) to characterize the brain's response to fear. You find that in fear situations, the amygdala is more activated than in a neutral situation. You then deduce that the amygdala is involved in fear (i.e., forward inference). Can you then deduce that when the amygdala is activated, it shows that the participant was afraid? Answering yes to this question is a case of reverse inference fallacy. This inference is valid if and only if the amygdala activation is specifically related to fear. In other words, this inference is valid if and only if the activation of the amygdala is a marker of fear, which, as mentioned above, is not the case. In the same way, consider a researcher who presents several products, some of which induce the emotion of "interest" as verbally reported by the participant, and observes an increase in heart rate in response to the products that induce "interest" compared to products that induce other emotions. Heart rate therefore seems appropriate for measuring an emotional state related to product presentation (i.e., forward inference). The question then is: can he/she systematically deduce that when a product generates a higher heart rate, the participant is more interested? Of course not: heart rate is not a specific emotional marker, and even less a specific marker of "interest". 
Fluctuations in heart rate are also, for instance, indices of novelty detection, orienting, startle or defensive reactions, or effort. If the researcher relies solely on heart rate to infer the emotional state, then there is a high risk of misinterpretation. This is one reason why it remains essential, but not sufficient, to measure the subjective component of emotion in order to corroborate the information acquired at the level of the other components of emotion. In particular, if it is the case, as suggested by some multi-component theories, that the feeling component integrates the outcome of the other components, then verbal report is a particularly interesting measure to consider (in particular if one can avoid interference from social desirability). Keeping with this issue of emotions elicited by products, let us remember that the function of an emotion is to prepare the individual to (often quickly) react to a situation in an adapted manner. Each component of the emotion participates in this adapted response by fulfilling an important function (e.g., expression, physiological support, action tendencies). When a component of emotion is measured, it therefore gives information about the function it reflects, not only about the emotion. It would therefore be particularly interesting not only to investigate what emotion may be generated by a product, but also to better understand how and to what extent specific functional components (e.g., appraisal, feeling or action tendencies) are modified by specificities of the product. Here are a few examples of more specific measures that could be envisaged: Measuring the physiological component could help to design products that are specifically activating or relaxing. Measurement of action tendencies could help to design products that promote approach, excitement or "being-with" tendencies. 
Setting up questionnaires not based on emotion labels but on the underlying evaluation criteria could make it possible to produce products that promote novelty, goal relevance, goal conduciveness or coping potential.
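The logic of the reverse inference issue discussed in this section can be made explicit with Bayes' rule, in the spirit of the neuroimaging discussion cited above (Poldrack, 2006). The sketch below is a minimal illustration, not an analysis method proposed in this paper: the function name `posterior` and all probabilities are hypothetical numbers chosen only to show how a strong forward association can coexist with a weak reverse inference.

```python
def posterior(p_measure_given_emotion, p_measure_given_other, prior_emotion):
    """Bayes' rule: P(emotion | measure observed)."""
    evidence = (
        p_measure_given_emotion * prior_emotion
        + p_measure_given_other * (1.0 - prior_emotion)
    )
    return p_measure_given_emotion * prior_emotion / evidence


# Hypothetical numbers, chosen only to illustrate the logic.
# Forward inference looks strong: heart rate rises in 80% of "interest" episodes.
# But it also rises in 50% of other episodes (novelty, orienting, startle, effort),
# and "interest" is assumed to occur in 20% of product presentations.
p = posterior(p_measure_given_emotion=0.8, p_measure_given_other=0.5, prior_emotion=0.2)
print(f"P(interest | heart-rate increase) = {p:.2f}")  # 0.29

# Only if the measure never occurred outside the emotion (a marker in the strong
# sense) would the reverse inference be certain.
print(posterior(p_measure_given_emotion=0.8, p_measure_given_other=0.0, prior_emotion=0.2))  # 1.0
```

With these assumed values, observing a heart-rate increase raises the probability of "interest" only from 0.20 to about 0.29, because the same response is frequent in other states.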

Conclusion
In conclusion, we would like to use this example of product development in order to highlight our two major claims: 1) the hypothesis-driven approach to emotion measurement is useful. There are fundamental questions that can support the hypothesis-driven approach: Why should my product generate specific emotions? What are the theoretical reasons to think that my product will generate, for instance, pride, elation, joy, satisfaction, relief, hope, interest, serenity, or surprise? What are the concerns, goals or needs that my product facilitates? Given that an emotion is typically triggered and differentiated according to evaluations made on a number of criteria (e.g., novelty, relevance, goal obstructiveness, coping potential, consistency with norms and values), can we link these criteria to the product?
2) measures of emotions are not markers of emotions, and the risk of making wrong reverse inferences is high when focusing blindly on a single measure; however, an increase in converging evidence obtained from the five components of emotion decreases this risk. As an example, consider an object designed to be interesting. When shown to a group of participants, this object elicits a stronger skin conductance response than other objects. Concluding that this object is more interesting because the skin conductance response is higher would be a risky reverse inference. However, consider the five components together: the participant shows an increased skin conductance response (autonomic physiology component), his/her eyebrows are lowered (expression component), he/she shows signs of approach (action tendencies), evaluates the object as unfamiliar, complex but understandable (appraisal), and reports verbally that he/she feels interested. The coherence between the outcomes of the responses increases the likelihood that the person feels the emotion of "Interest" when presented with the given object.
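The way converging evidence increases this likelihood can be illustrated with a simple odds update. The following Python sketch rests on explicit assumptions that are not claims of this paper: the likelihood ratios are hypothetical, the function `posterior_from_components` is introduced only for illustration, and the components are treated as conditionally independent given the emotion, which is a simplification since the components of an emotion are coordinated.

```python
from math import prod


def posterior_from_components(prior, likelihood_ratios):
    """Update prior odds with one likelihood ratio per component.

    Assumes the components are conditionally independent given the emotion,
    which is a simplification made only for this illustration.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios: P(observation | interest) / P(observation | no interest).
components = {
    "skin conductance response": 1.6,
    "lowered eyebrows": 1.5,
    "signs of approach": 2.0,
    "appraised as unfamiliar, complex but understandable": 2.5,
    "verbal report of interest": 6.0,
}

prior = 0.2  # assumed base rate of "interest" for this kind of object
single = posterior_from_components(prior, [components["skin conductance response"]])
combined = posterior_from_components(prior, components.values())

print(f"skin conductance alone: {single:.2f}")
print(f"all five components:    {combined:.2f}")
```

Under these assumptions, a single weakly specific measure barely moves the estimate, whereas five coherent observations make "interest" the most plausible account; the result remains a probability, not a certainty.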
Although we focused on emotions in this opinion paper, we believe that much of the rationale that we defend here also applies to other affective phenomena such as well-being or stress for which the hypothesis-driven search for converging evidence across components may be a protective factor against risky reverse inferences.