Context-prosody interaction in sarcasm comprehension: A functional magnetic resonance imaging study

enhanced this rating. When the context was negative, the positive prosody effect disappeared, while negative prosody increased the sarcasm rating. Thus, context-content incongruity is the primary determinant of sarcasm comprehension and is modified by prosody in a context-dependent manner. Neuroimaging results showed that the context-content incongruity effect was notable in the cerebellum and the mentalizing network, representing what was uttered in a particular context. The content-prosody incongruity effect was observed in the bilateral amygdala, representing the manner of utterance. The interaction between these incongruity effects was found in the bilateral dorsolateral prefrontal cortex, extending to the inferior frontal gyrus and the salience network, including the anterior insular cortex and the caudal part of the dorso-medial prefrontal cortex. These findings indicate that two distinct incongruity detection systems for sarcasm comprehension are integrated in the prefrontal cortices through the salience network.


Introduction
Sarcasm mediates implicit criticism of the listener by provoking negative emotions with disapproval, contempt, and scorn. Sarcasm is a subcategory of irony, which aims to convey an intention that cannot be conveyed literally, either for positive goals (e.g., humor, emphasis) or for opposing goals (e.g., sarcasm, criticism) (Utsumi, 2000). This study focuses on sarcasm. Speakers express negative emotion (e.g., disappointment, anger, reproach, envy) regarding the difference between their expectations and failure. They implicitly display these emotions by various means: an allusion to the expectation, a pragmatic insincerity by intentionally violating one of the pragmatic principles, and an indirect expression of the negative emotion toward failure to meet their expectation (Utsumi, 2000). During the conversation, sarcasm is perceived as a multi-layered incongruity among the utterance's context, content, and prosody. The incongruity between the context and the content of utterance enables the hearer to understand the ironic intent of the speaker (Ackerman, 1983; Colston, 2002; Ivanko and Pexman, 2003; Katz and Lee, 1993; Katz and Pexman, 1997; Kreuz and Glucksberg, 1989). The larger the disparity between the content and the discourse context, the more disapproval the hearer perceives in the irony (Colston and O'Brien, 2000; Gerrig and Goldvarg, 2000). Previous studies pointed out the importance of prosody in ironic understanding (Bryant and Fox Tree, 2005; Caillies et al., 2019; Capelli et al., 1990; Laval and Bert-Erboul, 2005; Le Gall and Iakimova, 2018; Tobe et al., 2016; Wickens and Perry, 2015). The ironic prosody is often enough for a hearer to identify an utterance as an instance of irony (Bryant and Fox Tree, 2005; Capelli et al., 1990). 
As children first use ironic prosody as an effective clue for irony comprehension at around the age of 5 years, before they can also make use of the discourse contexts as a clue (Laval and Bert-Erboul, 2005), ironic prosody is a distinctive cue for irony comprehension. Furthermore, the ironic prosody interacts with the discourse context in sarcasm comprehension (Rivière et al., 2018; Voyer et al., 2016; Woodland and Voyer, 2011).
Consider the following example from Utsumi (2000): A mother asked her son to clean up his messy room, but he was lost in a comic book. After a while, she discovered that his room was still messy, and said to her son: "This room is totally clean!" The son initially assumed that his mother spoke directly and honestly. However, when he recognized the incongruity between what he had expected to hear from his mother (i.e., the discourse context) and what he had heard (i.e., the utterance content), he was geared toward resolving context-content incongruity. The hearer also recognizes incongruity by detecting adverse effects displayed implicitly by paralinguistic cues such as affective prosody, facial expression, and verbal cues such as hyperbole (Utsumi, 2000).
The "implicit display theory" (Utsumi, 2000) holds that two factors are critical for an utterance to be interpreted ironically. The first factor is the ironic environment in which the discourse situation is located. The second factor is the negative attitude that is implicitly displayed by the speaker. The theory suggests that comprehension of sarcasm requires at least one of these factors, while the other factor enhances this understanding (Utsumi, 2000). Within this theoretical framework, context-content incongruity corresponds to the implicit display of an ironic environment, and content-prosody incongruity corresponds to the indirect expression of negative attitudes. In other words, this theory predicts that the degrees of sarcasm due to context-content incongruity and content-prosody incongruity additively determine the overall degree of sarcasm (Utsumi, 2000). For example, the negative context of a messy room in the example above leads the hearer to anticipate an utterance with negative content from the mother. When the mother's utterance instead has positive content, it acts in the direction that increases the degree of sarcasm due to pragmatic insincerity, i.e., the context-content incongruity. When the utterance has negative prosody, it acts in the direction that increases the degree of sarcasm due to the indirect expression of negative attitude, i.e., the content-prosody incongruity. As a result, the two effects additively provide the strongest sarcasm (Utsumi, 2000). When the utterance has positive prosody, it acts in the direction that decreases the degree of sarcasm. The effects of the two different directions may cancel each other out, and the degree of sarcasm is weaker than with negative prosody. However, in the absence of context-content incongruity, the degree of sarcasm will be determined solely by the content-prosody incongruity. 
Importantly, sarcasm is understood as a clue to some incongruity, and the degree of sarcasm is related to the additive result of each incongruity (Utsumi, 2000).
Regarding the negative attitudes that are characteristic of sarcasm (Utsumi, 2000), previous studies have reported activation of the amygdala, which suggests emotional processing related to social behavior (Akimoto et al., 2014; Uchiyama et al., 2012). The amygdala forms part of the relevance network (Nakamura et al., 2018; Ousdal et al., 2008; Sander et al., 2003). The affective prosody network consists of the medial prefrontal cortex (MPFC), the inferior frontal gyrus (IFG), and the anterior insula (AI) (Frühholz et al., 2016). Thus, during a conversation, the affective prosody network is expected to interact with some areas implicated in mentalizing and semantics. In the auditory modality, Matsui et al. (2016) found a context-prosody incongruity effect in the left IFG, with an additional negative prosody effect in the anterior cingulate cortex (ACC) and bilateral AI. The ACC and AI comprise a salience network that marks salient events (Seeley et al., 2007) and initiates appropriate control signals for additional processing, such as the maintenance and implementation of task sets (Dosenbach et al., 2006; Nelson et al., 2010) and the coordination of behavioral responses (Medford and Critchley, 2010), recruiting the executive network (Menon and Uddin, 2010; Sridharan et al., 2008). In this context, Matsui et al. (2016) speculated that the salience network is involved in the detection of content-prosody incongruity, which in turn initiates further processing involving context incorporation, which is mediated by the rostral IFG (BA 47). It remains unclear, however, how these two factors, context-content incongruity and content-prosody incongruity, are integrated to improve sarcasm understanding; moreover, the neural underpinnings of this integration remain unknown. As the majority of the neuroimaging studies of sarcasm comprehension adopted reading materials, the context-prosody interaction has rarely been investigated (Matsui et al., 2016; Uchiyama et al., 2006).
Based on the abovementioned findings, we hypothesized that content-prosody incongruity modifies the context-content incongruity effect. We considered that the areas implicated in semantic processes would integrate these effects through the salience network. To test this hypothesis, we conducted a functional magnetic resonance imaging (fMRI) study with an auditory sarcasm detection task. We extended the auditory modality-based task utilized in a previous study (Matsui et al., 2016), in which the participants observed daily conversational interactions wherein a child did something either good or bad, about which a parent made a positive comment. We extended this task in two ways. First, Matsui et al. (2016) used, in an experiment involving adults, stimuli designed to be usable with both children and adults; thus, the provided context might have been too easy for adults. In this study, we made the context situation more complex, to enhance involvement of the mentalizing network. Second, we converted the materials to the second person, that is, critical comments were directed at the participants themselves, rather than at a character in a scenario, thus making the situation more conducive to sarcasm comprehension.

Participants
Twenty-three healthy individuals participated in the study as paid volunteers for the fMRI experiment. One participant was excluded due to high rates of response errors in the judgment phase, leaving 22 participants for the final analysis (11 women and 11 men; mean age, 21.7 years; range, 19-36 years). All participants had normal or corrected-to-normal visual acuity, were right-handed (including one who was ambidextrous) according to the Edinburgh handedness inventory (mean score: 80.5; range, 11.1-100) (Oldfield, 1971), and had no history of neurological or psychiatric illness.
All participants provided written informed consent for participation in the study. This study was approved by the Ethics Committee of the National Institute for Physiological Sciences, Japan. The experiments were undertaken in compliance with national legislation and the Code of Ethical Principles for Medical Research Involving Human Subjects of the World Medical Association (Declaration of Helsinki).

Auditory discourse task
The auditory discourse task consisted of two parts: the first (S1) was a three-sentence story that explained the situation of the participant who was in the scanner (you), whereas the second (S2) commented on you. S2 represented the target sentence. The target sentence could be interpreted differently, depending on the context provided in S1 (Mano et al., 2009; Matsui et al., 2016; Uchiyama et al., 2012) and the prosody of S2. We manipulated the context via S1 to make sentence S2 either congruent or incongruent with the context. Furthermore, S2 was presented with positive, negative, or monotone prosody, while the prosody of S1 was always monotone. Thus, during fMRI scanning, participants listened to short narratives (S1), followed by a target sentence (S2). The task was to judge how sarcastic the target sentence S2 sounded. An example of an everyday conversation is given below.
S1. (1) You and your friend sing together in the same opera. (2) During your performance, you often sing off-key. (3) After the show, your friend says to you:
S2. (4) "Tonight, you gave a superb performance."
In this example, S1 indicates that you (the fMRI participant) were in error. In this context, the friend's comment in S2 should be interpreted as an example of sarcasm. In contrast, in the following example, S1 provides a positive situation, and the same S2 should be interpreted literally.
S1. (1) You and your friend sing in the same opera. (2) The show was excellent, and you received a long ovation. (3) After the show, your friend says to you:
S2. (4) "Tonight, you gave a superb performance."
We fixed the valence of the uttered content of S2 as positive. Thus, there were six patterns of context-content-prosody combination for S2 (abbreviations: P, positive; N, negative; and m, monotone): PPP, PPN, PPm, NPP, NPN, NPm, creating a 2 (Context) × 3 (Prosody) design matrix. All auditory stimuli were in Japanese.

Recording of auditory stimuli
The stimuli were recorded by a female actor in a silent room, using a microphone (SM58; Shure, Evanston, IL), an audio interface (0202; E-MU Systems, Scotts Valley, CA), and a personal computer (ThinkPad X201; Lenovo, Morrisville, NC).

Norming study
To confirm that our experimental stimuli with these auditory utterances were indeed perceived as sarcasm in the negative context (NPP, NPN, and NPm) conditions, 45 volunteers (28 females and 17 males; mean age, 36.5 years; range, 22-64 years) participated in a norming study. We presented the experimental stimuli in a pseudo-random order and asked the participants to rate how sarcastic the partner's comment sounded in S2, using a five-point scale (5 = sarcasm; 1 = literal, i.e., not sarcastic). The mean scores of these sarcasm judgments were (Fig. 1(a)): 1.06 for the PPP condition, 2.88 for the PPN condition, 1.73 for the PPm condition, 3.87 for the NPP condition, 4.94 for the NPN condition, and 4.18 for the NPm condition. A two-way analysis of variance (ANOVA) of the discourse context (positive and negative) and the affective prosody (positive, negative, and monotone), with Bonferroni's correction for multiple comparisons, conducted on the sarcasm judgments revealed that the main effects of discourse context (F (1, 44) = 391.247, mean squared error (MSE) = 1.029, p < 0.001), affective prosody (F (2, 88) = 78.672, MSE = 0.619, p < 0.001) and the interaction between discourse context and affective prosody (F (2, 88) = 9.907, MSE = 0.316, p < 0.001) were significant. The nature of this interaction was such that when a discourse context involved a positive event, an utterance with negative prosody was judged as significantly more sarcastic than one with monotone prosody (Bonferroni's, p < 0.001), which was more sarcastic than the one with positive prosody (Bonferroni's, p < 0.001) (F (2, 176) = 81.507, MSE = 0.467, p < 0.001). 
However, for a discourse context involving a negative event, an utterance accompanied by negative prosody was judged as significantly more sarcastic than an utterance with monotone prosody (Bonferroni's correction for multiple comparisons, p < 0.001) which was more sarcastic than one with positive prosody (Bonferroni's, p < 0.05) (F (2, 176) = 29.386, MSE = 0.467, p < 0.001). These results demonstrated that our experimental stimuli were well-controlled in that the S2 with positive content was perceived as sarcastic in the negative context.
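As a reading aid, the following minimal sketch recomputes the context effect at each prosody level from the six norming means reported above. It uses only the published cell means; the dictionary and function names are illustrative assumptions, and the sketch does not reproduce the ANOVA itself, which requires the participant-level ratings.

```python
# Norming-study cell means (five-point scale), copied from the text.
# Labels: context-content-prosody, with P = positive, N = negative, m = monotone.
norming_mean = {
    "PPP": 1.06, "PPN": 2.88, "PPm": 1.73,
    "NPP": 3.87, "NPN": 4.94, "NPm": 4.18,
}

def context_effect(prosody, means=norming_mean):
    """Negative-context mean minus positive-context mean at one prosody level."""
    return round(means["NP" + prosody] - means["PP" + prosody], 2)

def context_marginal(context, means=norming_mean):
    """Mean rating for one context, averaged over the three prosody levels."""
    cells = [means[context + "P" + p] for p in ("P", "N", "m")]
    return round(sum(cells) / len(cells), 2)

context_effects = {p: context_effect(p) for p in ("P", "N", "m")}
marginals = {c: context_marginal(c) for c in ("P", "N")}

print(context_effects)  # difference in rating attributable to context
print(marginals)        # marginal mean per discourse context
```

Under these means, the negative context raises the rating by more than two scale points at every prosody level, which is in line with the statement that the positive-content S2 was perceived as sarcastic in the negative context.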

Fig. 1. (a) In a norming study, the mean scores of sarcasm judgments (5 = sarcasm; 1 = literal, i.e., not sarcasm). In an fMRI study, (b) the mean scores of sarcastic responses (7 = sarcasm; 1 = literal, i.e., not sarcasm) and (b') the increment of sarcasm ratings from monotone prosody. Error bars indicate the standard error of the mean. Conditions (see Table 1 for details) with a positive context are indicated by green, and those with negative context by orange. The main effect of context, prosody, and their interaction were significant. Abbreviations: P, positive; N, negative; m, monotone. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

fMRI procedures
Before the fMRI session, the participants were given detailed instructions on the task procedure. To familiarize participants with the task, they were provided with examples of stimuli that did not appear during the fMRI session. All task stimuli were prepared and presented using Presentation® 19.0 software (Neurobehavioral Systems, Albany, CA) running on a personal computer (dc7900; Hewlett-Packard Inc., Palo Alto, CA). Using a liquid crystal display projector (CP-SX12000J; Hitachi Ltd., Tokyo, Japan), the visual stimuli were projected onto a half-transparent viewing screen located behind the head coil of the MRI scanner. The participants viewed the stimuli using a mirror attached to the head coil. The projector had a spatial resolution of 1024 × 768 pixels and a 60-Hz refresh rate. The distance between the screen and the participant's eyes was approximately 175 cm, and the visual angle was 13.8° (horizontal) × 10.4° (vertical). Auditory stimuli were presented via MR-compatible headphones (Kiyohara-Kougaku, Tokyo, Japan).
Participants were instructed to play the role of the protagonist in a scenario, in which the protagonist had a short conversation with another person (colleagues, etc.). Within the scenario, the participant was the person who was evaluated (positively or negatively) by the utterance. In other words, there was no process of empathy for a third person who was evaluated by the sarcastic utterance; as a result, the experimental situation was more suitable for depicting the neural substrates of sarcasm. Each of the four phases of each trial was set to correspond to the length of the presented sentences (the first phase, 4.5 s [range: 3.5-5.3 s]; the second phase, 4.6 s [range: 3.0-6.6 s]; the third phase, 2.4 s [range: 2.0-3.7 s]; and the fourth phase (S2), 2.7 s [range: 1.7-3.8 s]), followed by presentation of a fixation cross on a black screen (visual angle, 0.6° × 0.6°) for 0.9 s. Thereafter, when determining the degree of sarcasm, the instruction was to integrate the contextual effect and the prosodic effect. That is, the participant was required to judge how sarcastic the S2 utterance sounded, and respond on a seven-point scale (7 = sarcasm; 1 = literal, i.e., not sarcastic) that appeared on the screen for 4 s (judgment phase), as quickly as possible, by pressing a button with their right index or middle finger. After the participant responded, a fixation cross was again shown on the screen for 4 s (range: 1.5-6.5 s).
An event-related design was used in this study. Each run consisted of the first 25 s of a fixation cross, followed by 24 task trials, and the final 15 s of a fixation cross. The 72 task trials (12 scenarios × two discourse contexts × three affective prosodies [positive, negative, and monotone]) were presented in a pseudo-random order; that is, all context-prosody combinations were presented. The experiment consisted of three runs, each consisting of 24 experimental trials. The presentation order of the three runs was counterbalanced across participants.
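The trial structure described above (12 scenarios × 2 contexts × 3 prosodies, split into three runs of 24 trials) can be sketched as follows. This is only an illustration: the text does not specify how the pseudo-random order was generated or how trials were assigned to runs, so the seeded shuffle and the equal split below are assumptions, as are all function and variable names.

```python
import random
from itertools import product

CONTEXTS = ["P", "N"]        # discourse context: positive / negative
CONTENT = "P"                # utterance content was always positive
PROSODIES = ["P", "N", "m"]  # prosody: positive / negative / monotone
N_SCENARIOS = 12
N_RUNS = 3

def condition_label(context, prosody):
    """Build a context-content-prosody label such as 'NPm'."""
    return f"{context}{CONTENT}{prosody}"

def build_trials(seed=0):
    """Cross scenarios with the 2 x 3 design, then shuffle (assumed scheme)."""
    trials = [
        {"scenario": s, "condition": condition_label(c, p)}
        for s, c, p in product(range(1, N_SCENARIOS + 1), CONTEXTS, PROSODIES)
    ]
    random.Random(seed).shuffle(trials)
    return trials

def split_into_runs(trials, n_runs=N_RUNS):
    """Cut the shuffled list into equally sized runs."""
    per_run = len(trials) // n_runs
    return [trials[i * per_run:(i + 1) * per_run] for i in range(n_runs)]

trials = build_trials()
runs = split_into_runs(trials)
print(len(trials), [len(r) for r in runs])
```

A real pseudo-randomization would likely add constraints (for example, limiting repeats of the same scenario within a run); those are left out because the text does not describe them.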
All images were acquired using a 3-T MR scanner (Verio®; Siemens, Erlangen, Germany). An ascending T2*-weighted gradient-echo echo-planar imaging (EPI) procedure was used to produce 66 continuous trans-axial slices that covered the entire cerebrum and the cerebellum (repetition time [TR], 1000 ms; echo time [TE], 30.8 ms; flip angle, 55°; field-of-view [FoV], 216 mm; 216 × 216 matrix; voxel dimensions, 2.4 × 2.4 mm in plane; 2.0-mm slice thickness with a 0.4-mm gap). Six slices were acquired simultaneously using a multi-band sequence (Moeller et al., 2010) to maximize the acquisition speed. Oblique scanning was used to exclude the eyeballs from the images. Each run consisted of a continuous series of 640 volume acquisitions, resulting in a total acquisition duration of 10 min 40 s. For anatomical imaging, T1-weighted magnetization-prepared rapid-acquisition gradient echo anatomical images were also obtained (TR, 2400 ms; TE, 2.24 ms; flip angle, 8°; FoV, 256 mm; 1 slab; number of slices per slab, 224; voxel dimensions, 0.8 × 0.8 × 0.8 mm) for each participant.
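The acquisition figures above can be cross-checked with simple arithmetic. The sketch below recomputes the run duration from the number of volumes and the TR, the number of excitations per TR implied by the multi-band factor, and the slab coverage. It also gives a rough estimate of the task timeline per run, which assumes that every phase lasts its reported mean duration; actual trials varied within the stated ranges, so that estimate is approximate. All names are illustrative.

```python
# Values copied from the text.
TR_S = 1.0
VOLUMES_PER_RUN = 640
N_SLICES = 66
MULTIBAND_FACTOR = 6
SLICE_THICKNESS_MM = 2.0
SLICE_GAP_MM = 0.4

# Run duration: 640 volumes at 1 s each.
run_seconds = VOLUMES_PER_RUN * TR_S
minutes, seconds = divmod(int(run_seconds), 60)

# With six slices excited at once, 66 slices need 11 excitations per TR.
excitations_per_tr = N_SLICES // MULTIBAND_FACTOR

# Coverage along the slice axis: 66 slices plus the 65 gaps between them.
slab_coverage_mm = N_SLICES * SLICE_THICKNESS_MM + (N_SLICES - 1) * SLICE_GAP_MM

# Rough task timeline (assumption: each phase takes its mean duration).
mean_trial_s = 4.5 + 4.6 + 2.4 + 2.7 + 0.9 + 4.0 + 4.0
estimated_task_seconds = 25.0 + 24 * mean_trial_s + 15.0

print(f"run duration: {minutes} min {seconds} s")
print(f"excitations per TR: {excitations_per_tr}")
print(f"slab coverage: {slab_coverage_mm:.1f} mm")
print(f"estimated task time: {estimated_task_seconds:.1f} s")
```

Under the mean-duration assumption, the estimated task time fits within the 640-s acquisition window with some margin.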

Data analysis

Behavioral performance
A two-way ANOVA with two within-subject factors, namely discourse context and affective prosody, with a Bonferroni's correction for multiple comparisons, was conducted on the seven-point scale of how sarcastic the utterance in S2 sounded. The analysis was performed using SPSS® version 26.0 software (IBM, Armonk, NY).

Imaging data
Imaging data were preprocessed as follows. The 640 EPI volumes per run (a total of 1920 EPI volumes per participant) were analyzed using Statistical Parametric Mapping 12 (SPM12; Wellcome Department of Imaging Neuroscience, London, UK; Friston et al., 2007) implemented in MATLAB® (MathWorks, Natick, MA). The EPI volumes were spatially realigned to correct for head motion. Next, a mean EPI image was coregistered with the T1-weighted anatomical image, and the coregistration parameters were then applied to all EPI images. The anatomical image was spatially normalized to the Montréal Neurological Institute (MNI) T1 template using a segmentation-normalization method. The normalization parameters of the T1-weighted anatomical image were applied to all EPI volumes, which were then spatially smoothed in three dimensions using an 8 mm full-width half-maximum Gaussian kernel.
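For readers less familiar with smoothing conventions, the 8 mm full-width at half-maximum (FWHM) can be converted to the standard deviation of the Gaussian kernel with the standard relation sigma = FWHM / (2 * sqrt(2 * ln 2)). The short sketch below is illustrative only and is not part of the SPM12 pipeline; the function names are assumptions.

```python
import math

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_profile(x_mm, sigma_mm):
    """Unnormalized Gaussian profile with a peak value of 1 at x = 0."""
    return math.exp(-(x_mm ** 2) / (2.0 * sigma_mm ** 2))

FWHM_MM = 8.0
sigma_mm = fwhm_to_sigma(FWHM_MM)

# By definition, the profile falls to half its peak at +/- FWHM / 2.
half_height = gaussian_profile(FWHM_MM / 2.0, sigma_mm)

print(f"sigma = {sigma_mm:.3f} mm, value at 4 mm = {half_height:.3f}")
```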
After preprocessing, the EPI data obtained for each participant were analyzed using a general linear model. The S2 of the six auditory conditions (PPP, PPN, PPm, NPP, NPN, and NPm) were separately modeled by convolution with a hemodynamic response function. S1 was also modeled as a regressor of no interest convolved with a hemodynamic response function. Additionally, button responses were modeled as independent regressors using a convolved delta function. We applied highpass filters (128 s) to the time-series data. An autoregressive model was used to estimate the temporal autocorrelation. Six regressors for the head movement parameters obtained in the realignment process were entered into the model.
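To illustrate what "modeled by convolution with a hemodynamic response function" means, the sketch below builds one regressor by convolving a boxcar with a simplified double-gamma response. It is not SPM12's implementation: the shape parameters (peak near 6 s, undershoot near 16 s, ratio 6) follow commonly cited defaults and are assumptions here, and the onsets, the number of scans, and all names are hypothetical. Only the 2.7-s event duration (the mean S2 length) and the 1-s TR come from the text.

```python
import math

def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=6.0):
    """Simplified double-gamma response at time t (seconds after onset)."""
    if t <= 0:
        return 0.0
    peak = t ** (peak_shape - 1) * math.exp(-t) / math.gamma(peak_shape)
    under = t ** (undershoot_shape - 1) * math.exp(-t) / math.gamma(undershoot_shape)
    return peak - under / ratio

def make_regressor(onsets, durations, n_scans, tr=1.0, dt=0.1, hrf_length=32.0):
    """Convolve an event boxcar with the HRF on a fine grid, then sample once per TR."""
    n_fine = int(round(n_scans * tr / dt))
    boxcar = [0.0] * n_fine
    for onset, dur in zip(onsets, durations):
        start = int(round(onset / dt))
        stop = int(round((onset + dur) / dt))
        for i in range(start, min(stop, n_fine)):
            boxcar[i] = 1.0
    hrf = [double_gamma_hrf(k * dt) for k in range(int(round(hrf_length / dt)))]
    convolved = [0.0] * n_fine
    for i, b in enumerate(boxcar):
        if b:
            for k, h in enumerate(hrf):
                if i + k < n_fine:
                    convolved[i + k] += b * h * dt
    step = int(round(tr / dt))
    return convolved[::step][:n_scans]

# Hypothetical onsets (s) for two S2 events of one condition.
regressor = make_regressor(onsets=[40.0, 110.0], durations=[2.7, 2.7], n_scans=160)
peak_scan = max(range(40, 60), key=lambda i: regressor[i])
print(peak_scan, round(regressor[peak_scan], 3))
```

In the actual analysis, one such regressor per condition (plus the S1, button-press, and motion regressors) would form the columns of the design matrix.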
The contrast images, which consisted of the weighted sum of parameter estimates and which represented the normalized task-related increment of the MR signal obtained in the individual analyses, were then subjected to group analysis. In total, data from 22 participants and six contrasts (PPP, PPN, PPm, NPP, NPN, and NPm) were incorporated into the 2 (discourse context) × 3 (affective prosody) within-subject factorial design (Friston et al., 2007). Using the flexible factorial design model (Friston et al., 2007), the subject factor was set as independent to take different individuals into account. The error variance was set to be equal across participants because they were sampled from the same underlying population. In contrast, the two condition factors were set as dependent because the different factor levels were correlated within the subject, with equal error variances because they were taken from the same subject. Given our hypotheses, we evaluated predefined contrasts using t-tests (Table 1). The resulting set of voxel values for each contrast constituted a statistical parametric map (SPM) of the t statistic, which was transformed into normal distribution units. The height threshold was set at an uncorrected p < 0.001, combined with p < 0.05 with a family-wise error (FWE) correction at the cluster level for the entire brain (Friston et al., 1996). However, because the amygdala is a small region with sharp boundaries, we also used, on an exploratory basis, a more conservative FWE correction at the voxel level of p < 0.05 to examine the amygdala alone (Friston et al., 1996). The activated area was determined by MarsBaR version 0.44 software (Brett et al., 2002) along with MarsBaR AAL ROIs version 0.2 software (Tzourio-Mazoyer et al., 2002) (http://marsbar.sourceforge.net); and the locations were checked using the Atlas of the Human Brain (Mai et al., 2015).

Behavioral performance
As expected, the partner's comment for a negative event (NPP, NPN, and NPm conditions) was interpreted as sarcasm, whereas the partner's comment for a positive event (PPP, PPN, and PPm conditions) was interpreted as literal. In other words, the mean scores of sarcastic responses (7 = sarcasm; 1 = literal, i.e., not sarcasm) were 1.37 for the PPP condition, 3.91 for the PPN condition, 2.44 for the PPm condition, 5.40 for the NPP condition, 6.82 for the NPN condition, and 5.48 for the NPm condition (Fig. 1(b)). A two-way ANOVA of discourse context (positive and negative) and affective prosody (positive, negative, and monotone) conducted on the sarcasm judgments revealed a significant main effect of discourse context (i.e., the effect of context-content incongruity) (F (1, 21) = 404.153, MSE = 0.903, p < 0.001), a significant main effect of affective prosody (i.e., the effect of content-prosody incongruity) (F (2, 42) = 132.089, MSE = 0.345, p < 0.001), and a significant interaction between these two factors (i.e., the effect of context-prosody incongruity) (F (2, 42) = 17.938, MSE = 0.231, p < 0.001). The nature of this interaction was such that when a discourse context involved a positive event, an utterance with negative prosody was judged as significantly more sarcastic than one with monotone prosody (Bonferroni's, p < 0.001), which was more sarcastic than one with positive prosody (Bonferroni's, p < 0.001) (F (2, 84) = 124.093, MSE = 0.288, p < 0.001). However, for a discourse context involving a negative event, an utterance accompanied by negative prosody was judged as significantly more sarcastic than an utterance with monotone or positive prosody (Bonferroni's correction for multiple comparisons, p < 0.001; there was no significant difference between monotone and positive prosody) (F (2, 84) = 48.542, MSE = 0.288, p < 0.001).
To reveal how prosody modifies the context-content incongruity effect, we calculated the increment of sarcasm ratings from monotone prosody (Fig. 1(b')). The increment was −1.07 for [PPP − PPm], 1.47 for [PPN − PPm], −0.08 for [NPP − NPm], and 1.34 for [NPN − NPm]. A two-way ANOVA of discourse context and affective prosody showed that the main effect of discourse context (F (1, 21) = 15.386, MSE = 0.267, p < 0.001), affective prosody (F (1, 21) = 141.947, MSE = 0.606, p < 0.001), and their interaction (F (1, 21) = 18.546, MSE = 0.373, p < 0.001) were significant. The nature of this interaction was such that an utterance with negative prosody had a significantly larger increment of sarcasm ratings than an utterance with positive prosody regardless of the discourse context (F (1, 42) = 144.775, MSE = 0.489, p < 0.001 for a positive context; and F (1, 42) = 45.112, MSE = 0.489, p < 0.001 for a negative context). In contrast, when an utterance involved positive prosody, the degree of increment of sarcasm ratings was negative for both discourse contexts, and a discourse context involving a negative event showed a significantly smaller decrease than a discourse context involving a positive event (F (1, 42) = 33.886, MSE = 0.320, p < 0.001). However, when an utterance was accompanied by negative prosody, the degree of increment was positive for both discourse contexts and there was no significant difference between a discourse context involving a negative event and one involving a positive event (F (1, 42) = 0.571, MSE = 0.320, not significant). In other words, when the context was positive (congruent with the content of utterance), positive prosody lessened the sarcasm rating, whereas negative prosody enhanced it. When the context was negative, the positive prosody effect disappeared, while negative prosody increased the sarcasm rating. 
Thus, context-content incongruity is the primary determinant of sarcasm comprehension, and is modified by prosody in a context-dependent manner.
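The increments from monotone prosody can be recomputed directly from the six mean ratings reported for the fMRI session. The sketch below does this from the group means only; it is a consistency check rather than a reanalysis, and the names are illustrative assumptions.

```python
# Mean sarcasm ratings in the fMRI session (seven-point scale), from the text.
mean_rating = {
    "PPP": 1.37, "PPN": 3.91, "PPm": 2.44,
    "NPP": 5.40, "NPN": 6.82, "NPm": 5.48,
}

def increment(condition, means=mean_rating):
    """Rating change relative to the monotone condition in the same context."""
    baseline = condition[:2] + "m"  # same context and content, monotone prosody
    return round(means[condition] - means[baseline], 2)

increments = {c: increment(c) for c in ("PPP", "PPN", "NPP", "NPN")}

# Effect of positive prosody in each context (how much it lowers the rating).
positive_prosody_effect = {
    "positive context": increments["PPP"],
    "negative context": increments["NPP"],
}

print(increments)
print(positive_prosody_effect)
```

The recomputed values match those reported: positive prosody lowers the rating by about one point in the positive context but by less than a tenth of a point in the negative context, whereas negative prosody raises the rating by a similar amount in both contexts.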

Discussion
The present study investigated the neural mechanisms underlying sarcasm comprehension during conversation, and found that it required two distinct incongruity detection systems: the mentalizing network and the relevance network, which are integrated in the region implicated in the semantic processes, including the prefrontal cortices, through the salience network.

Behavioral results
When the context was positive (congruent with the content of the utterance), positive prosody lessened the sarcasm rating, whereas negative prosody enhanced this rating. When the context was negative, the positive prosody effect disappeared, while negative prosody increased the sarcasm rating. Thus, context-content incongruity is the primary determinant of sarcasm comprehension and is modified by prosody in a context-dependent manner. This finding reflects the fact that the context precedes the utterance. That is, according to the predictive coding schema (Friston, 2005), the context provokes the generative model that predicts the upcoming event, and if there is incongruity between the prediction and the event, it results in prediction error. Thus, the context-induced generative model is the major determinant of the sarcasm rating, which is understood as the prediction error. The participants judged the prediction error, the deviation from the literal, as the degree of sarcasm. For example, PPN items provided a positive context that makes listeners predict a positive attitude of the speaker. Thus, negative prosody generates prediction error, reflected in the increased sarcasm rating compared with PPm or PPP, both of which are concordant with the prediction of a positive attitude of the speaker. Sarcasm mediates implicit criticism of the listener by provoking negative emotions (Sperber and Wilson, 1995). As negative prosody is the most direct means to display a critical cue of a negative attitude of the speaker, the increased sarcasm rating should be interpreted as discrepancy between the expected (positive) and actual (negative) attitudes of the speaker, that is, sarcasm.

Context-content incongruity
We confirmed the context-content incongruity effect, representing what is uttered in a particular context, in the mentalizing network (Frith and Frith, 2003) including the arMPFC, TP, and cerebellum. This finding is consistent with those of previous studies showing the importance of context in ironic understanding (Ackerman, 1983; Colston, 2002; Ivanko and Pexman, 2003; Katz and Lee, 1993; Katz and Pexman, 1997; Kreuz and Glucksberg, 1989). Activation of the MPFC and TP is consistent with the findings of previous neuroimaging studies of sarcasm comprehension during a reading task (Uchiyama et al., 2006, 2012) and when using the auditory modality (Herold et al., 2018; Varga et al., 2013; Wang et al., 2006a). Activation of the mentalizing network during sarcasm detection represents the prediction error, that is, the incongruity between the prediction of the meaning of the utterance of the speaker that was pragmatically relevant to the context, and the actual utterance. Cerebellar activation is consistent with the previous review study showing that the Crus I and Crus II in the cerebellum are involved in social cognition, particularly in mentalizing processing related to emotional cognition in the Crus II (Van Overwalle et al., 2020).

Content-prosody incongruity
We also confirmed that content-prosody incongruity in sarcasm comprehension is related to activation in the bilateral amygdala. This finding is consistent with those of previous studies showing the importance of prosody in ironic understanding (Bryant and Fox Tree, 2005; Caillies et al., 2019; Capelli et al., 1990; Laval and Bert-Erboul, 2005; Le Gall and Iakimova, 2018; Tobe et al., 2016; Wickens and Perry, 2015). It is also consistent with those of previous neuroimaging studies reporting the involvement of the amygdala in sarcasm-specific processes (Akimoto et al., 2014; Uchiyama et al., 2012) and affective sound processing in the detection of emotional and social valence (Frühholz et al., 2016). Furthermore, Nakamura et al. (2018) found that humor comprehension activated the left amygdala. Sarcasm comprehension is similar to humor comprehension in that it consists of both cognitive and emotional components of incongruity resolution. Given that the amygdala is involved in relevance detection (Ousdal et al., 2008; Sander et al., 2003), it is conceivable that the amygdala is involved in the content-prosody incongruity process.

Incongruity interaction
We found an interaction between context-content incongruity and content-prosody incongruity in the salience network, which extended to the IFG and DLPFC, consistent with our pre-experiment hypothesis that the regions implicated in semantic processes would be involved in integrating the main effects of incongruities through the salience network. This finding is consistent with those of previous studies showing the importance of prosody in ironic understanding (Rivière et al., 2018; Voyer et al., 2016; Woodland and Voyer, 2011).
Activation of the prefrontal regions was also consistent with the results of previous studies. The IFG is the main neural basis of the affective prosody network (Frühholz et al., 2016) and has also been described as a unification space (Hagoort, 2005). Uchiyama et al. (2006) concluded that the IFG, particularly BA 47, is the site where integration of semantic and mentalizing processes occurs during sarcasm detection. We also found activation in the DLPFC, which has been implicated in sarcasm comprehension (Bosco et al., 2017; Spotorno et al., 2012; Varga et al., 2013) and has been suggested to be involved in non-literal language processing (Rapp et al., 2012) and in the executive functioning that integrates information related to sarcasm comprehension (Bosco et al., 2017).
The AI is critical for empathy (Carr et al., 2013). A meta-analysis of emotion activation studies (Phan et al., 2002) showed that the ACC and AI, the core nodes of the salience network, are involved in emotional recall or imagery of personally relevant events and emotional tasks that exert a cognitive demand. The amygdala and AI are co-activated for interpreting emotional facial expressions (Phan et al., 2002). For empathy to arise through processing of emotional facial expression, the AI is critical in the communication between the action representation network and the limbic areas, including the amygdala (Carr et al., 2013). Thus, the salience network is involved in emotional comprehension by linking with the relevance detection system.
Compared with a previous study by Matsui et al. (2016) that showed the content-prosody incongruity effect, we used a complex context, which might enhance the effect of interaction between context-content incongruity and content-prosody incongruity in the salience network. Another factor was that we used the second-person perspective in the experimental setup: sarcastic comments were directed at the participants themselves rather than at a character in a scenario, thus making the situation more self-related for sarcasm comprehension. The salience network, including the AI and ACC, is known to be involved in self-related processing, such as self-face processing. Individuals can experience embarrassment when exposed to self-feedback images, depending on the extent of divergence from the internal representation of the standard self. By utilizing the fact that embarrassment is enhanced by observation by others, Morita et al. (2014) showed that the AI and ACC are involved in emotional processing for self-face recognition. More specifically, they found enhanced functional connectivity of the ACC with the dorsal and ventral parts of the MPFC, and the left lateral prefrontal cortex, including the middle frontal gyrus, IFG, and AI, when viewing self-face images while being observed than when doing so without observation. They argued that this enhanced connectivity of the ACC with the MPFC and the left prefrontal cortex represented processing of the reflective self or narrative self (Gallagher, 2000). In contrast, the right AI appears to be involved in creating the subjective experience of embarrassment, probably through the comparison with the "standard self" (Morita et al., 2014). This notion is consistent with a "predictive coding" framework (Seth, 2013; Seth and Friston, 2016), which postulates that the prediction error caused by comparing the prior model with the perception of the visually presented self-face evokes emotion. In a similar vein, during sarcasm comprehension in the present experiment, the salience network was involved in self-relevant, self-evaluative emotional processing triggered by the sarcastic utterance directed toward the participant.
[Figure caption fragment: ... Table 1(b)) in sarcasm comprehension, superimposed on the coronal (y = − 4 mm; middle) section of the standard MNI template. Bar graphs of the task-related activation in the left (MNI coordinates [-14 -4 -16]; left) and right ([20 -4 -10]; right) amygdala are plotted in the same format as in Fig. 2, except for monotone prosody indicated by gray.]

Two distinct incongruity detection systems integrated by salience network
These results suggest that sarcasm is perceived as a multi-layered incongruity among the context, content, and prosody of the utterance (Utsumi, 2000). Our findings indicate that there are two distinct incongruity detection systems and an integration system, corresponding to the implicit display theory of sarcasm comprehension (Utsumi, 2000). One is the mentalizing system (Frith and Frith, 2003; Spotorno et al., 2012; Uchiyama et al., 2006, 2012), involving the cerebellum, for context-content incongruity (i.e., what is uttered in a particular context), which corresponds to a pragmatic insincerity by intentionally violating one of the pragmatic principles (Utsumi, 2000). The other is the relevance detector system (Nakamura et al., 2018; Ousdal et al., 2008; Sander et al., 2003) for content-prosody incongruity (i.e., how it is spoken), which corresponds to an indirect expression of the negative emotion toward the failure to meet the speaker's expectation (Utsumi, 2000). An interaction effect of these systems was found in the prefrontal regions and the salience network (Menon and Uddin, 2010; Sridharan et al., 2008), including the AI and ACC. This finding indicates that, for sarcasm comprehension, different incongruity sources are integrated in the prefrontal cortices through the salience network.

Clinical implication
It is well known that individuals with Autism Spectrum Disorder (ASD) have difficulty in appreciating irony (Happé, 1993, 1994; Kaland et al., 2002, 2005; Leekam and Prior, 1994; Martin and McDonald, 2004; Tantam, 1991). This impairment could be related to deficits in using both prosodic and contextual information to make inferences about a speaker's communicative intent (Wang et al., 2006b). Using fMRI, Wang et al. (2006b) showed that children with ASD had significantly greater activity than the control group in the right IFG and bilateral temporal regions, suggesting "more effortful processing needed to interpret the intended meaning of an utterance." In contrast, a meta-analysis of functional MRI studies of individuals with ASD showed hypoactivation of the AI during social tasks (Di Martino et al., 2009). Given that the AI is positioned as "a hub mediating interactions between large-scale networks involved in externally and internally oriented cognitive processing," Uddin and Menon (2009) concluded that dysfunctional AI connectivity plays an important role in ASD. Consistent with these previous studies, the present study revealed that the AI, along with the lateral prefrontal cortices, is related to integrating prosodic and contextual information. To elucidate the pathogenesis of ASD more directly, a future fMRI study of individuals with ASD using the present verbal sarcasm comprehension task is warranted.

Study limitations
This study had some limitations. For example, to maintain the total duration of the fMRI experiment within a reasonable interval, we restricted our experimental stimuli to those with moderate difficulty. Future research should investigate the neural bases of incongruity detection in sarcasm comprehension in the auditory modality using stimuli of multiple difficulty levels. For the same reason, we also restricted our target utterances to those with a positive semantic valence, so participants may have been able to predict during the context phase whether the utterance would be sarcastic. Thus, future research should include utterances with negative semantic valence. We did not conduct connectivity analysis among several networks such as the relevance network, affective prosodic network, mentalizing network, the regions implicated in semantic processes, salience network, executive network, and default mode network. A role of the salience network in connecting the executive network and the default mode network has been proposed (Menon and Uddin, 2010). The detailed interaction among these networks warrants future study. As shown in Table 2, two thresholds were used together to show the results of the imaging data. Since the amygdala is a relatively small neural substrate, it may have been difficult for the cluster size to grow. This may be related to the number of participants in this study (22). Some other potentially interesting questions regarding incongruity detection in sarcasm comprehension were excluded from the current study. For instance, a theoretical study suggested that a hearer detects incongruity not only by affective prosody (paralinguistic cues), but also by verbal cues, such as hyperbole and interjection, and speech that expresses counterfactual pleased affect, as well as by nonverbal cues, such as facial expression and behavioral cues (Utsumi, 2000). Future research should investigate these neural mechanisms using visual and auditory stimuli.

Conclusions
This study revealed that two distinct incongruity detection systems for sarcasm comprehension are integrated in the prefrontal cortices through the salience network.

Data availability
The datasets generated in this study are available from the corresponding author on reasonable request.

Declaration of competing interest
No potential conflict of interest was reported by the authors.