Reality monitoring and metacognitive judgments in a false-memory paradigm

How well do we distinguish between different memory sources when information from imagination and perception is similar? And how do metacognitive (confidence) judgments differ across different sources of experience? To study these questions, we developed a reality monitoring task using semantically related words from the Deese-Roediger-McDermott (DRM) paradigm of false memories. In an orientation phase, participants either perceived word pairs or voluntarily imagined the second word of a word pair. In a test phase, participants viewed words and judged whether the paired word was previously perceived, imagined, or new. Results revealed an interaction between memory source and judgment type on both response rates and confidence judgments: reality monitoring was better for new and perceived (compared to imagined) sources, and participants often incorrectly reported imagined experiences as perceived. Participants were similarly confident when correctly judging imagined sources as imagined and when incorrectly judging them as perceived. Modeling results indicated that the observed judgments were likely due to an externalizing bias (i.e., a bias to judge the memory source as perceived). Additionally, we found that overall metacognitive ability was best for the perceived source. Together, these results reveal a source-dependent effect on response rates and confidence ratings, and provide evidence that observers are surprisingly prone to externalizing biases when monitoring their own memories.

Human experiences originate from two possible sources: the surrounding sensory world and our own imaginations. Both types of experiences result in the formation of memories, which correspond to externally generated memories (from perception) or internally generated memories (from imagination). Our cognitive ability to distinguish between the sources of these two memory types is called reality monitoring (Johnson and Raye, 1981; Dijkstra et al., 2022; Lau, 2019; Simons et al., 2017). Many questions about reality monitoring currently remain unanswered. For example, how is our reality monitoring of memories influenced when stimuli are closely related? And how is our sense of confidence in memories influenced in such scenarios? In this investigation, we pursue answers to these two questions.
The reality monitoring of memories has been studied from a variety of perspectives, including investigations of how observers distinguish between heard and imagined words (Sugimori et al., 2014), how sensory modalities (visual or auditory) and actions (thinking or speaking) modulate reality monitoring (Garrison et al., 2017), the influence of response delays on task performance (Johnson et al., 1994), how physically or conceptually familiar and similar objects influence source judgments (Henkel and Franklin, 1998), and how emotions impact reality monitoring (Kensinger and Schacter, 2006b; Wang, 2018). Reality monitoring has also been studied in neuro-atypical individuals with OCD and schizophrenia, who have shown lower accuracy in reality monitoring tasks (Cougle et al., 2008; Hermans et al., 2003; McNally and Kohlbeck, 1993; Moritz et al., 2003; Vinogradov et al., 2008). A common paradigm involves linguistic material, with participants required to fill in sentences or word pairs to assess imagination, and to view sentences or word pairs to assess perception (Johnson et al., 1981; Simons et al., 2008; Subramaniam et al., 2018; Vinogradov et al., 2008); in a final test phase, participants are often asked to identify which stimuli were imagined and which were perceived (Fig. 1A). Apart from behavioral studies, functional and structural neuroimaging studies have been conducted, revealing frontal cortical areas to be involved in reality monitoring (Buda et al., 2011; Kensinger and Schacter, 2006a; Simons et al., 2008, 2017; Sugimori et al., 2014).
Many studies provide evidence that our memories are fragile and that false memories are commonplace (Huff et al., 2015; Roediger and McDermott, 1995; Pardilla-Delgado and Payne, 2017; Stadler et al., 1999). False memories occur when we recall events with distorted information, or with details that did not actually occur. For example, you might recall paying a credit card bill, but later discover that you did not. False memories have been interpreted as consequences of a failure of reality monitoring (Johnson et al., 2012; Schnider, 2001). A classic false memory paradigm is the DRM (Deese-Roediger-McDermott) paradigm (Huff et al., 2015; Roediger and McDermott, 1995; Pardilla-Delgado and Payne, 2017; Stadler et al., 1999). In previous iterations of this paradigm, researchers asked participants to study several lists of words. Each list contained a critical (or lure) word, which was semantically related to the other words in the list but was not shown during the study phase; instead, it was only presented in the test/recall phase of the experiment. Participants often misattributed lures as "old," i.e., as having been present in the studied list (Huff et al., 2015; Roediger and McDermott, 1995; Pardilla-Delgado and Payne, 2017; Stadler et al., 1999). Typically, each list in a DRM paradigm is formed of similar words, or words with a high frequency of co-occurrence with one another and with the critical word, which confuses participants and results in new words being misattributed as old (Huff et al., 2015; Meade et al., 2007).
Beyond memory, reality monitoring has also been studied from a perceptual perspective, called "perceptual reality monitoring" (Dijkstra et al., 2021, 2022, 2023; Lau, 2019; Perky, 1910). In studies of perceptual reality monitoring, participants classify whether current experiences are perceived or imagined. A famous example is the Perky effect (Perky, 1910; Reeves and Lemley, 2012), in which participants are asked to imagine an object (such as a banana) while, at the same time, a subthreshold image of the same object (such as a vertical banana) is displayed on a screen. In many instances, participants report that they imagined a specific object (i.e., the vertical banana) but did not actually perceive it on the screen, thus (incorrectly) attributing the source of the experience entirely to their own imagination. These results demonstrate that reality monitoring is an imperfect cognitive ability, as we can misattribute our experiences arising from either perception or memory (Dijkstra et al., 2021; Johnson, 2006; Lindsay and Johnson, 1991; Perky, 1910).
In the current study, for the first time (to the best of our knowledge), we combined DRM lists of stimuli with reality-monitoring judgments of memories to probe how semantically related lists of words influence source-related judgments of memories. Our task involved three phases: an orientation phase (where participants perceived and imagined specific word pairs and rated their relatedness), a practice phase (to train participants how to respond in the test phase), and a test phase (where participants made judgments about whether words from the orientation phase were perceived, imagined, or new). Our groups of stimuli were based on words from the thirty-six lists provided by Stadler et al. (1999). We were interested in three questions: (1) What is the judgment accuracy for each source (perceived, imagined, new) in the test phase? (2) For each source type, are some types of judgments more common than others? (3) Finally, how does confidence differ across the three types of memory sources?
Our study is somewhat related to previous work by Henkel and Franklin (1998). In their study, they manipulated the physical and conceptual similarities of objects across "perceived" and "imagined" source conditions during the orientation phase. In the perceived source condition, an image of an object was shown below the written name of the object. In the imagined source condition, only the name of the object was shown, and participants were required to imagine the corresponding object. The order of these perceived and imagined source conditions was counterbalanced across participants. Results showed that imagined objects were judged to be perceived more often when participants saw perceptually or conceptually related objects. Further, they also found that a "new" stimulus was more likely to be classified as "perceived" than as "imagined." These findings, especially the likelihood of "imagined" experiences being classified as "perceived," show the remarkable presence of confabulation in healthy participants. Although Henkel and Franklin (1998) demonstrated higher misattribution for imagined sources by using similar stimuli across imagined and perceived source conditions, they did not ask their participants about their confidence in these judgments. Henkel and Franklin manipulated similarity across imagined and perceived sources; here, our study incorporates similar stimuli by drawing upon lists from Stadler et al. (1999), and asks participants to report their confidence in their source judgments.

Fig. 1. Conceptualization of reality monitoring as a memory task using multinomial processing trees. (A) In this task, the participant either observes a pair of words in the "perceived" source condition (e.g., jump and cliff) or imagines the second word of an incomplete word pair in the "imagined" source condition (e.g., warm and ice, where ice is imagined). During the reality monitoring test, the first word is shown, and the participant has to judge whether the second (paired) word was perceived, imagined, or new. (B) The proposed model for reality monitoring using multinomial processing trees by Batchelder and Riefer (1990). In this model, D_P = detectability of a perceived (P) test stimulus as old; D_I = detectability of an imagined (I) test stimulus as old; d_P = the source discriminability for source P to be judged P after the test stimulus is detected as old; d_I = the source discriminability for source I to be judged I after the test stimulus is detected as old; b = bias for responding old when the item was an undetected stimulus (otherwise it is judged new, N); a = probability of guessing that a detected but non-discriminated stimulus is P; g = probability of guessing that a non-detected stimulus is P.
Other related work comes from Johnson et al. (1981). In their imagination source condition, participants were cued to imagine items within a category using either high-frequency (e.g., fruit-apple, sport-football) or low-frequency exemplars (e.g., flower-lily, animal-pig), or the opposite of a category (e.g., hot-cold). In their perception source condition, word pairs were presented to the participants, consisting of either two high-frequency words from a category, two low-frequency words, or opposite words. Participants reported confidence ratings, which were analyzed across high-frequency, low-frequency, and "opposite" cue trials. The authors also reported confidence differences based on Signal Detection Theory (Macmillan and Creelman, 2004), including analyses of confidence for false alarms, hits, and misses. They reported that confidence was higher when the source was correctly identified than when it was incorrectly identified, although this was not the primary focus of their investigation; they were primarily interested in how cognitive operations influence reality monitoring across the above-mentioned categories. In the current work, we investigate confidence at the source level in a reality monitoring task.
Another key distinction between our paradigm and other reality monitoring tasks (Henkel and Franklin, 1998; Johnson et al., 1981; Subramaniam et al., 2018; Vinogradov et al., 2008) is that we did not constrain participants by giving cues or instructions to guide imagination in a specific way, which gave our participants complete volitional control over their imagination. In the above-mentioned studies, by contrast, participants were constrained either by specific sentences to complete during imagination (Subramaniam et al., 2019; Vinogradov et al., 2008) or by specific word pairs that provided cues about the nature of the object/word they were required to imagine (Henkel and Franklin, 1998; Johnson et al., 1981). Additionally, we used similar words across the imagined, perceived, and new source conditions by sampling words from each DRM list (Stadler et al., 1999), which provided a unique way of presenting semantically related information.
Importantly, in our task, every word in the "new" category in the test phase was a "lure" word from a given DRM list, but we use the term "new" in this manuscript to be consistent with the reality monitoring memory literature. Thus, in our experiment, there were always associations between the imagined and perceived sources used in the orientation phase and the "new" words introduced in the test phase. Our experiment aimed to characterize the behavioral profiles of reality monitoring when evaluating perceived, imagined, and new information, and to evaluate confidence for the different types of judgments in each source. By analyzing confidence ratings, we aimed to provide insights into how metacognitive abilities relate to different memory sources in reality monitoring judgments. Overall, our emphasis on metacognition distinguishes our study from the aims of Henkel and Franklin (1998), Johnson et al. (1981), Subramaniam et al. (2019), and Vinogradov et al. (2008), who either did not measure confidence or did not analyze it at the source level, and who were interested in other aspects of recall-related performance. Additionally, our study used approximately three times more stimuli in each source condition than the above studies, testing the limits of participants' reality-monitoring abilities.

Modeling reality monitoring with multinomial processing trees
To understand the cognitive processes underlying our task, we modeled response counts for judgments about each source using multinomial processing trees (MPT; Batchelder and Riefer, 1990, 1999; Erdfelder et al., 2009). MPT has previously been used in reality monitoring studies where each source ("perceived" (P), "imagined" (I), and "new" (N)) is modeled as a tree whose leaf nodes are the responses (Batchelder and Riefer, 1990, 1999; Johnson et al., 1994; Henkel and Franklin, 1998). Each branch of the tree represents a hypothetical step in the process of identifying the true source, which can help to elucidate how the observed responses arise from multiple unobserved processing stages (see Fig. 1B). MPT models of reality monitoring assume a sequential cognitive process: participants first detect stimuli as old (i.e., perceived or imagined) with probabilities D_P, D_I, and b for "perceived," "imagined," and "new" stimuli, respectively. When a new stimulus is (incorrectly) detected as old (with probability b), the observed response is explained by a guessing process parametrized by the probability g: g is the probability that a "new" stimulus detected as old is judged "perceived," and 1 − g is the probability that it is judged "imagined." This determines the new stimulus' assignment to an incorrect source ("perceived" or "imagined"). Discrimination of the true source after a stimulus has been (correctly) detected as old is described by the probability d_P, the probability of discriminating the source as "perceived" after detection as old, and d_I, the probability of discriminating the source as "imagined." In other words, d_P and d_I are conditional on the stimulus being detected as old from the corresponding source (D_P and D_I, respectively). When discrimination of the source fails for a detected old stimulus (with probability 1 − d_x, x = P, I), the judgment proceeds to another guessing process, measured by the parameter a. We use the term "externalizing bias" parameters to refer to parameters a and g.
Further, if the stimulus is not detected as old (with probability 1 − D_x), it is treated as "new" in a nested tree identical to the "new" tree described above. The probability parameters a, g, and b appear in the tree of each source. Importantly, the parameters of the MPT are treated as conditional probabilities. In the source trees for P and I, parameter b is the conditional probability that a stimulus is judged old given that it was not detected as old (1 − D_x). Parameter b also occurs in the N tree as an unconditional probability that a new stimulus is judged old; thus, b plays a similar role in all trees, indicating a bias towards responding old. In such MPT models, the probability of a given judgment/response is the sum, over the branches ending in that response, of the products of the parameters along each branch. For example, the probability of observing a "perceived" judgment when the source is "perceived" is P("P" | P) = D_P d_P + D_P (1 − d_P) a + (1 − D_P) b g. For a more detailed treatment, one can refer to previous literature describing this process (Batchelder and Riefer, 1999; Bird et al., 2009; Erdfelder et al., 2009).
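As a minimal illustration (a sketch, not the code used in this study), the following Python snippet computes the branch probabilities defined above for the perceived-source and new-source trees; the parameter values passed in are arbitrary placeholders, not fitted estimates.

```python
# Minimal sketch of the MPT branch equations described above (illustrative
# only; parameter values below are arbitrary placeholders, not fitted values).

def perceived_tree(D_P, d_P, a, b, g):
    """Response probabilities when the true source is perceived."""
    p_P = D_P * d_P + D_P * (1 - d_P) * a + (1 - D_P) * b * g  # judged "perceived"
    p_I = D_P * (1 - d_P) * (1 - a) + (1 - D_P) * b * (1 - g)  # judged "imagined"
    p_N = (1 - D_P) * (1 - b)                                  # judged "new"
    return {"P": p_P, "I": p_I, "N": p_N}

def new_tree(b, g):
    """Response probabilities when the stimulus is actually new."""
    return {"P": b * g, "I": b * (1 - g), "N": 1 - b}

# The branch probabilities within each tree must sum to 1.
probs = perceived_tree(D_P=0.6, d_P=0.1, a=0.9, b=0.3, g=0.8)
assert abs(sum(probs.values()) - 1.0) < 1e-9
```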
Finally, different hypotheses can be tested by constraining the parameters of the MPT. Johnson et al. (1994) and Henkel and Franklin (1998) used MPT models to investigate cognitive aspects of their tasks. Johnson et al. (1994) assumed D_P = D_I and a = g; that is, they assumed the probability of detecting a stimulus as old is the same irrespective of source (imagined or perceived). Additionally, the guessing rates a and g were set to be equal: a is the probability that a stimulus detected as old (but not discriminated by source) is classified as perceived, and g is the probability of guessing that a non-detected stimulus was perceived. These parameter constraints were chosen a priori by the respective authors; Henkel and Franklin's (1998) model differed from Johnson et al.'s (1994) in that it did not assume D_P = D_I but only a = g. We included both of these versions (along with three others) as candidate models to account for our data.
We explored various models with different constraints, corresponding to different hypotheses about the mechanisms underlying the observed judgments, to find the best fit for our data. The constraints in our five models were as follows: (a) D_P = D_I and a = g; (b) D_P = D_I; (c) d_P = d_I; (d) a = g; and (e) d_P = d_I and a = g. Individual-level parameters were obtained using a latent-trait approach to MPT modeling, a hierarchical Bayesian method for estimating parameters that yields individual-level estimates (Klauer, 2010; Groß and Pachur, 2020). In the past, most MPT models were fit to data pooled across individuals, which yields group-level estimates but fails to provide reliable parameter estimates for individual participants due to an insufficient number of trials per person. Although other hierarchical Bayesian alternatives exist, such as the Beta-MPT approach, Groß and Pachur (2020) showed that the latent-trait approach is more reliable (specifically, for source/reality monitoring). The key difference between the two Bayesian approaches is that the latent-trait approach explicitly models correlations between parameters at the individual level, whereas the Beta-MPT approach does not.
For our five models listed above, we compared goodness of fit using DIC (deviance information criterion) and WAIC (Watanabe-Akaike information criterion) scores. DIC and WAIC are similar to BIC and AIC, which are traditionally used in model selection, but are better suited to Bayesian approaches for selecting between competing models (McElreath, 2020). In our modeling work, we selected the model with the lowest DIC and WAIC scores among the five listed for further analysis. Group-level parameter estimates for the best model were compared with each other using means and 95% credible intervals. The estimated probability of each judgment conditioned on the source was calculated at the individual level and correlated with confidence judgments to assess how well confidence tracks the probability of each judgment in each source.

Using machine learning to evaluate whether relationships between test stimuli and source condition bias task performance
We were also interested in whether sampling the stimuli from DRM lists could create latent clustering within each source that might guide judgments. If so, participants might respond based on a "task bias" introduced by the stimulus design rather than on their experiences. To investigate this, we utilized recently developed tools from natural language processing and artificial intelligence. Since our stimuli were words, each stimulus could be represented mathematically as a list of numbers called an "embedding," where words with similar meanings have similar representations and the embedding values reflect the semantic and syntactic relationships between words (Khurana et al., 2023). Various machine learning algorithms can use these embeddings as inputs to predict target labels for classification. Using three machine learning classifiers (logistic regression (LR), support vector classifiers (SVC), and Bidirectional Encoder Representations from Transformers (BERT)) with word embeddings as inputs, we conducted two analyses. First, we trained models to predict the true source label for a given word ("imagined," "perceived," or "new") and evaluated whether the source decoding accuracy of these predictions correlated with human performance. Second, we trained models to predict participants' responses (responding perceived, imagined, or new) and evaluated whether judgment decoding accuracy correlated with human performance in each source (perceived, imagined, and new). For LR and SVC, we used embeddings generated by the word2vec method (Mikolov et al., 2013) for the words shown to a participant in the test phase as input. For BERT-based sequence classification, we again used words from the test phase as input, generated embeddings for this word list, and used these embeddings to predict participants' behavior for each of the three source types (Devlin et al., 2018).
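As an illustration of the embedding step, the sketch below loads pretrained Google News word2vec vectors through Gensim and converts a hypothetical list of test-phase words into classifier inputs; this is our reconstruction of the kind of code involved, not the study's actual pipeline.

```python
# Minimal sketch of generating word2vec inputs for the classifiers, assuming
# the pretrained Google News vectors (300 dimensions) loaded through gensim.
# Note: api.load downloads a large (~1.6 GB) model on first use.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")

test_words = ["cliff", "ice", "queen"]                  # hypothetical test-phase words
X = np.array([w2v[w] for w in test_words if w in w2v])  # (n_words, 300) input matrix
```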
Our logic for this analysis was the following: if participants learned some latent representation or rule for the test words in each source condition (i.e., if the words used in each condition had systematic relationships due to a failure to properly randomize across sources), then a model trained to predict the true source type should significantly correlate with participants' behavior in a given source. To anticipate our results, models trained to predict the source type did not significantly correlate with participants' responses, suggesting there was no "task bias." Further, prediction accuracy from models trained on participants' responses did correlate with participants' accuracy, indicating that participants' responses were primarily influenced by their experiences during the task. Thus, in our paradigm, we provide evidence of interactions between memory sources and response types on response rates and confidence ratings, and show that memory-related content is the primary influence on participants' behavior.

Participants
We used MorePower 6.0.4 (Campbell and Thompson, 2012) to compute the required sample size for this task. We set alpha to 0.05 and power to 0.95 for a one-way ANOVA with three levels (source: perceived, imagined, and new), with the primary effect of interest being the influence of source on accuracy. We assumed a priori an effect size of η_p^2 = 0.2. With these parameters, our target sample size was 34 participants. In total, we collected data from 36 participants to account for possible exclusion of outliers. Participants gave consent before taking part in the study (UF IRB #202102019) and were given course credit or $15 as compensation. The study took approximately one hour to complete. Two participants did not complete the study. The remaining 34 participants (25 female, 9 male; mean age 20.9 years, SD = 2.5 years) successfully completed the study, and their data were used for further analysis.

Stimuli
Our stimuli were drawn from 36 lists used to study false memories in the DRM paradigm (Stadler et al., 1999). Each list contained a specific target word, also called a "critical lure." Each list was composed of 15 words (in addition to its critical lure), for a total of 576 words in the experiment (36 target/lure/critical words + 540 other words). For each participant, the words for each source condition were sampled randomly from each of the lists. From each of the 36 lists, four words were sampled for the perceived source condition to form two word pairs. For the imagined source condition, two words were sampled, and each word was presented with a blank during the orientation phase. Finally, two more words were sampled for the new condition, to be presented in the test phase. In total, for each participant, there were 72 word pairs for the perceived condition, 72 words for the imagined condition, and 72 words for the new condition. Overall, there were 144 trials in the orientation phase and 216 trials in the test phase, as described below.
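A minimal sketch of this per-list sampling scheme is shown below (our reconstruction from the description above, assuming each DRM list is stored as a Python list of its 15 non-lure words; handling of the critical lure is omitted for simplicity).

```python
# Minimal sketch of sampling one DRM list for one participant: 4 perceived
# words (forming 2 pairs), 2 imagined cues, and 2 new test words.
import random

def sample_drm_list(drm_list):
    words = random.sample(drm_list, 8)                   # 8 of the 15 list words
    perceived_pairs = [(words[0], words[1]), (words[2], words[3])]
    imagined_cues = words[4:6]   # presented with a blank in the orientation phase
    new_words = words[6:8]       # introduced only in the test phase
    return perceived_pairs, imagined_cues, new_words
```

Applying this to all 36 lists yields the 72 perceived pairs, 72 imagined cues, and 72 new words per participant described above.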

Behavioral experiment
Our behavioral task and overall trial structure are shown in Fig. 2. The experiment was divided into three phases: an orientation phase, a response practice phase, and a test phase. Instructions for each phase were provided at its beginning. The experiment was conducted using PsychoPy 2022.1.4 (Peirce et al., 2019) on a computer with a 24-inch screen and a resolution of 1080 × 720 pixels.

Fig. 2. (Top) The orientation phase trial structure. The word pair (or word and blank) was followed by a screen that required participants to type the second word, and then rate the relatedness between either the two presented words (perception) or the perceived word and the imagined word (imagination). (Below) The test phase trial structure was slightly different: following fixation and the presentation of a first word, participants were instructed to report whether the second word that followed it (from the orientation phase) was perceived, imagined, or new. Feedback was displayed only when the reaction time for the source judgment was either faster than 75 ms or slower than 350 ms.

Orientation phase
On each trial in this phase, participants were instructed to observe either a word pair (e.g., king and queen) or a word and a blank, and to do one of two things: (1) type in the second word if it was presented, or (2) if a blank was displayed in place of the second word (e.g., joker and _____), imagine a second word and type their imagined word. The two types of trials were randomly interleaved. After participants typed in the second word, they were instructed to judge the relatedness between the words in the pair on a scale of 0% (entirely unrelated) to 100% (entirely related). Participants were instructed that relatedness could be judged based on phonology, semantic relation, or whether the words occurred in any category together.
Each trial started with a display of a fixation cross for 1 s (Fig. 2B, top row). After the fixation cross, a word pair or a word and a blank was displayed for 5 s. During this time, participants were instructed to either observe the words (if both words were given) or imagine the second word (if a blank was presented). Following this, they typed the (presented or imagined) second word. Finally, participants judged the relatedness between the two words on a visual rating scale. Importantly, participants were never told that the words in this orientation phase would be used in the final test phase. When imagining the second word, participants were instructed to always imagine a different word that had not yet been used in either the perception or imagination trials. Participants were explicitly instructed to use only English words. They were given feedback if they typed a word that had already occurred on an imagined trial or if it was a non-English word. Typed words were cross-referenced on each trial against the English corpus of the Python-based NLTK library (Bird et al., 2009). The experiment tolerated misspellings: a typed word within three characters of the correct spelling of a corresponding word in the NLTK corpus was accepted as a misspelled English word. If the typed word was not present in NLTK's English corpus and did not qualify under this definition of a misspelled word, feedback was presented on screen after participants submitted their typed word ("Input word is not an English word"), and participants then proceeded to the next trial.
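The sketch below illustrates one plausible implementation of this check (an assumption on our part, based on our reading of the three-character rule; the original experiment code is not reproduced here), using the NLTK words corpus and edit distance.

```python
# Minimal sketch of the English-word check: exact corpus matches are accepted,
# and near-misses within an assumed edit distance of 3 count as misspellings.
import nltk
from nltk.corpus import words as nltk_words

nltk.download("words", quiet=True)
ENGLISH = {w.lower() for w in nltk_words.words()}

def is_acceptable_english(typed, max_dist=3):
    typed = typed.lower()
    if typed in ENGLISH:
        return True
    # Brute-force scan; slow but illustrative. The length filter prunes candidates.
    return any(nltk.edit_distance(typed, w) <= max_dist
               for w in ENGLISH if abs(len(w) - len(typed)) <= max_dist)
```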

Response practice phase
In the response practice phase, participants performed a lexical task based on the practice task in Johnson et al. (1994). The purpose of this task was to practice giving speeded responses, preparing participants for the test phase; the task was otherwise irrelevant to the main objectives of the current study. Participants were instructed to report whether a presented stimulus (a written word) was in uppercase (using the left arrow key), in lowercase (using the right arrow key), or was a non-word (using the up-arrow key). Non-words are letter strings that do not have any meaning and are not used by English speakers (e.g., "judam", "cyrh"). The assignment of choices to the left and right arrow keys was counterbalanced across participants. There were 80 trials in total: 40 word trials and 40 non-word trials. The word and non-word stimuli were downloaded from the online database of the English Lexicon Project (Balota et al., 2007), with forty words and forty non-words randomly sampled from their respective lists. Twenty of the words were randomly assigned to appear in uppercase and twenty in lowercase; non-words could appear in either uppercase or lowercase, with case assigned pseudorandomly. We ensured that none of the words in this phase were used in the orientation phase or the subsequent test phase of the current study.
Each trial started with a fixation cross whose duration was between 300 ms and 500 ms, randomly sampled from a truncated exponential distribution. The stimulus was presented after the fixation cross and remained on the screen until participants responded with a button press. Participants were instructed to be as fast as possible and not to worry about accuracy. They were given visual feedback if their response time was less than 75 ms ("TOO FAST") or greater than 350 ms ("TOO SLOW"). The feedback was displayed for 500 ms.
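As an illustration, fixation durations of this kind can be sampled as follows (a sketch; the distribution's rate parameter was not reported, so the scale used here is an assumption).

```python
# Minimal sketch of sampling a fixation duration (in seconds) from an
# exponential distribution truncated to [0.300, 0.500]; the scale is assumed.
from scipy.stats import truncexpon

low, high, scale = 0.300, 0.500, 0.100   # scale of the exponential is an assumption
fix_dist = truncexpon(b=(high - low) / scale, loc=low, scale=scale)
duration = fix_dist.rvs()                # e.g., 0.37 s
```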

Test phase
At the beginning of this phase, participants were instructed to respond based on their knowledge of words from the orientation phase. Specifically, for a given (first) word presented in the test phase, participants were instructed to report whether the second word that followed it (in the orientation phase) was perceived or imagined, or whether the presented word was a new word. Participants responded using the up-arrow key for "new" and the left or right arrow key for the other two options. The response keys for "perceived 2nd word" and "imagined 2nd word" were counterbalanced (left/right) across participants.
Participants also reported their confidence in their judgment on a scale of 1 (not confident at all) to 6 (highly confident) after they had responded with the arrow keys, by pressing the corresponding digit key on the keyboard with the other hand. After they reported their confidence, participants were given feedback on their reaction time for the arrow-key press, similar to the feedback in the previous phase: they were given visual feedback if the response time was less than 75 ms ("TOO FAST") or greater than 350 ms ("TOO SLOW"). The feedback was displayed for 500 ms.
The task for the test phase is shown in the bottom row of Fig. 2B. Each trial started with a fixation cross whose duration was between 300 ms and 500 ms, randomly sampled from a truncated exponential distribution. The test word was presented after the fixation cross and remained on the screen until the participant responded.

Behavioral data preprocessing
There were 4896 trials in the orientation phase and 7344 trials in the test phase across all participants. We performed several quality checks after data collection was completed. We found that in 4 of the 4896 orientation-phase trials (pooled across participants), the second word was not typed. We also verified, for each participant, that none of the "new" words from the test phase occurred in the orientation or practice phases. Finally, we found that words from 4 trials of the response practice phase overlapped with words used in the orientation phase; we removed those trials from the orientation phase, along with the corresponding trials from the test phase.
For the perceived source condition trials in the orientation phase, we checked for each participant whether the presented second word matched the typed second word. If they did not match, we removed those trials from the orientation phase and the corresponding trials from the test phase. For the imagined trials in the orientation phase, we examined whether the imagined second word was duplicated in the perceived or imagined source conditions; such trials were also removed from the orientation phase, along with their corresponding trials from the test phase. Finally, all words (the first word of each pair, the typed second word from the orientation phase, and the new words from the test phase) were referenced against the vocabulary of the Google News word2vec model available in the Gensim 4.3.1 Python package (Rehurek and Sojka, 2011), and corresponding trials were removed from both the orientation phase and the test phase if a word was not found. Based on these criteria, around 9.5% of all orientation-phase trials (pooled across participants) and 8.1% of test-phase trials were dropped. For the resulting dataset, we verified that all participants had accuracy above 33.33% (chance-level accuracy for a three-alternative choice task). Finally, we removed trials from each participant whose reaction time exceeded three standard deviations of that participant's mean.
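The vocabulary and reaction-time exclusions can be sketched as follows (our illustration, assuming trial data in a pandas DataFrame with hypothetical columns "word" and "rt"; `w2v` is a loaded Gensim KeyedVectors model as above).

```python
# Minimal sketch of two exclusion steps: dropping trials whose word is missing
# from the word2vec vocabulary, and trials with outlying reaction times.
import pandas as pd

def clean_trials(df: pd.DataFrame, w2v) -> pd.DataFrame:
    df = df[df["word"].map(lambda w: w in w2v)]    # drop out-of-vocabulary words
    z = (df["rt"] - df["rt"].mean()) / df["rt"].std()
    return df[z.abs() <= 3]                        # drop RTs beyond 3 SDs
```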

Multinomial processing tree (MPT)
We used the TreeBUGS 1.4.7 package (Heck et al., 2018) in R 4.0.3 (R Foundation for Statistical Computing, 2018) for model fitting, using the observed counts of judgments in each source (Supplementary Material Table S1). Our MPT structure is based on the description of MPTs for source monitoring by Batchelder and Riefer (1990). We fit five models with the following constraints: (a) D_P = D_I and a = g; (b) D_P = D_I; (c) d_P = d_I; (d) a = g; and (e) d_P = d_I and a = g. Each model used four chains with 500,000 iterations, 100,000 burn-in samples, 40,000 adaptation samples, and a thinning interval of 10. The convergence statistic R-hat was checked for each parameter, and a parameter was considered converged if R-hat < 1.1 (Gelman and Rubin, 1992). Further, model fit was evaluated visually using posterior predictive checks on the covariance of each judgment type with the others, and the best-fitting model is presented in the Results. We computed the estimated probability of each judgment given the source for each participant, correlated it with the corresponding confidence ratings, and report the group-level mean parameter estimates with 95% credible intervals.

Machine learning experiment
It is possible that, in our task, participants were not using their memories from the orientation phase to drive behavior in the test phase; instead, systematic semantic relationships could have emerged across source conditions as we sampled words, and these relationships could have been the driving force behind behavior. To control for this possible confound, we trained LR, SVC, and BERT models using word embeddings as inputs to predict (1) the specific source for a given word in our task and (2) participants' responses. Each model has different characteristics: LR learns linear relationships between embeddings and targets, while SVC and BERT learn non-linear relationships between inputs and outputs. During training, LR treats each input as an independent sample. SVC, by contrast, takes other samples into account: it learns the relationship between input and output by projecting the inputs (here, word embeddings) into a higher-dimensional space and utilizing the distances between inputs (Noble, 2006). Thus, it accounts for the relationship of an input with other inputs of the same category when predicting a given category (in our case, the memory source). BERT is a non-linear classifier pretrained on a large corpus of text to generate encodings of input words, and it can account for similarities between stimuli within each classification category (Devlin et al., 2018). It generates its own embeddings, unlike SVC and LR (neither of which can generate embeddings). BERT can also account for the order in which stimuli occurred across trials of the test phase, while SVC and LR use a fixed embedding for each word irrespective of the order in which it occurred. Thus, each of our three models has distinct advantages and disadvantages.
The models were trained using word embeddings as input (for the LR and SVC models) or the words themselves (for the BERT model) to predict two different targets. One group of models was trained to predict the source condition from which a given word came ("perceived," "imagined," or "new"); in other words, it decoded the source condition to which the word belonged (which we call "source decoding"), testing the hypothesis that the words and their assigned sources themselves primarily influenced each participant's performance. A second group of models (again using all three model types) was trained to predict participants' judgments from their responses in the test phase; here, the classifiers learned each participant's response tendencies during the test phase (which we call "judgment decoding"). If cognitive processing from the orientation phase primarily drives participants' performance, then judgment-decoding performance should correlate with participants' performance in each source.
For a given participant, we trained and tested the LR, SVC, and BERT models using 5-fold stratified cross-validation: the data set was divided into 5 sets, and each model was trained on 4 sets and tested on the remaining set. This process was iterated over the 5 sets while preserving the proportion of target labels in the training and test sets. The performance of a classifier was measured as the accuracy in each source, averaged over the 5 test sets. For SVC, we used the radial basis function (RBF) kernel and performed a grid search over hyperparameters: the regularization strength and the width of the RBF kernel were optimized within each cross-validation fold, and the best hyperparameters were used for the test set. For the BERT sequence classification model, within each cross-validation fold we trained for 5 epochs with a batch size of 16, using a cross-entropy loss and the AdamW optimizer (Adam with weight decay).
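For the SVC decoder, the cross-validation and grid-search procedure can be sketched as follows (a simplified illustration using scikit-learn; the exact hyperparameter grid was not reported, so the values below are assumptions).

```python
# Minimal sketch of the 5-fold stratified cross-validation with an RBF-kernel
# SVC and a nested grid search over C (regularization) and gamma (kernel width).
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import SVC

def decoding_accuracy(X, y):
    """X: word embeddings (n_trials, 300); y: source or judgment labels."""
    accs = []
    for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                       random_state=0).split(X, y):
        grid = GridSearchCV(SVC(kernel="rbf"),
                            param_grid={"C": [0.1, 1, 10],
                                        "gamma": ["scale", 0.01, 0.1]},
                            cv=3)
        grid.fit(X[train], y[train])      # hyperparameters tuned on training folds
        accs.append(grid.score(X[test], y[test]))
    return float(np.mean(accs))           # mean accuracy over the 5 test sets
```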

Behavioral results
For statistical analysis, we used JASP 0.14.1 (JASP Team, 2023). A one-way ANOVA with three levels (imagined, perceived, and new) was performed to evaluate the effect of memory source on accuracy and on metacognitive ability. Two-way ANOVAs with source (imagined, perceived, and new) × judgment (imagined, perceived, and new) were performed to evaluate how these variables influence response rates, confidence judgments, and reaction times. Whenever the sphericity assumption of the ANOVA was violated, the Greenhouse-Geisser correction was applied to the F-statistics. Metacognitive ability was measured for each memory source by computing Pearson's r between accuracy and confidence ratings for each participant, and a one-way ANOVA was used to test the effect of source (imagined, perceived, and new) on metacognitive ability. Post-hoc pairwise comparisons were conducted using the Holm-Bonferroni correction to examine differences in metacognitive ability among the sources.

Accuracy
A one-way repeated-measures ANOVA was performed to evaluate the effect of word source (imagined, perceived, and new) on accuracy (Fig. 3). The omnibus test was significant, F(1.658, 54.123). Thus, we can infer that in our paradigm of reality monitoring using DRM lists, the accuracy for imagined sources was reduced compared to new and perceived sources.

Response rates for different sources
We calculated response rates for each judgment type (i.e., the proportion of each judgment) given a particular source to evaluate whether certain types of responses were more likely than others (Fig. 4). We conducted a two-way repeated-measures ANOVA with source (imagined, perceived, and new) and judgment type (imagined, perceived, and new) as independent variables (Sugimori et al., 2014). The main effect of source was not significant (F(2, 66) = 8.012e-14, p = 1, η_p^2 = 2.428e-15), but we did find a significant main effect of judgment (F(1.658, 54.717) = 30.098, p < 0.001, η_p^2 = 0.477) and a significant interaction between source and judgment (F(2.280, 75.228) = 60.962, p < 0.001, η_p^2 = 0.649).
Post-hoc comparisons for the main effect of judgment revealed that judging the paired second word as imagined (M = 0.167, SE = 0.023) was significantly less common than judging it as perceived (M = 0.481, SE = 0.028), with mean difference = −0.316 (95% CI = [−0.437, −0.194], t = −6.540, p_holm < 0.001, d = −1.699). Contrasting imagined judgments with judgments that the stimulus was new (M = 0.351, SE = 0.018) revealed a mean difference of −0.185, which was also significant (95% CI = [−0.266, −0.105], t = −5.923, p_holm < 0.001, d = −0.998). We further compared new and perceived response rates and found these to be significantly different as well, with mean difference = −0.130 (95% CI = [−0.233, −0.027], t = −3.184, p_holm = 0.003, d = −0.701). Overall, these post-hoc tests revealed that stimuli were more likely to be judged as perceived than as either new or imagined. As expected, post-hoc comparisons for the main effect of source did not reveal any significant differences.
For the post-hoc comparisons of the interaction effect, we were interested in the proportion of judgments given a source; thus, we made nine comparisons, with three comparisons within each source across the different judgment types. When the source was imagined, the proportion of responses judged as imagined (M = 0.285) was lower than the proportion judged as perceived. Overall, our primary finding was that when the source was imagined, individuals were more likely to misattribute it as perceived than to correctly attribute it as imagined, which can be taken as evidence for confabulation occurring in this paradigm. When the source was new, participants often accurately attributed stimuli as new; when incorrect, they were more likely to attribute the source as perceived than as imagined. Finally, when the source was perceived, participants were often correct in their judgments, but when they were incorrect, they were more likely to classify the stimulus as new than as imagined.

Fig. 4. Response rates of each judgment type for each source condition. Colored lines connect mean response rates of judgments in each source condition. Results showed a main effect of judgment, with perceived judgments (green) having the highest response rates, followed by new judgments (purple), and the lowest response rates for imagined judgments (orange). Results also showed an interaction between source and judgment type: for perceived sources, participants were most likely to correctly judge the source as perceived; for imagined sources, however, participants were more likely to incorrectly judge the source as perceived. For new sources, participants were highly likely to correctly judge the stimulus as new.

Confidence
To analyze confidence ratings, we first removed participants who had a zero response rate for any specific judgment type given a source, leaving a total of 26 participants for this analysis. We conducted a two-way repeated-measures ANOVA with three levels of source (imagined, perceived, and new) × three levels of judgment (imagined, perceived, and new) (Fig. 5). We found a significant main effect of source (F(2, 50) = 5.862, p = 0.005, η_p^2 = 0.190), a significant main effect of judgment (F(2, 50) = 7.342, p = 0.002, η_p^2 = 0.227), and a significant interaction between source and judgment (F(2.791, 69.786) = 11.066, p < 0.001, η_p^2 = 0.307).
Post-hoc tests for the effect of source revealed a significant difference in confidence between stimuli from the imagined source (M = 4.070, SE = 0.181) and the new source (M = 3.780, SE = 0.190), with mean difference = 0.279 (95% CI = [0.060, 0.498], t = 3.159, p_holm = 0.008, d = 0.234). Further, confidence was higher for imagined sources than for perceived sources (M = 3.824, SE = 0.181), with mean difference = 0.241 (95% CI = [0.022, 0.460], t = 2.723, p_holm = 0.018, d = 0.202). We found no significant difference between new and perceived sources (p_holm = 0.665). Together, this shows that imagined sources elicited higher overall confidence than perceived or new sources, while confidence for perceived and new sources did not differ significantly.
Post-hoc tests for the main effect of judgment revealed that stimuli judged as new (M = 3.577, SE = 0.196) received significantly lower confidence than stimuli judged as perceived (M = 4.245, SE = 0.193), with mean difference = −0.654 (95% CI = [−1.078, −0.230], t = −3.831, p_holm = 0.001, d = −0.549). When stimuli judged as imagined (M = 3.918, SE = 0.203) were compared to stimuli judged as either new or perceived, we found no significant differences (p = 0.102 for both comparisons). Thus, post-hoc tests on the effect of judgment revealed that stimuli judged as new received lower confidence than stimuli judged as perceived.
For the post-hoc comparisons of the significant interaction effect, we examined confidence in the judgments given a source; this led to nine comparisons, with three in each source. For imagined sources, confidence differed significantly when the source was correctly judged as imagined (M = 4.435, SE = 0.253) compared to when it was incorrectly judged as new (M = 3.327, SE = 0.249), with mean difference = 1.093 (95% CI = [0.616, 1.570], t = 4.525, p_holm < 0.001, d = 0.917). However, correct imagined judgments did not significantly differ from judgments incorrectly attributed as perceived (M = 4.473, SE = 0.217), with mean difference = −0.043 (95% CI = [−0.520, 0.434], t = −0.043, p_holm = 1, d = −0.036). There was also a significant difference when the imagined source was incorrectly judged as new (M = 3.327) compared to when it was incorrectly judged as perceived (M = 4.473).

Fig. 5. Confidence ratings for each judgment type in each source condition. Each boxplot shows the median and the interquartile range (IQR) spanning the second and third quartiles, with whiskers denoting 1.5 times the IQR from the box edges. Colored lines connect average confidence across all participants within each source condition. A main effect of source was found on confidence ratings, with imagined sources rated with the highest mean confidence and no significant difference between confidence ratings for new and perceived sources, which were lower. Further, significant differences in confidence were identified: correct perceived judgments were rated with high confidence, compared to incorrect perceived-source judgments identified as new or imagined. For the imagined source (middle), there was no difference between correct imagined judgments and incorrect perceived judgments. For the new source, mean confidence did not significantly differ across judgments.

Relatedness ratings
In the orientation phase, after they had typed the second word, we asked participants to rate how related the two words of each pair were, on a scale of 0-100%. In this phase, to match effort between source conditions, participants were instructed to type the second word when both words of a pair were displayed (the perceived source condition) and to type their imagined second word when only the first word was given (the imagined source condition). Importantly, participants were not explicitly instructed to imagine a related word when the word pair was incomplete; they were instructed that they could imagine any word that had not been previously used in the trials. We performed a paired-samples t-test (two-tailed) to evaluate the difference between relatedness ratings for words in the imagined and perceived source conditions (Fig. 6). The mean relatedness rating between the first word and the imagined second word (M = 72.253, SD = 12.455) was significantly higher than the relatedness between the two presented words on perceived trials (M = 53.119, SD = 8.765), with mean difference = 19.134% (95% CI = [15.805%, 22.464%], t(33) = 11.692, p < 0.001, d = 2.005). Thus, under voluntary imagination, participants tended to imagine a word highly related to the given first word, even when not explicitly instructed to do so.

Fig. 6. Raincloud plots of relatedness ratings of words in word pairs for the imagined and perceived source conditions during the orientation phase. Boxplot markings follow previous conventions in the manuscript. Participants rated word pairs as more highly related in the imagined source condition than in the perceived source condition. Ratings were made on a scale of 0% to 100%. *** p < 0.001.

Metacognitive ability is highest for the perceived source
In our study, metacognitive ability is measured using Pearson's correlation between accuracy and confidence in each source for each participant (Fleming and Lau, 2014) (Fig. 7). Four participants were dropped from this analysis because they had no variance in either accuracy or confidence, leaving 30 participants for a one-way ANOVA on the correlation coefficients with source (perceived, imagined, and new) as a three-level factor. We found a significant main effect of source, F(2, 58) = 6.581, p = 0.003, η_p^2 = 0.190.
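A minimal sketch of this per-participant measure is shown below (our illustration; trial-level accuracy is binary, confidence is on the 1-6 scale, and participants without variance in either variable are excluded as described above).

```python
# Minimal sketch of the metacognition measure: Pearson's r between trial-level
# accuracy (0/1) and confidence (1-6) within one source for one participant.
import numpy as np
from scipy.stats import pearsonr

def metacognitive_ability(accuracy, confidence):
    accuracy, confidence = np.asarray(accuracy), np.asarray(confidence)
    if accuracy.std() == 0 or confidence.std() == 0:
        return np.nan  # no variance: participant excluded from the ANOVA
    return pearsonr(accuracy, confidence)[0]
```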

Best-fitting model and its parameters
Based on DIC and WAIC scores, the model with d_P = d_I fit the data from the current study best (Table 1); the constraint d_P = d_I captures the assumption that the probability of discriminating a stimulus as perceived or imagined, respectively, is equal. The posterior predictive check using pairwise covariances of each judgment type conditioned on the source (Fig. 8B) suggested that the best-fitting model fit the observed data in a satisfactory manner.
The group-level mean parameter values and corresponding 95% credible intervals (CI) of the best-fitting model (d_P = d_I; DIC = 912.4; WAIC = 997.180) are presented in Table 2 and Fig. 8A. To refresh concepts for the reader, detectability reflects whether a source is detected as old (either perceived or imagined) (see Fig. 1 for the MPT). Detectability can change based on source, which is why there are two parameters (D_P and D_I). We found the detectability of the perceived source as old, D_P (M = 0.596; 95% CI = [0.500, 0.680]), to be significantly lower than that of imagined source stimuli, D_I (M = 0.742; 95% CI = [0.629, 0.835]). Taken at face value, these detectability values suggest that the perceived source should yield more inaccurate judgments, as D_P is lower than D_I. As shown in Fig. 3, this is not the case. Importantly, however, we find that the guessing/bias (externalizing bias) probabilities a (M = 0.878; 95% CI = [0.749, 0.967]) and g (M = 0.796; 95% CI = [0.710, 0.870]), which reflect the biases for non-discriminated and undetected test stimuli, respectively, to be judged as perceived, are significantly higher than the D_x (x = P, I) values. This suggests that these bias parameters may compensate for the lower value of D_P when producing perceived judgments; in other words, participants have a bias to report stimuli as perceived, and this could lead to higher accuracy in the perceived source and misattribution of imagined-source stimuli as "perceived."
The discriminability of the stimulus after it has been detected as old is assumed to be equal across sources in our best-fitting model, d_x = d_P = d_I (M = 0.112, 95% CI = [0.029, 0.223]). Importantly, the estimated value of d_x is significantly lower than the bias parameters a and g, and also significantly lower than the detectability parameters D_x (x = P, I). As noted above, this suggests that discriminability of the sources is low, consistent with our use of highly similar DRM stimuli across sources having led to an inability to effectively discriminate sources.
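To see how the bias parameters can offset low detectability and discriminability, one can plug the group-level posterior means into the branch equations from Fig. 1B (a back-of-the-envelope sketch using the reported values, not the authors' analysis):

```python
# Worked example using the reported group-level means of the best-fitting model.
D_P, D_I, d, a, g, b = 0.596, 0.742, 0.112, 0.878, 0.796, 0.331

# Probability of judging a perceived stimulus as "perceived"
p_P_given_P = D_P * d + D_P * (1 - d) * a + (1 - D_P) * b * g   # ~0.64

# Probability of (mis)judging an imagined stimulus as "perceived"
p_P_given_I = D_I * (1 - d) * a + (1 - D_I) * b * g             # ~0.65

print(round(p_P_given_P, 2), round(p_P_given_I, 2))  # 0.64 0.65
```

Despite the perceived source's lower detectability, the high guessing parameters a and g push both probabilities toward "perceived" responses, consistent with the externalizing bias described above.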
Finally, the parameter quantifying the probability of a "new" stimulus being detected as old is b (M = 0.331, 95% CI = [0.255, 0.413]). This parameter was significantly lower than D_x, significantly higher than d_x, and significantly lower than the externalizing bias parameters (a and g). This suggests that participants were relatively unlikely to judge a new stimulus as old (perceived or imagined), but when they did, they were more likely to misattribute it as perceived rather than imagined because of the high value of g.
The second-best model assumed a = g (DIC = 948; WAIC = 1037.292). As with the best-fitting model, this model displayed a high value of a (M = 0.827, 95% CI = [0.749, 0.893]) compared to all other parameters. Overall, the parameters from this model showed values similar to the best-fitting model (d_P = d_I); the only interesting difference was that d_P (M = 0.291; 95% CI = [0.036, 0.588]) was numerically higher than d_I (M = 0.052, 95% CI = [0.002, 0.170]), although not significantly different. This hints that perceived stimuli might have better discriminability, as word pairs in the perceived source condition had lower relatedness ratings than word pairs in the imagined source condition. The remaining parameters in the a = g model included b (M = 0.323, 95% CI = [0.247, 0.404]), D_P (M = 0.604, 95% CI = [0.509, 0.687]), and D_I (M = 0.756, 95% CI = [0.647, 0.844]); we again note that these were similar to the best-fitting model with d_P = d_I. In summary, both models support the conclusion that the behavioral effects might be driven by an overall bias to externalize our experiences (i.e., to judge experiences as "perceived").

Correlation of confidence judgments and estimated probabilities of judgments
We also investigated how confidence judgments track the estimated probability of a judgment conditioned on the source, derived from the best-fitting model at the group level. To do this, we correlated the two for each source. In other words, the probability of a judgment given a source was estimated from the best-fitting model (d_P = d_I) and correlated with the corresponding mean confidence ratings for each participant (Fig. 9). We found that it was only in the "new" source that participants' mean confidence ratings tracked the estimated probability of correct judgments in that source (r = .51, p = .008).

In the perceived source, the correlation between the estimated probability of true ("perceived") judgments and confidence trended towards significance (r = .37, p = .062). An interesting point to note is that for the perceived source, only the perceived judgments showed a trend towards significance; in the imagined source, both perceived and imagined judgments had positive correlations, although neither reached significance. These results suggest that at the group level, confidence only tracks the estimated probability of accurate judgments in the "new" source (Fig. 9), even though individuals' metacognitive ability is highest for the perceived source (Fig. 7). We comment further on this result in the Discussion section.

Fig. 7. Raincloud plots of metacognitive ability for different sources, as measured using Pearson's correlation. Boxplots follow previous conventions in the manuscript. Metacognitive ability is measured as the Pearson's correlation between accuracy and confidence ratings. The perceived source had significantly higher metacognitive ability than the imagined or new sources; the imagined and new sources did not significantly differ from each other. ** p = 0.007; n.s. reflects non-significance.

Fig. 9. The line in each panel represents the correlation between the estimated probability of a judgment in a given source and the corresponding mean confidence rating. The results suggest that it is only in the new source condition that confidence tracks accuracy at the group level. In the other two sources, perceived and imagined, the correlations showed trends for the relationship between confidence and accuracy, but none were significant. n.s. signifies "not significant," i.e., p > 0.07.

Correlating participants' source accuracy with decoding prediction accuracy
Finally, we performed machine learning analyses to determine whether models trained to predict (1) participants' judgments or (2) the source (perceived, imagined, new) that a given word was associated with would significantly correlate with participants' source accuracy during the test phase. As shown in Fig. 10, classifiers trained to predict participants' judgments correlated strongly with participants' actual behavioral performance in a given source, with highly significant correlations for all three model types and all three source types (blue dots and labels, Fig. 10). Critically, classifiers trained to predict the source of each word did not significantly correlate with participants' behavior in the test phase, in all but one case (black dots and labels, Fig. 10). This result provides evidence that while semantic information may still influence participants' behavior in this task, the behavioral effects exhibited by our participants are not primarily driven by systematic biases in semantic relationships across source conditions in the test stimuli. Instead, they are primarily driven by memories created during the orientation phase of the experiment. In this sense, our machine learning analyses provide an additional control that rules out this potential bias in participants' response tendencies.
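The following Python sketch illustrates the logic of this control analysis for one classifier type. The featurization shown here (character n-gram TF-IDF) and the per_participant data structure are assumptions made for the sake of a self-contained example; our actual analyses used the three classifier types reported in Fig. 10.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def decoding_accuracy(words, labels):
    """Cross-validated accuracy of a word-based classifier (a stand-in for
    the LR classifier in Fig. 10)."""
    clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                        LogisticRegression(max_iter=1000))
    preds = cross_val_predict(clf, words, labels, cv=5)
    return float(np.mean(preds == np.asarray(labels)))

def correlate_with_behavior(per_participant, target_key):
    """Correlate per-participant classifier accuracy with behavioral source
    accuracy. target_key is "judgments" (blue dots in Fig. 10) or "sources"
    (black dots); per_participant is a hypothetical list of dicts with keys
    "words", "judgments", "sources", and "source_accuracy"."""
    clf_acc = [decoding_accuracy(p["words"], p[target_key])
               for p in per_participant]
    beh_acc = [p["source_accuracy"] for p in per_participant]
    return pearsonr(clf_acc, beh_acc)
```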

Discussion
Previous research into reality monitoring of memories has yet to investigate the nature of false memories for semantically related information and the associated metacognitive beliefs about those memories (Buda et al., 2011; Johnson et al., 1981; Henkel and Franklin, 1998). In this study, our overall goal was to assess source-specific profiles for response rates and confidence in a reality monitoring task. Using standardized stimuli for false memories from Stadler et al. (1999), we were able to show a unique interaction between source and judgment for both response rates and confidence ratings. Specifically, when the source was imagined, individuals were more likely to misattribute it as perceived; further, they reported similar average confidence when incorrectly reporting "perceived" and when correctly reporting "imagined" in the imagined source condition. We also showed that in the perceived source condition, metacognition better tracked source accuracy than in the imagined or new source conditions (with no difference between these latter two conditions). Modeling with multinomial processing trees (MPTs) indicated that an externalizing bias likely drove the tendency to judge a test stimulus as perceived. Additionally, correlations between the estimated probability of a judgment given a source and confidence were positive for all three source types, although only the "new" source correlation was significant. Finally, using machine learning, we confirmed that behavior was primarily driven by memories of the sources themselves and not by relationships between the test words and their assigned sources (i.e., there was no task-related bias due to the design of the experiment).
The results for task accuracy showed that the imagined source condition had the worst accuracy compared to the perceived and new source conditions, with no difference between the latter two. This result has not been observed previously, and we attribute it to voluntary imagination producing highly related words, compared to the relatedness ratings of the perceived word pairs. Johnson et al. (1981) found that their low-frequency category condition had better accuracy for source-related judgments; on this basis, we hypothesize that the imagined source condition (with higher relatedness ratings) caused poorer identification of sources (noting that the perceived source condition, with lower relatedness ratings, had higher accuracy for source-related judgments). As to why participants produced word pairs with higher degrees of relatedness in the imagined source condition even when not explicitly instructed to do so, we speculate that producing related words may require less cognitive control and effort. Low cognitive control during the imagined source condition could thus have resulted in poorer encoding of the imagined experience (compared to a perceived stimulus) and could underlie the poorer accuracy for the imagined source condition. This finding is in line with previous proposals on the reality monitoring of errors, where participants are likely to be inaccurate when the generation of an experience is relatively natural or effortless (Finke et al., 1988; Johnson, 2006; Johnson and Raye, 1981; Subramaniam et al., 2018).
An interaction of source and judgment on response rate has been shown previously by Sugimori et al. (2014), but the post-hoc results from the current study differ. Sugimori et al. (2014) investigated reality monitoring between heard (perceived) and imagined words. They found that for heard stimuli, participants were more likely to judge the stimulus as heard rather than imagined, while for imagined words, there was no significant difference between the two judgments. We found that participants were more likely to judge the test stimulus as perceived in both the imagined and perceived source conditions. The misattribution of imagined experiences as perceived has also been found by Henkel and Franklin (1998), although they did not investigate confidence in memory judgments. Of central interest here is that participants made correct judgments in the perceived source condition but often misattributed experiences as perceived in the imagined source condition; we posit that this, too, might be linked to relatedness ratings.
The post-hoc comparisons for new stimuli were similar to Johnson et al. (1981): participants were more likely to correctly judge the true source as new and, on misattributions, were more likely to judge new stimuli as perceived rather than imagined. Our results revealed a similar trend. Potentially, this could be due to a participant's bias towards responding that a test stimulus was perceived. To understand whether this might be the case, we used a cognitive modeling approach (MPT). From our MPT results, we found that the best-fitting model was a version in which the discriminability parameters for the imagined and perceived source conditions were equal (when a stimulus has been detected as old). Parameters reflecting the bias to respond that a test stimulus was perceived (a and g) were high compared to the other parameters; this likely reflects participants' tendency to respond "perceived" in the imagined and new source conditions. In contrast, their bias to report an undetected stimulus as old (b) was low, resulting in high accuracy when judging new stimuli.
A simple explanation for this behavioral profile is that it could be due to a task bias rather than a cognitive bias. In other words, we were interested in whether sampling the stimuli from DRM lists in each source could form latent clusters that might be used as a cue to guide judgments. If so, participants could be responding based on a task bias introduced by the stimulus design rather than on their actual experiences. The notion that task bias (rather than cognitive bias) may influence results was also investigated by Sugimori et al. (2014). In that study, using fMRI, they found that the reality monitoring effects were more likely due to differences in the experiences of imagination and perception. An account based on signal detection theory has also been posited to explain these effects: during the imagined source condition, participants may adopt a liberal criterion that causes imagined experiences to be falsely judged as perceived when the two experiences are congruent (similar) (Dijkstra et al., 2021), thus resulting in source-mixing. This could be occurring for the imagined source in our paradigm as well. Future work is needed to determine which of these possible explanations produces the bias to judge a test stimulus as perceived.
We also investigated metacognitive judgments using confidence ratings in our task and (to the best of our knowledge) showed new findings regarding an interaction between source and judgment type relative to previous results (Ranjan et al., 2023; Buda et al., 2011; Johnson et al., 1981). One aspect of our findings was similar to previous research: the imagined source condition had overall high confidence, irrespective of the correctness of the judgment (Johnson et al., 1988). In post-hoc tests for the perceived source condition, we found that confidence was higher for correct perceived judgments than for incorrect perceived judgments. This post-hoc result is in line with the general understanding in metacognition research (Fleming, 2023; Fleming and Lau, 2014), where it has been shown that participants tend to respond with higher confidence during correct judgments than during incorrect judgments.
In our imagined source condition, participants had similar confidence when correctly judging the stimulus to be imagined and when incorrectly judging it to be perceived, and displayed lower confidence when incorrectly judging it as new. This finding is particularly worthy of attention: in the imagined source condition, when incorrectly attributing experiences to the perceived source, participants were just as confident as when they correctly judged an experience to be imagined. Taken together with the post-hoc result for response rates in the imagined source condition (where participants were more likely to judge the imagined source as perceived), this reveals a tendency for confabulation in typical participants.
We also found, for the first time, that when measuring metacognitive ability as the (Pearson's r) correlation between accuracy and confidence ratings, metacognitive ability in the perceived source was higher than in the imagined and new sources; these two latter source conditions were not significantly different from each other. These results suggest that during the perceived source condition, confidence better tracks task accuracy, and that this ability is reduced in the imagined source condition. We also repeated this analysis using the Goodman-Kruskal gamma coefficient and found the same effects with similar effect sizes (Supplementary Material Fig. S1). What might be causing this result? One possibility is that the high relatedness ratings in the imagined source condition resulted in noisier encoding, causing lower confidence during judgment. A second possibility is that perceived experiences may produce stronger stimulation or activity, improving the signal-to-noise ratio and thus enhancing metacognitive signals during this source condition. These hypotheses warrant further investigation, and future work should probe why metacognition may be better for perceived information than for other source types.
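For clarity, a minimal Python sketch of both within-subject measures follows, assuming trial-level accuracy (coded 0/1) and confidence vectors for a single participant; the implementation details are illustrative rather than a restatement of our analysis code.

```python
import numpy as np
from scipy.stats import pearsonr

def metacognitive_ability_pearson(accuracy, confidence):
    """Within-subject metacognitive ability: Pearson correlation between
    trial-level accuracy (0/1) and confidence ratings."""
    return pearsonr(accuracy, confidence)[0]

def goodman_kruskal_gamma(accuracy, confidence):
    """Goodman-Kruskal gamma: (C - D) / (C + D), where C and D count
    concordant and discordant trial pairs; tied pairs are ignored."""
    acc, conf = np.asarray(accuracy), np.asarray(confidence)
    c = d = 0
    for i in range(len(acc)):
        for j in range(i + 1, len(acc)):
            s = (acc[i] - acc[j]) * (conf[i] - conf[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (c + d) if (c + d) else np.nan
```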
Although correlation provides one measure of metacognitive ability, it can conflate several components, including metacognitive efficiency, metacognitive bias, and metacognitive sensitivity. Because Pearson's correlation cannot reveal which of these, specifically, drives metacognitive ability, future work should aim to tease them apart. One issue that must be addressed to do so is that many measures of metacognitive ability (e.g., meta-d′; Maniscalco and Lau, 2012) are designed for two-alternative forced-choice paradigms. Measures of metacognitive ability for three or more response types are rarer (Rahnev et al., 2022); recent work has investigated new modeling approaches for confidence judgments that incorporate three or more choices (Li and Ma, 2020), but future work will need to reveal what, specifically, underlies superior metacognitive ability for perceived information.
We also correlated all participants' estimated probability of a judgment in a source with the corresponding mean confidence ratings to examine how the two are associated (Fig. 9). These correlations can also be performed using the response rates of judgments (Supplementary Material Fig. S2); we found the two to be similar, further suggesting that the model fit was accurate. We found that for the new source, confidence positively tracked the estimated probability of correct judgments and exhibited negative correlational trends for incorrect (perceived and imagined) judgments. For the perceived source, there was a positive correlation (trending towards significance) when the judgment was correct; for the imagined source, imagined and perceived judgments showed positive correlational trends (although also non-significant). This result could signify source-mixing at the group level in imagined sources, as both perceived and imagined judgments show positive trends in the imagined source, while in the new and perceived sources, source-mixing does not occur.
This result was puzzling, as our findings exhibit contrasting results for metacognitive ability (Fig. 7) and for correlations between estimated probability and confidence (Fig. 9). We found that for the new and imagined sources, participants had lower metacognitive ability than for the perceived source, yet the correlation between the probability of correct judgments and confidence was higher in the new source than in the perceived and imagined sources. Both methods are valid techniques for understanding the relationship between accuracy and confidence but differ conceptually: the former is a within-subject measure, and the latter is a between-subject measure. The within-subject method measures the degree of association between accuracy and confidence within an individual and is sometimes referred to as "resolution"; the between-subject method measures whether individuals with high accuracy also have high confidence (Roediger and DeSoto, 2015; Busey et al., 2000). In the memory literature, it has been shown that the two correlational methods can show different degrees of association, opposite directions, or even no correlation between accuracy and confidence at all, depending on the experimental manipulation (Busey et al., 2000; DeSoto and Roediger, 2014; Roediger and DeSoto, 2014, 2015). Thus, consistent with such deviations between the two types of analysis for memory confidence, our results also show deviations between within-subject and between-subject measures of the association of accuracy and confidence at the source level. We also note that these analyses have not previously been attempted in the reality monitoring literature.
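A minimal sketch of the distinction, assuming each participant contributes trial-level accuracy and confidence arrays (the names and data structure are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

# trials[i] is a hypothetical (accuracy, confidence) array pair
# for participant i.

def within_subject_resolution(trials):
    """One Pearson r per participant: trial accuracy vs. confidence
    ('resolution', as in Fig. 7)."""
    return [pearsonr(acc, conf)[0] for acc, conf in trials]

def between_subject_association(trials):
    """One Pearson r across participants: does a participant's overall
    accuracy predict their overall confidence? (cf. Fig. 9)."""
    mean_acc = [np.mean(acc) for acc, _ in trials]
    mean_conf = [np.mean(conf) for _, conf in trials]
    return pearsonr(mean_acc, mean_conf)
```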
Overall, we used our MPT models to uncover cognitive aspects of the observations made in the current study. Two of our competing models were drawn from previous research (Henkel and Franklin, 1998; Johnson et al., 1994), and three of our models tested alternative assumptions. Henkel and Franklin (1998) and Johnson et al. (1994) showed that reality monitoring effects might be driven by differences in D_x and d_x (detectability and discriminability of different sources, respectively) across type of similarity and response delay. They chose a priori to investigate only one MPT structure. Instead of choosing a model structure a priori, we conducted model comparisons, found that the model with d_P = d_I best fit the current study's data, and examined which parameters might be influencing judgments. Interestingly, the model with equal bias parameters (a = g) was very close to our best-fitting model; upon exploring each model's results, we noted that the two models contained similar parameter values. Our MPT findings suggest that in our task, the reality monitoring effects could be driven by an externalizing bias (a and g).
We also used machine learning to evaluate how classifiers might predict participants' responses. As discussed earlier, one explanation for the observed response rates could be a task bias; one such bias could be inherent structure or clustering among the test words within a specific source type, which participants could learn and which could bias them to respond with a specific judgment type. To rule out the possibility of such a bias influencing behavior in our task, we trained three machine learning classifiers in different ways. We found that when the classifiers were trained to predict the true source of a word, their predicted response rates did not correlate with those of the participants. They correlated with participants' accuracies only when the classifiers were trained on participants' judgments themselves. This provides evidence that the observed response rates were not the result of inherent clustering of the test stimuli but of experiences from the orientation phase of the experiment.
Together, our study illuminates some interesting relationships between source and reality monitoring judgments, for both response rates and confidence. One primary limitation is that the amount of relatedness was neither controlled nor parametrically manipulated. Although we ground our findings in past and contemporary empirical work on reality monitoring, we do not know how the amount of relatedness between the first and second words in the imagined and perceived source conditions may have influenced our observations. Despite this limitation, we observed that when participants are not explicitly constrained by instructions about how to guide their imagination, they are likely to imagine a highly similar word. Future work should examine how parametrically manipulating this relatedness across sources influences response rates and confidence.
Reality monitoring of our memories is a complicated facet of our mental lives. Our findings indicate that it can fail, and that it may be influenced by how similar our imagination is to the real world (as evidenced by how imagined words were highly similar to the first word of the incomplete word pair during the orientation phase). Further, our results revealed a unique interaction between source and judgment on the confidence of the judgment. Overall, our study provides a novel task for studying confabulation and confidence effects during reality monitoring, and provides evidence that when information is highly semantically interrelated, participants may be more error-prone than previously thought.

Fig. 2.
Fig. 2. Experiment design. (A) Our experiment contained three phases: an orientation phase, a response practice phase (a lexical task, not shown), and a test phase. (B) The orientation phase trial structure. This was followed by a screen that required participants to type the second word and then rate the similarity between either the two words (perception) or the perceived word and the imagined word (imagination). (Below) The test phase trial structure was slightly different: following fixation and the presentation of a first word, participants were instructed to report whether the second word that followed (from the orientation phase) was perceived, imagined, or new. Feedback was displayed only when the reaction time for the source judgment was either faster than 75 milliseconds or slower than 350 milliseconds.

Fig. 3.
Fig. 3. Raincloud plots of accuracy for different memory sources. Colored distributions signify the accuracy distribution for each source. Individual dots show individual participants (N = 34). Box plot edges show the median and the IQR (second and third quartiles), with whiskers extending 1.5 times the IQR. Red lines connect the means of each source condition. Accuracies for imagined source judgments were significantly lower than judgments for the perceived and new sources; accuracies for the new and perceived sources were not significantly different from one another. ***p < 0.001; n.s. reflects non-significance.

Fig. 4.
Fig. 4. Raincloud plots of response rates for each judgment in different sources. Colored distributions signify response rate distributions for each type of judgment, separated by each of the three source types. Colored dots represent individual participants (N = 34). The colored box plot for each type of judgment shows the median and the IQR (second and third quartiles), with whiskers extending 1.5 times the IQR from the box edges. Colored lines connect the mean response rates of judgments in each source condition. Results showed a main effect of judgment, with perceived judgments (green) having the highest response rates, followed by new judgments (purple), and the smallest response rates for imagined judgments (orange). Results also showed an interaction between source and judgment type: for perceived sources, participants were most likely to correctly judge the source as perceived; however, for imagined sources, participants were more likely to incorrectly judge the source as perceived. For new sources, participants were highly likely to correctly judge the source as new.

Fig. 5.
Fig. 5. Confidence in different judgments for different source types. Colored distributions signify confidence distributions for each type of judgment for different sources. Colored dots reflect individual participants' average confidence (N = 26). The colored box plot for each type of judgment in a source shows the median and the IQR (second and third quartiles), with whiskers denoting 1.5 times the IQR from the box edges. Colored lines connect average confidence across all participants within each source condition. A main effect of source was found on confidence ratings, with imagined sources rated with the highest mean confidence and no significant difference between the (lower) confidence ratings for new and perceived sources. Further, significant differences in confidence were identified: in the perceived source, correct perceived judgments were rated with higher confidence than incorrect judgments of new or imagined. For the imagined source (middle), there was no difference between correct imagined judgments and incorrect perceived judgments. For the new source, mean confidence did not significantly differ across judgments.

Fig. 7.
Fig. 7. Raincloud plots of metacognitive ability for different sources, as measured using Pearson's correlation. Box plots follow previous conventions in the manuscript. Metacognitive ability is measured as the Pearson correlation between accuracy and confidence ratings. The perceived source had significantly higher metacognitive ability than the imagined or new sources; the imagined and new sources did not significantly differ from each other. **p = 0.007; n.s. reflects non-significance.

Fig. 8.
Fig. 8. MPT model parameters and posterior predictive checks for the best-fitting model. (A) Estimated model parameters from the best-fitting model (d_P = d_I). Red dots represent group means, and red whiskers represent 95% credible intervals. Gray dots are mean parameter estimates for each individual participant (N = 34). (B) Posterior predictive check using the covariances of source-judgment pairs. Red triangles represent the observed covariances; gray boxplots and dots represent the predicted covariances in the posterior. The x-axis designates pairwise combinations; e.g., II-NI is the covariance of counts between imagined sources with imagined judgments and new sources with imagined judgments. In other words, in each letter pair, the first letter denotes the source and the second letter denotes the judgment.

Fig. 9.
Fig. 9. Correlations between the estimated probability of a judgment given a source and the corresponding mean confidence across participants. In the plot, perceived judgments are colored green, imagined judgments orange, and new judgments purple. Dots represent individual participants (N = 26), and lines represent the correlation between the response of a judgment in a given source and the corresponding mean confidence rating. The result suggests that it is only in the new source condition that confidence tracks accuracy at the group level. In the other two sources, perceived and imagined, the correlations showed trends for a relationship between confidence and accuracy, but none were significant. n.s. signifies "not significant," i.e., p > 0.07.

Fig. 10.
Fig. 10. Correlations between participants' accuracy for a given source condition and the corresponding machine learning classifier accuracy. Each row designates a given source (perceived, imagined, new), and each column reflects a different classifier: logistic regression (LR), support vector classifier (SVC), and BERT sequence classifier. The x-axis reflects a given participant's source accuracy, and the y-axis reflects how accurately a machine learning classifier predicted that specific participant. Blue dots reflect decoding of judgments (i.e., a classifier trained on participants' responses), and black dots reflect decoding of sources (i.e., a classifier trained to predict the true source of the test stimulus).

Table 1
Model comparison using DIC and WAIC scores.

Table 2
Model parameters in the best-fitting model (d_P = d_I), with mean estimates and 95% credible intervals for the group.