How (not) to demonstrate unconscious priming: Overcoming issues with post-hoc data selection, low power, and frequentist statistics

One widely used scientific approach to studying consciousness involves contrasting conscious operations with unconscious ones. However, challenges in establishing the absence of conscious awareness have led to debates about the extent and existence of unconscious processes. We collected experimental data on unconscious semantic priming, manipulating prime presentation duration to highlight the critical role of the analysis approach in attributing priming effects to unconscious processing. We demonstrate that common practices like post-hoc data selection, low statistical power, and frequentist statistical testing can erroneously support claims of unconscious priming. Conversely, adopting best practices like direct performance-awareness contrasts, Bayesian tests, and increased statistical power can prevent such erroneous conclusions. Many past experiments, including our own, fail to meet these standards, casting doubt on previous claims about unconscious processing. Implementing these robust practices will enhance our understanding of unconscious processing and shed light on the functions and neural mechanisms of consciousness.


Introduction
Since Freud popularized the notion that a substantial portion of our mental activity occurs outside conscious awareness, the investigation of unconscious perceptual, cognitive, affective, and neural processes has become a central theme in psychology and neuroscience (Dehaene et al., 2017; Mudrik & Deouell, 2022). In recent years, there has been a surge of research that explores the boundaries and capacities of unconscious processes, shedding light on the functions and neural mechanisms underlying consciousness (Aru et al., 2012; Dehaene et al., 2001; Doerig et al., 2021; van Gaal & Lamme, 2012). However, throughout its history, the study of unconscious processes has faced controversies due to the inherent challenge of establishing the complete absence of conscious awareness. As most of the tools of empirical science have been developed to establish the presence of hypothesized effects, empirical demonstrations of processing in the absence of conscious awareness have regularly encountered substantial criticism (Eriksen, 1960; Holender, 1986; Newell & Shanks, 2014), leading to ongoing debates regarding the extent and very existence of such unconscious processes (Hassin, 2013; Hesselmann & Moors, 2015; Holender & Duscherer, 2004).

Fig. 1. Schematic of the experiments and results for the whole samples. (a) Schematic illustration of the priming experiments, with an example of an animal and an object prime picture, and an animal and an object target word, which were crossed to yield congruent and incongruent conditions. Participants categorized the target word as quickly as possible. The awareness-check blocks were identical, except that the target words were replaced by "XXXXX", and participants categorized the prime picture as accurately as possible. In Experiment 1, a screen with a refresh rate of 60 Hz was used, so that individual frames (including the presentation of the prime) lasted 17 ms; in Experiment 2, a screen with a refresh rate of 75 Hz was used, so that individual frames lasted 13 ms.
Most of the methodological criticism centered on establishing the complete absence of awareness has been presented in theoretical frameworks, opinion pieces, and meta-analytical reviews (Berger & Mylopoulos, 2019; Peters, Kentridge, Phillips, & Block, 2017; Phillips, 2021; Rothkirch & Hesselmann, 2017; Rothkirch, Shanks, & Hesselmann, 2022; F. Schmidt, Haberkamp, & Schmidt, 2011; T. Schmidt, 2015; Shanks, 2017; Vadillo, Konstantinidis, & Shanks, 2016; Vadillo, Linssen, Orgaz, Parsons, & Shanks, 2020), providing limited direct insight into the practical impact of these issues. Are we dealing with relatively minor problems that may, perhaps, distort effect size estimates but play little role in our general understanding of unconscious processes (Michel, 2023a, 2023b; Sklar et al., 2021)? Or can these potential issues even create the false impression of evidence for unconscious processing where none exists? In this study, we sought to address these questions by collecting experimental data that directly demonstrate how common but questionable methodological and analytical practices can generate false evidence for unconscious processing, leading to potentially incorrect claims about its existence. On a positive note, our findings also highlight how the adoption of recently proposed best practices in the study of unconscious processes (Dienes, 2015; Meyen et al., 2022; Rothkirch & Hesselmann, 2017; Shanks, 2017; Vadillo et al., 2016, 2020; Zerweck et al., 2021) can help prevent such potentially erroneous conclusions. Notably, the current manuscript does not introduce any novel methodology. It applies standard methodology in the field to real data, and builds on comprehensive groundbreaking work by others that has uncovered problems with this methodology, to showcase how it can lead one to draw potentially unwarranted conclusions about the existence of unconscious processing (e.g., Meyen et al., 2022; Shanks, 2017; Vadillo et al., 2022).
Experimental investigations of unconscious processing typically involve comparing two measures: one indicating that some form of processing occurred, for example of a sensory stimulus, and another demonstrating the absence of conscious awareness of this stimulus. Common measures of processing include motor behavior (e.g., response priming), autonomic nervous system activity (e.g., skin conductance), or brain activity (e.g., neuroimaging). In research on perceptual awareness, subliminal priming paradigms are often employed to examine stimulus processing (Kouider & Dehaene, 2007; Van den Bussche, Van den Noortgate, & Reynvoet, 2009). These paradigms quantify the influence of a briefly presented, backward-masked prime stimulus on the motor response to a subsequent target stimulus. Priming effects refer to faster responses when the target is related to the prime in some way. To establish that such priming occurs unconsciously, participants typically engage in a prime discrimination task in a separate block, with chance performance indicating absence of awareness.
While demonstrations of unconscious priming from basic visual attributes (e.g., color, shape; Klotz & Neumann, 1999; T. Schmidt, 2002; Vorberg et al., 2003) have been relatively uncontroversial, the question of whether unconscious priming can also occur based on semantic properties of visual stimuli (e.g., category membership) has been subject to debate. Such priming would suggest that not only perceptual features but also the meaning of sensory information can be unconsciously represented (Dell'Acqua & Grainger, 1999; Van den Bussche, Notebaert, & Reynvoet, 2009). In a recent study, we conducted two backward-masking experiments to investigate semantic priming by testing whether pictures of animals and objects could facilitate the categorization of words as either animals or objects (Fig. 1a). While robust priming effects were observed with non-masked primes, these effects were absent when the primes were masked, indicating that semantic priming requires conscious awareness (Stein et al., 2020). To ensure that primes were fully unconscious, they were presented for only 13 ms. Indeed, in a separate awareness-check block, prime discrimination performance did not significantly differ from chance. But is it possible that we degraded the sensory signal too much and thus missed the optimal conditions (the "sweet spot") for measuring unconscious processing?
To address this possibility, the present study employed the same stimuli, but with a slightly longer prime presentation duration of 17 ms. Despite this seemingly minor adjustment, masked primes now yielded substantial and statistically significant priming effects. But can these effects be attributed to unconscious processing? As we will show below, the answer to this question depends on the specific analysis approach. For the entire participant sample and all trials, prime discrimination performance was above chance, suggesting at least some degree of awareness of the primes. However, by adopting alternative analysis strategies commonly utilized in the field of unconscious processing, such as excluding participants with high awareness scores, assessing awareness with lower statistical power (fewer trials), and accepting the null hypothesis of absence of awareness based on a nonsignificant p-value, we were able to create the impression that our study provided evidence for unconscious semantic priming. However, these alternative analysis strategies rely on questionable or outright false statistical assumptions. Interpreting the results of our new study as evidence for unconscious semantic priming would thus be premature. Here, rather than decisively determining whether unconscious semantic priming from pictures does or does not exist, our main goal was to showcase how these questionable practices can result in contradictory and possibly misleading conclusions.
The impact of several of these practices has been investigated using simulations (e.g., Miller, 2000; Sand & Nilsson, 2016; Shanks, 2017). For example, these simulation studies have revealed that analysis strategies such as post-hoc selection of participants with low awareness scores, or concluding absence of awareness based on a non-significant p-value, can result in a large rate of false positives erroneously supporting unconscious effects. Here, to provide an intuitive and comprehensive demonstration of their impact on conclusions about unconscious processing, we applied all of these practices to real empirical data collected specifically for this purpose. Note that because the present study used real empirical rather than simulated data, the "ground truth", i.e., whether unconscious semantic priming occurred or not, was unknown. While demonstrating the impact of different analysis practices on inferences about unconscious processing, we thus cannot conclusively tell whether these practices actually resulted in false inferences when analyzing the current data. However, simulation studies (e.g., Sand & Nilsson, 2016; Shanks, 2017) suggest that data patterns similar to those obtained in the present study can be accounted for by conscious (rather than unconscious) processing, indicating that questionable analysis choices may lead to false inferences about the existence of unconscious processing.
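The core of those simulation results can be reproduced in a minimal sketch (all parameter values here are hypothetical choices for illustration, not taken from any cited study): when every participant is truly, but only weakly, aware of the primes, a one-sample t-test on a modest awareness check will frequently fail to reach significance, tempting the researcher to conclude "absence of awareness".

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
# Hypothetical design: 30 participants, 50 awareness trials each,
# and a true prime discrimination accuracy of 52% for everyone
n_sims, n_participants, n_trials, true_acc = 2_000, 30, 50, 0.52

false_unaware = 0
for _ in range(n_sims):
    acc = rng.binomial(n_trials, true_acc, n_participants) / n_trials
    # "Absence of awareness" is (wrongly) inferred from p >= .05,
    # even though every simulated participant is truly aware
    if ttest_1samp(acc, 0.5).pvalue >= 0.05:
        false_unaware += 1

print(false_unaware / n_sims)  # well above half of all simulated studies
```

With these settings the awareness test is underpowered, so a majority of simulated experiments would pass as "unconscious" despite genuine awareness in every participant.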
Given that many demonstrations of unconscious processing rely on one or more of these questionable practices, our findings have implications for the validity of numerous studies investigating the scope and limitations of unconscious processing, extending beyond the realm of subliminal priming research. Consequently, they also impact our understanding of the functions and neural mechanisms underlying consciousness. Moreover, using our priming data, we showcase how recently proposed best practices for studying unconscious processing, such as employing Bayesian statistics to support the null hypothesis of absence of awareness (Dienes, 2015) or conducting additional direct statistical comparisons between processing and awareness (Meyen et al., 2022; Zerweck et al., 2021), can safeguard against possibly erroneous inferences.

Method
We report the new experiment with 17-ms primes as Experiment 1 and, for comparison, the nearly identical experiment with 13-ms primes as Experiment 2 (previously reported as Experiment 2 in our earlier study; additional methodological details can be found in our previous article, Stein et al., 2020). Both experiments comprised a masked-priming block, a non-masked-priming block, and an awareness-check block. In Experiment 1, the order of blocks was masked-priming, awareness-check, and non-masked-priming, while in Experiment 2 block order was counterbalanced across participants.

Participants and participant exclusion
Participants were Dutch native speakers with normal or corrected-to-normal vision. Experiment 1 included a new set of 45 participants, while Experiment 2 involved 76 participants. Sample sizes were not determined a priori, for example, through power analysis. In Experiment 1, participants were students from an undergraduate study program who participated in the context of their introduction to psychology class. All of them were naïve to the research question, and we tested all students from that class who were willing and available to volunteer. For Experiment 2, participants were recruited from the University of Amsterdam subject pool, and we tested as many participants as possible in the time frame allocated to the study. To maintain consistency with our previous study, we applied the same exclusion criteria, excluding participants with median response times (RTs) exceeding 700 ms or error rates above 25%. As a result, one participant was excluded from Experiment 1, resulting in a final sample of 44 participants (18 male, mean age 19.1 years), and 11 participants were excluded from Experiment 2, resulting in a final sample of 65 participants (14 male, mean age 24.8 years).

Apparatus and display
In Experiment 1, the stimuli were presented on a 21-inch Dell P2412H LCD monitor with a resolution of 1920 x 1280 pixels and a refresh rate of 60 Hz, allowing the primes to be shown for 16.7 ms. For this screen, we did not check the timing with a photodiode. In Experiment 2, we used a 19-inch Iiyama Vision Master Pro 510 (A201HT) CRT monitor with a resolution of 1024 x 768 pixels and a refresh rate of 75 Hz, enabling us to display primes for 13.3 ms to replicate the timing settings used in a previous study (Van den Bussche et al., 2009). The accuracy of this screen's timing was confirmed using a photodiode. Both experiments were programmed in MATLAB using the Psychtoolbox functions (Brainard, 1997). Participants viewed the screens from a free viewing distance of approximately 60 cm.

Stimuli
The prime stimuli were 50 line drawings, consisting of 25 animals and 25 non-animal objects. These drawings were originally selected from the set of grayscale images of the "Snodgrass and Vanderwart-like" objects (Rossion & Pourtois, 2004; Van den Bussche, Notebaert, & Reynvoet, 2009), set to 140 x 140 pixels. In the priming blocks, the target words were the Dutch names of the prime images, presented in black Arial font (capital letters, size 20 points) on a white background. The category of the target and prime stimuli (animals vs. non-animal objects) was manipulated across trials, resulting in four prime-target conditions: object-OBJECT, animal-ANIMAL, object-ANIMAL, and animal-OBJECT. Identity conditions, such as cat-CAT, were not included. There were a total of 100 prime-target pairs, which were presented twice, resulting in 200 trials per priming block. All participants were presented with the same prime-target pairs in a randomized order. For the awareness check, the target words were replaced with "XXXXX" strings, and each of the 50 prime images was presented twice, yielding 100 trials.

Priming blocks
At the beginning of the priming blocks, participants were instructed to categorize words as either objects or animals as quickly and accurately as possible using the left and right arrow keys on a standard keyboard. The assignment of buttons to target categories (animal vs. object) was counterbalanced between participants.
In the masked-priming block, participants were not informed about the presence of primes and were only instructed to respond to the target words. Each trial in this block started with the presentation of a fixation cross for 400 ms, followed by a forward mask consisting of four different random noise patterns in succession, each presented for one frame (16.7 ms in Experiment 1; 13.3 ms in Experiment 2, see Fig. 1a). Next, the prime image was shown for one frame, followed by a blank screen for two frames, and another set of four individual noise patterns serving as backward masks, each presented for one frame. Finally, the target word was presented until the participant responded. The inter-trial interval was one second.
Before the non-masked-priming block, participants were informed that the primes were irrelevant to their task and not predictive of the target category. The trial structure in this block was the same as in the masked-priming block, except that all masks were replaced by blank screens.

Awareness-check block
The procedure in the awareness-check block closely resembled that of the masked-priming block, with the key difference being that participants were informed about the presence of primes and asked to categorize them as accurately as possible without time pressure, guessing when necessary. The same left and right arrow keys were used for categorization as in the priming blocks.

Data pre-processing
For the priming blocks, trials with incorrect responses and trials with response times (RTs) faster than 0.1 s or slower than 1.0 s were excluded from all analyses. The mean percentage of excluded trials (average of the participant-wise percentages) was as follows: Experiment 1 masked (7.4%), Experiment 1 non-masked (10.3%), Experiment 2 masked (6.3%), and Experiment 2 non-masked (8.5%). Median RTs were calculated separately for congruently and incongruently primed targets. For the prime awareness check, we calculated the proportion of correct responses as well as the sensitivity index d', considering animal primes that were correctly categorized as hits and object primes that were incorrectly categorized as false alarms. Due to the small number of trials in some of our analyses, we used proportion correct as the main dependent variable for the awareness check instead of d' (with low cell counts, d' values are not appropriate, because they can become extreme due to the z-transform).
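The latter point can be made concrete in a minimal sketch (the trial counts below are hypothetical, not the study's data): d' is the difference of z-transformed hit and false-alarm rates, and with few trials a perfect cell drives the z-transform, and hence d', to infinity.

```python
from scipy.stats import norm

def dprime(n_hits, n_signal, n_fas, n_noise):
    """d' = z(hit rate) - z(false-alarm rate); here, animal primes
    categorized correctly count as hits and object primes categorized
    incorrectly count as false alarms."""
    return norm.ppf(n_hits / n_signal) - norm.ppf(n_fas / n_noise)

# At chance (hit rate = false-alarm rate = .5), d' is 0:
print(dprime(25, 50, 25, 50))  # 0.0
# With very few trials, one perfect cell makes d' infinite,
# which is why proportion correct is safer at low cell counts:
print(dprime(5, 5, 0, 5))      # inf
```

In practice, researchers often apply corrections (e.g., log-linear adjustments) to avoid the infinities, but with very low trial counts the resulting d' estimates remain extreme.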

Results and discussion
We will begin by reporting the results from the whole data sets, which include all subjects and trials. Subsequently, we will explore how alternative analysis choices could influence and reverse the interpretation of these findings. Finally, we will showcase several improved methods that prevent such possibly erroneous conclusions.

Whole data sets
Overall RTs and 95% confidence intervals for congruent and incongruent trials in the non-masked and masked condition from Experiment 1 are shown in Fig. 1b and 1c, respectively. In Experiment 1 (17-ms prime presentation), significant priming effects were observed in both the non-masked condition, t(43) = 11.34, p < .001, Cohen's d = 1.71 (Fig. 1b), and, critically, the masked condition, t(43) = 2.92, p = .005, Cohen's d = 0.44 (Fig. 1c and 1d). However, results from the awareness check revealed that participants' performance in categorizing the prime images was significantly better than chance. This was evident both when calculated as proportion correct (M = 54.7% correct, t(43) = 4.32, p < .001, Cohen's d = 0.65) and when expressed as d' (M = 0.24, t(43) = 4.29, p < .001, Cohen's d = 0.65, Fig. 1e). As this indicates some level of prime awareness, the significant priming effect in the masked condition cannot be attributed to fully unconscious processing.
Overall RTs and 95% confidence intervals for congruent and incongruent trials in the non-masked and masked condition from Experiment 2 are shown in Fig. 1f and 1g, respectively. In Experiment 2 (13-ms prime presentation), a priming effect was observed only in the non-masked condition, t(64) = 15.00, p < .001, Cohen's d = 1.86 (Fig. 1f), while there was no significant effect in the masked condition, t(64) = 0.50, p = .62, Cohen's d = 0.06 (Fig. 1g and 1h). The awareness-check results indicated that prime discrimination did not significantly differ from chance. Accuracy was M = 49.7% correct (t(64) = −0.56, p = .58, Cohen's d = −0.07), and the d' value was M = −0.02 (t(64) = −0.54, p = .59, Cohen's d = −0.07, Fig. 1i). These findings demonstrate that when prime awareness approximates zero for the entire sample, the priming effects also vanish, providing evidence against fully unconscious semantic priming from pictures in this task setup.
To summarize, when analyzing the whole data sets, evidence for semantic picture priming was only observed when prime duration was set to 17 ms, resulting in above-chance awareness, but not when prime duration was set to 13 ms and awareness approached chance levels.
At this point, we could have concluded that semantic priming from pictures necessitates awareness and cannot occur unconsciously.However, we will now demonstrate how alternative analysis approaches commonly employed in studies on unconscious priming can lead to different and even contrasting conclusions.

Post-hoc participant selection
A common strategy in the literature on unconscious processing is to exclude, post hoc, participants who exhibit relatively high levels of awareness (e.g., see the discussions by Shanks, 2017; Yaron et al., 2023). The rationale behind this approach is to create a sub-sample representing genuinely "unconscious" processing. However, in the presence of measurement noise, awareness scores do not accurately reflect actual awareness. Selecting participants with awareness scores below a specific cutoff will therefore result in a sub-sample that overrepresents participants whose true awareness scores are higher than the measured ones, leading to an underestimation of this sub-sample's true awareness (Rothkirch et al., 2022; Shanks, 2017). The cutoffs for participant exclusion are often arbitrarily determined, but a common practice is to use a binomial test to assess individual participants' prime discrimination accuracy against chance and to exclude those whose performance significantly deviates from chance (p < .05).
Applying this approach to Experiment 1, our initial sample of 44 participants was reduced to 30 participants (13 participants showed significantly better-than-chance performance, and one participant showed significantly worse-than-chance performance). Despite the reduction in sample size, the priming effect remained statistically significant for this "unconscious sub-sample", t(30) = 2.76, p = .010, Cohen's d = 0.50, with no decrease in effect size compared to the full sample (Fig. 2a and 2b). Importantly, the awareness-check data for the sub-sample indicated the success of the selection procedure: prime discrimination did not significantly differ from chance (accuracy: M = 51.4% correct, t(29) = 1.79, p = .084, Cohen's d = 0.33; d': M = 0.07, t(29) = 1.78, p = .085, Cohen's d = 0.33, Fig. 2c and 2d).
It is worth noting that achieving these results by excluding 14 out of 44 participants (i.e., 32% of the initial sample) represents an extreme form of post-hoc data exclusion. Although such proportions are not unprecedented in the literature (Desender, Van Wentura & Frings, 2005), in particular when methods with large between-subject variability such as continuous flash suppression are used (Goldstein et al., 2020; Sklar et al., 2012), based on our own work using backward masking, a more common practice appears to involve excluding 10-20% of participants post hoc based on their awareness score (Fahrenfort, Scholte, & Lamme, 2007; van Gaal, Ridderinkhof, Scholte, & Lamme, 2010; Wokke, van Gaal, Scholte, Ridderinkhof, & Lamme, 2011). In the following sections, we will illustrate how we could have "optimized" our post-hoc exclusion protocol by using a different cutoff and by reducing the statistical power of the awareness check.

Iterative post-hoc selection
We implemented an iterative post-hoc selection procedure in which participants were sorted based on their prime discrimination accuracy, and we iteratively excluded those with the highest accuracy (Fig. 3a, purple line). Starting from our full sample, we excluded participants until the group's prime discrimination was no longer significantly better than chance. This procedure resulted in the exclusion of 11 participants (25% of our sample). We then sorted participants' priming effects based on these awareness scores (Fig. 3b and 3c, purple lines). Interestingly, the priming effects remained largely unaffected by this iterative subject-exclusion procedure. In the masked condition, significant priming effects were observed even after excluding up to 25 participants. Only when more than 25 participants were excluded did the masked priming effects become too variable to reach statistical significance. Thus, by excluding the 11 to 25 participants with the highest awareness scores, we obtained significant masked priming effects while awareness scores were not significantly above chance (Fig. 3b, purple line).
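One illustrative implementation of such an iterative procedure is sketched below (the awareness scores are hypothetical, not the study's data, and the stopping rule is a one-sample t-test against chance):

```python
import numpy as np
from scipy.stats import ttest_1samp

def iterative_exclusion(accuracies, chance=0.5, alpha=0.05):
    """Drop the highest-scoring participant until group-level prime
    discrimination is no longer significantly above chance."""
    acc = np.sort(np.asarray(accuracies, dtype=float))  # ascending
    while len(acc) > 2 and acc.mean() > chance and \
            ttest_1samp(acc, chance).pvalue < alpha:
        acc = acc[:-1]  # exclude the most "aware" participant
    return acc

# Hypothetical proportion-correct awareness scores for 10 participants:
scores = [0.48, 0.50, 0.51, 0.52, 0.53, 0.55, 0.56, 0.58, 0.62, 0.70]
kept = iterative_exclusion(scores)
print(len(kept), round(kept.mean(), 3))
```

Note that the retained group's mean accuracy still exceeds .5; the procedure merely renders the excess statistically invisible, which is the core of the problem described above.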
Another common issue in studies on unconscious processing is that the awareness measure is often collected with (substantially) less power, i.e., fewer trials, than the processing measure (Kouider, Dehaene, Jobert, & Le Bihan, 2007; Kunde, 2003; Dehaene et al., 1998; Fahrenfort et al., 2012), although this is not always the case (e.g., Ansorge & Neumann, 2005; Dijkstra et al., 2021; Mongelli et al., 2019; T. Schmidt, 2002; Stein et al., 2021; Stein & Peelen, 2021; Vorberg et al., 2003). To assess how this can impact conclusions about unconscious processing, we applied an analogous iterative post-hoc selection strategy using only the first 50 out of 100 trials from the awareness-check block to determine the sub-sample and to measure awareness, effectively reducing the power of the awareness measure (Fig. 3a-c, orange lines). With this approach, excluding only the two participants with the highest awareness scores (5% of our sample) was sufficient to obtain a significant masked priming effect while awareness scores were not significantly above chance (Fig. 3b, orange line).
For comparison, we also applied the same iterative post-hoc selection procedure to Experiment 2 (with the shorter prime duration), sorting and iteratively excluding participants based on their awareness scores in descending order (Fig. 3d). Interestingly, regardless of the number of included trials from the awareness check and the number of included participants, it was impossible to create a scenario of priming without awareness through post-hoc participant exclusion (Fig. 3e). In the masked condition, the priming effect never reached statistical significance. In the non-masked condition, the priming effects were consistently strong for all post-hoc selected sub-samples, similar to Experiment 1 (Fig. 3c and 3f).
In summary, our iterative post-hoc selection procedure illustrates how easily a priming-without-awareness pattern can be created through post-hoc participant sub-sampling. Moreover, this issue is exacerbated by low statistical power in the awareness check. Essentially, combining post-hoc participant selection with low power capitalizes on noise in the awareness data, creating the false impression of absence of awareness in the selected subjects. However, it should be noted that even with these analysis choices, priming and awareness appeared largely independent (Fig. 3b). At first glance, this may suggest support for the notion of a distinct unconscious process underlying priming, dissociated from the conscious process underlying prime discrimination. However, as we will demonstrate in the following sections, this apparent independence can be attributed to the low reliability of our measures (Hannula et al., 2005; Malejka et al., 2021).

Unreliable effects
Fig. 4a shows that in Experiment 1, there was virtually no correlation between individual participants' masked-priming effects and prime discrimination accuracy, r(42) = 0.06, p = .70. However, this lack of correlation can be attributed to the low reliability of the measures. In fact, there was no significant correlation between odd and even trials for the masked-priming effect, r(42) = 0.01, p = .97 (Fig. 4b). Even for the priming effect in the non-masked condition, the correlation between odd and even trials was only of medium size, r(42) = 0.31, p = .043 (Fig. 4c). Such low reliabilities for individual differences are common in tasks in cognitive psychology and neuroscience that produce robust within-subject effects but exhibit low between-subject variability (Hedge et al., 2018; Vadillo et al., 2022). The awareness check also showed a relatively low correlation between odd and even trials, r(42) = 0.48, p = .001 (Fig. 4d). Since the reliabilities of the two measures provide an upper bound for the maximum expected correlation between these variables, the apparent independence of the priming effect and awareness is simply a consequence of the low reliabilities and does not provide evidence for unconscious processing (Malejka et al., 2021; Vadillo et al., 2022).
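The attenuation ceiling implied by such reliabilities can be sketched as follows. The Spearman-Brown step (estimating full-test reliability from the split-half correlation) is our illustrative choice here, not necessarily the article's exact procedure, and the plugged-in values are simply the odd-even correlations reported above for Experiment 1:

```python
import math

def spearman_brown(r_half):
    """Estimate full-test reliability from an odd-even (split-half)
    correlation via the Spearman-Brown prophecy formula."""
    return 2 * r_half / (1 + r_half)

def attenuation_ceiling(rel_x, rel_y):
    """The correlation observable between two noisy measures is
    bounded above by sqrt(rel_x * rel_y)."""
    return math.sqrt(rel_x * rel_y)

priming_rel = spearman_brown(0.01)    # masked-priming effect, r_oe = .01
awareness_rel = spearman_brown(0.48)  # prime discrimination, r_oe = .48
print(round(attenuation_ceiling(priming_rel, awareness_rel), 2))  # 0.11
```

Even a perfect true association between masked priming and awareness could thus yield an observed correlation of only about .11 here, so the observed r = .06 is uninformative about their true relationship.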
Analogous analyses for Experiment 2 revealed similarly low correlations (Fig. 4e-h), creating the misleading impression that masked-priming effects were independent of awareness. Moreover, in Experiment 2, the odd-even correlation for the awareness check was substantially lower than in Experiment 1 and not significant, r(63) = −0.10, p = .43 (Fig. 4h), which is consistent with participants randomly guessing in the awareness check. These analyses demonstrate that the apparent independence of awareness and processing measures cannot be interpreted without considering the reliability of the measures. Furthermore, the presence of substantial noise in the awareness-check data, even when group-level awareness is above chance as in Experiment 1, highlights how post-hoc participant selection capitalizes on random differences between participants.
Taken together, the previous sections illustrate how common analysis practices can create the impression that Experiment 1 provided evidence for unconscious semantic priming from pictures. These practices rely on strategies such as post-hoc participant selection, testing the selected participants on the same non-independent awareness data, employing frequentist statistics to conclude absence of awareness from nonsignificant results, and using the "double t-test approach" of testing processing and awareness separately against chance to support the claim of processing without awareness (Meyen et al., 2022). In the following sections, we will demonstrate how improved practices can prevent potentially erroneous conclusions based on these problematic strategies.

Selecting and testing the "unconscious sub-sample"
As discussed above, selecting and testing the post-hoc selected "unconscious sub-sample" based on individual awareness scores assumes that measured prime discrimination accurately reflects a participant's true ability to discriminate the primes. However, measured prime discrimination scores are influenced by measurement error, resulting in a combination of prime discrimination ability and noise. Consequently, post-hoc selecting participants with low awareness scores as supposedly "unconscious" capitalizes on the noise in the dataset, leading to an underestimation of awareness in the post-hoc selected sub-sample. If the post-hoc selected sub-sample were retested, it would be expected to yield higher awareness scores due to regression to the mean (Rothkirch et al., 2022; Shanks, 2017).
This selection procedure is problematic because the same data used to create the "unconscious sub-sample" are then used to demonstrate the selected sub-group's lack of awareness (i.e., performance not significantly better than chance). Performing post-hoc data selection and testing on the same (i.e., non-independent) data is a circular analysis strategy akin to "double dipping" (Kriegeskorte et al., 2009). In essence, the data points that support the hypothesis are selected after observing the results and then used to confirm the hypothesis. As Schmidt (2015) succinctly put it: "Selecting only those participants […] that meet a specified visibility criterion is analogous to testing a new medication and then discarding all those patients who die from it, concluding that all "suitable" patients do fine under the new drug."
One possible solution is to employ independent datasets for selecting and testing the potentially unconscious sub-group of participants (Shanks, 2017; for alternative approaches, see Leganes-Fonteneau et al., 2021; Rothkirch et al., 2022; Yaron et al., 2023). Following Schmidt's (2015) analogy, patients could be screened for eligibility in a trial testing a new medication, allowing only suitable patients to enter the trial; whether these patients then do fine (or die) under the new drug would be evaluated in a subsequent step, using data independent from the screener. In the context of our masked-priming experiment, this could amount to using half of the trials from the awareness check to select participants with low awareness scores and the other half to test awareness in the selected subgroup.
Adopting this approach, we selected participants based on their prime discrimination accuracy in odd trials, again employing a binomial test to exclude participants whose performance significantly differed from chance (p < .05). This resulted in a sub-sample of 35 participants out of 44. The sub-sample's prime discrimination accuracy in odd trials (used for selection) was 50.9% correct, which did not significantly differ from chance (t(34) = 1.34, p = .19, Cohen's d = 0.22). However, testing this sub-sample using independent data revealed a different pattern. In even trials, the sub-sample's accuracy was 54.1% correct, significantly better than chance (t(34) = 3.19, p = .003, Cohen's d = 0.54). The plotted data exhibited the expected regression-to-the-mean pattern (Fig. 5a), with participants selected for extreme scores in one half of the trials showing less extreme scores in the other half.
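The odd/even selection-and-testing logic can be sketched as follows. All sample sizes, accuracies, and variable names here are illustrative, not our actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical set-up: 44 participants with modest true prime
# discrimination ability, 50 odd and 50 even awareness-check trials
n_participants, n_odd, n_even = 44, 50, 50
true_acc = np.clip(rng.normal(0.54, 0.03, n_participants), 0, 1)

odd_correct = rng.binomial(n_odd, true_acc)         # selection data
even_acc = rng.binomial(n_even, true_acc) / n_even  # independent test data

# Step 1: select participants whose odd-trial accuracy does not differ
# from chance (two-sided binomial test against p = .5)
pvals = np.array([stats.binomtest(k, n_odd, 0.5).pvalue for k in odd_correct])
selected = pvals >= .05

# Step 2: test awareness in the selected sub-sample on independent data
t, p = stats.ttest_1samp(even_acc[selected], 0.5)
print(f"n = {selected.sum()}, even-trial accuracy = "
      f"{even_acc[selected].mean():.3f}, t = {t:.2f}, p = {p:.4f}")
```

Because the even trials play no role in the selection, the Step-2 test is not biased by the selection procedure.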
This outcome is expected given the considerable measurement error in our data, as reflected in the relatively weak correlation between odd and even trials (Fig. 4d). Moreover, this pattern is not unique to a particular cut-off or trial selection. We also sorted participants in descending order according to their awareness scores in odd trials (or in the first half of all trials), and iteratively excluded participants with the highest scores (Fig. 5b). With this procedure as well, when awareness scores were measured independently in even trials (or the second half of all trials), they consistently exceeded chance level for the majority of sub-samples (Fig. 5c).
Hence, using independent data for selecting and testing awareness led to entirely different conclusions compared to using non-independent data. The analysis based on independent data failed to provide evidence for the absence of awareness and, consequently, did not support the notion of unconscious semantic priming. While selecting and testing an "unconscious sub-sample" using the same data represents an invalid strategy that capitalizes on noise to create a false impression of absent awareness (a "sampling fallacy"; e.g., F. Schmidt et al., 2011), the use of independent data for selection and testing appears to be a promising solution (Shanks, 2017). However, implementing this approach in practice may be challenging, requiring greater statistical power (trials and participants) in the awareness check to reduce noise and enhance reliability. Another limitation is that with such sub-sampling, inferences would be confined to the subpopulation that performs poorly in the awareness check.
Another solution to the sub-sampling problem is to estimate the true awareness in the sub-sample by estimating the bias in the measured awareness score, using a formula provided by Rothkirch and colleagues (2022). This formula takes into account the measured awareness across the whole sample and an estimate of the awareness score's reliability (which we measured as the correlation between odd and even trials, see above). Applying this bias correction to the awareness scores of the sub-sample of Experiment 1, the measured awareness score of 51.4% increased to 53.1% correct, a value that is unlikely to be considered "unconscious."
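A minimal sketch of such a bias correction is given below. It implements the classical regression-to-the-mean form, shrinking the sub-sample's observed mean toward the whole-sample mean in proportion to the measure's reliability; the exact formula of Rothkirch and colleagues (2022) may differ in detail, and the numbers used here are purely illustrative, not the actual estimates from Experiment 1.

```python
def bias_corrected_mean(sub_mean, grand_mean, reliability):
    """Shrink a post-hoc selected sub-sample's observed mean back toward
    the whole-sample mean in proportion to the measure's unreliability
    (a classical regression-to-the-mean correction)."""
    return grand_mean + reliability * (sub_mean - grand_mean)

# Purely illustrative numbers: sub-sample at 51.4% correct, whole
# sample at 54.7%, split-half reliability of 0.5
print(f"{bias_corrected_mean(0.514, 0.547, 0.5):.4f}")  # -> 0.5305
```

With perfect reliability the correction leaves the observed sub-sample mean unchanged; with zero reliability the best estimate is simply the whole-sample mean.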

Statistically demonstrating absence of awareness
So far, we followed the conventional approach of demonstrating absence of awareness by using standard frequentist statistical tests to show that awareness scores did not exceed chance performance (p > .05). However, the p-values returned by these tests only indicate the probability of the observed effect under the null hypothesis of no effect; they do not provide evidence for the null hypothesis itself. Large, non-significant p-values may simply result from low statistical power or high variability. To provide more convincing statistical evidence for absence of awareness, two alternative approaches can be employed: equivalence tests (Lakens et al., 2018) and Bayesian statistics (Dienes, 2015).
Equivalence tests, specifically the two one-sided tests (TOST) procedure, can determine whether an observed effect is surprisingly small under the assumption that a true effect at least as large as a pre-defined smallest effect size of interest (SESOI) exists (Lakens et al., 2018). Thus, the meaning of a p-value in the TOST procedure changes from reflecting the probability of observing an effect as large or larger given that the true effect does not exist (the probability of a type I error) to reflecting the probability of observing an effect as small or smaller given that the SESOI does exist (the probability of a type II error). In other words, in the TOST procedure a significant result means that the null hypothesis of an effect at least as large as the SESOI can be rejected. The challenge lies in setting the SESOI, which involves subjectively defining the highest awareness score that still indicates the absence of awareness. For our demonstration, we set the SESOI to a d' score of 0.10 and a proportion correct of 0.52, respectively (smaller than the means obtained for the entire sample in Experiment 1, implying that no significant results can be expected).
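As a sketch of the TOST logic for a one-sample design (our own illustration, not the spreadsheet calculators by Lakens and colleagues), the two one-sided tests can be implemented as follows; the data are simulated:

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, mu0, sesoi):
    """Two one-sided tests (TOST) for a one-sample design: a small
    p-value rejects true effects at least as large as mu0 +/- sesoi."""
    x = np.asarray(x, dtype=float)
    n, se = len(x), stats.sem(x)
    t_upper = (x.mean() - (mu0 + sesoi)) / se  # vs. upper equivalence bound
    t_lower = (x.mean() - (mu0 - sesoi)) / se  # vs. lower equivalence bound
    p_upper = stats.t.cdf(t_upper, n - 1)      # H0: mean >= mu0 + sesoi
    p_lower = stats.t.sf(t_lower, n - 1)       # H0: mean <= mu0 - sesoi
    return max(p_upper, p_lower)               # TOST p = larger of the two

# Illustrative data: accuracy scores scattered closely around chance (0.5)
rng = np.random.default_rng(2)
accuracy = rng.normal(0.50, 0.04, 65)
print(f"TOST p = {tost_one_sample(accuracy, mu0=0.5, sesoi=0.02):.4f}")
```

The overall TOST p-value is the larger of the two one-sided p-values, so equivalence is concluded only when both bounds can be rejected.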
For the post-hoc selected sub-sample from Experiment 1 (selected based on binomial tests on non-independent data, p > .05), the TOST procedure (using the spreadsheet TOST calculators provided by Lakens and colleagues) revealed that the observed effects (d': M = 0.07, SD = 0.22; accuracy: M = 0.51, SD = 0.04) were not significant for d' (t(29) = −0.68, p = .25) or accuracy (t(29) = −0.71, p = .24). Despite the questionable data selection based on non-independent data, the awareness scores for the supposed "unconscious" sub-sample from Experiment 1 were not surprisingly small given the SESOI. In contrast, for the entire sample from Experiment 2, the observed effects (d': M = −0.02, SD = 0.24; accuracy: M = 0.50, SD = 0.05) were significant for both d' (t(64) = 2.85, p = .003) and accuracy (t(64) = 2.90, p = .003). This indicates that the observed awareness scores were surprisingly small compared to the SESOI, providing further evidence for genuine absence of awareness in Experiment 2.
To illustrate how these conclusions depend on the SESOI, we tested a range of SESOIs for both d' (Fig. 6a) and accuracy (Fig. 6b) scores.For example, setting the SESOI to a d' score of 0.16 would yield a significant TOST even for the post-hoc selected sub-sample in Experiment 1. Conversely, setting the SESOI to a d' score of 0.04 would result in a non-significant TOST even for Experiment 1. Plotting these results over a range of SESOIs provides more insight into the uncertainty inherent in claims of absence of awareness.
Alternatively, the Bayesian statistical framework can be employed to estimate the likelihood of the observed data under the null hypothesis compared to the alternative hypothesis, making it particularly useful for studying unconscious processes (Dienes, 2015). Bayesian t-tests also require a subjective decision when specifying the prior distribution for the alternative hypothesis. Using the default from the JASP software package (Cauchy distribution with scale 0.707) and the labels for classifying Bayes factors from Lee and Wagenmakers (2014), Bayesian t-tests for the entire sample from Experiment 1 showed "extreme" evidence for above-chance performance for both d' and accuracy (BF10 > 238). For the post-hoc selected sub-sample (selected based on binomial tests on non-independent data, p > .05), the null hypothesis and the alternative hypothesis were similarly likely, resulting in only "anecdotal" evidence for chance-level performance for both d' (BF01 = 1.27) and accuracy (BF01 = 1.25). In contrast, for the entire sample from Experiment 2, there was "moderate" evidence for the null hypothesis for both d' and accuracy (BF01 = 6.39 and BF01 = 6.33, respectively).
Again, it is important to note that these estimates depend on the prior distribution (illustrated in Fig. 6c for d' and in Fig. 6d for accuracy). Narrower priors, which assign more weight to smaller effect sizes, decrease the probability of the null hypothesis. For example, using a normal distribution with M = 0 (for d'; M = 0.5 for accuracy) and SD = 0.5, the Bayes factors supporting the null hypothesis for d' and accuracy decrease to BF01 = 0.74 and BF01 = 0.73, respectively, for the reduced sample from Experiment 1. Even with this narrower prior, Experiment 2 still yields "moderate" evidence for both d' and accuracy (BF01 = 3.61 and BF01 = 3.58, respectively). Presenting results across a range of priors helps highlight the uncertainties inherent in claims of absence of awareness.
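For readers who want to explore such prior dependence programmatically, a one-sample JZS Bayes factor with a Cauchy(0, r) prior on effect size can be computed by numerical integration (following the formula of Rouder et al., 2009; this is our own sketch, not the JASP implementation used above):

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=0.707):
    """One-sample JZS Bayes factor (Rouder et al., 2009) with a
    Cauchy(0, r) prior on effect size; BF10 > 1 favours an effect,
    and BF01 = 1 / BF10 favours the null."""
    nu = n - 1
    # Marginal likelihood under H0 (effect size delta = 0)
    h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Marginal likelihood under H1: integrate over g with the JZS
    # inverse-gamma(1/2, 1/2) prior, rescaled by r**2
    def integrand(g):
        a = 1 + n * g * r**2
        return (a**-0.5
                * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g**-1.5 * np.exp(-1 / (2 * g)))
    h1, _ = integrate.quad(integrand, 0, np.inf)
    return h1 / h0

bf10 = jzs_bf10(t=1.0, n=30)
print(f"BF10 = {bf10:.3f}, BF01 = {1 / bf10:.3f}")
```

Varying the scale parameter r (here defaulting to the JASP default of 0.707) directly shows how the strength of evidence for the null depends on the width of the prior.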
Thus, for both methods (TOST or Bayesian), whether a null effect of no awareness can be demonstrated statistically depends directly on the smallest effect size of interest (SESOI) that one specifies, or on the location and width of the prior distribution that one deems plausible for non-zero effect sizes.

Directly comparing processing vs. awareness
All previous analyses represent some form of the so-called "double t-test approach" (Meyen et al., 2022), where processing and awareness are separately tested against zero (also see Nieuwenhuis et al., 2011). However, it has been argued that this approach is fundamentally flawed. Instead, it is suggested that unconscious processing should be demonstrated by directly comparing processing effects to awareness scores, typically after transforming them to the same scale (Meyen et al., 2022; Reingold & Merikle, 1988, 1990; T. Schmidt, 2002; T. Schmidt & Vorberg, 2006; von Luxburg & Franz, 2015; Zerweck et al., 2021).
One motivation for this direct comparison is that processing and awareness are often measured on different scales (typically, processing on a continuous scale [e.g., RTs], awareness on a binary scale [e.g., correct/incorrect]) and have differences in statistical power (typically, more trials for the processing measure). Finding an effect in processing but not in awareness may simply reflect these scale and power differences. Additionally, while the processing measure represents average differences between conditions calculated over many trials (e.g., average RTs to different primes), the awareness measure reflects participants' trial-by-trial ability to classify these conditions (e.g., perceptual sensitivity for different primes). Simulations and empirical data have shown that, even with identical underlying data, the processing measure is more likely to yield a significant result than the awareness measure (Meyen et al., 2022; von Luxburg & Franz, 2015).
On a conceptual level, the double t-test approach assumes that the awareness measure is exhaustively sensitive to all relevant conscious information and, at the same time, exclusively sensitive to conscious information, assumptions that are unlikely to be met. By directly comparing processing and awareness, these assumptions can be relaxed, only requiring that the awareness measure is more sensitive to conscious information than the processing measure.
To test whether the findings from the masked condition of Experiment 1 would survive the direct processing-vs.-awareness comparison, we transformed RTs to d' scores using the median-split technique (categorizing congruent trials with RTs faster than the median as hits, incongruent trials with RTs slower than the median as correct rejections, etc.; Meyen et al., 2022; T. Schmidt, 2002). This transformation did not distort the obtained effects. As can be seen in Fig. 7a, the correlation between the priming effect measured in raw RTs and the transformed d' scores was very strong, r(42) = 0.86, p < .001. The effect sizes were also similar for both the whole sample (raw RTs: t(43) = 2.93, p = .005, Cohen's d = 0.44; d' scores: t(43) = 2.92, p = .006, Cohen's d = 0.44) and the sub-sample post-hoc selected on non-independent data (raw RTs: t(29) = 2.76, p = .010, Cohen's d = 0.50; d' scores: t(29) = 2.76, p = .010, Cohen's d = 0.44).
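A sketch of this median-split transformation is given below. This is our own illustrative implementation with simulated RTs (see Meyen et al., 2022, for the authoritative procedure); the smoothing constant of 0.5 added to the counts is our choice to keep rates away from 0 and 1:

```python
import numpy as np
from scipy.stats import norm

def rt_priming_to_dprime(rt_congruent, rt_incongruent):
    """Median-split transformation of an RT priming effect into d':
    congruent trials faster than the pooled median count as hits,
    incongruent trials faster than it as false alarms."""
    rt_c = np.asarray(rt_congruent, dtype=float)
    rt_i = np.asarray(rt_incongruent, dtype=float)
    med = np.median(np.concatenate([rt_c, rt_i]))
    # Add 0.5 to the counts to keep rates away from 0 and 1
    hit_rate = (np.sum(rt_c < med) + 0.5) / (len(rt_c) + 1)
    fa_rate = (np.sum(rt_i < med) + 0.5) / (len(rt_i) + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Illustrative participant: congruent responses ~40 ms faster on average
rng = np.random.default_rng(3)
d = rt_priming_to_dprime(rng.normal(570, 60, 100), rng.normal(610, 60, 100))
print(f"priming effect expressed as d' = {d:.2f}")
```

Once priming is expressed in d', it can be compared to prime-discrimination d' from the awareness check on the same scale, e.g., with a paired t-test.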
However, after transformation to d', the priming effects did not exceed prime discrimination in d' (Fig. 7b and 7c). When directly comparing priming vs. awareness, prime-discrimination sensitivity was significantly higher than the priming effect for the whole sample, t(43) = 3.17, p = .003, Cohen's d = 0.48, and for the sub-sample the difference between priming and awareness was not significant, t(43) = −0.25, p = .81, Cohen's d = −0.04 (Fig. 7d). Thus, with this approach there was no evidence for unconscious priming even for the sub-sample selected for low awareness. This demonstrates how the direct comparison between processing and awareness can prevent incorrect conclusions, even when potentially problematic data selection practices are used.
In addition to sidestepping the issues with supporting the null hypothesis discussed above, there are further practical benefits of replacing the double t-test approach with the direct processing-awareness comparison. For instance, in the double t-test approach, low power and high variability in the awareness measure make a non-significant effect, and thus false claims about unconscious processing, more likely. In contrast, when directly comparing a low-powered, noisy awareness measure to the processing measure, a significant difference, and thus false claims about unconscious processing, becomes less likely.
Furthermore, the direct comparison approach highlights the invalidity of post-hoc sub-sample selection based on non-independent awareness data, as it would involve removing data inconsistent with the investigator's hypothesis from the key statistical test. Implementing the direct processing-awareness comparison not only serves as a conceptual and statistical requirement for demonstrating unconscious processing but also encourages improvements in experimental design, analyses, and interpretation.

Conclusion
In conclusion, our findings underscore the significance of addressing methodological issues in order to advance research on unconscious processing. It should be emphasized that the present results do not settle the debate of whether unconscious semantic priming from pictures does or does not exist. Because we used real empirical data, the ground truth, i.e., whether unconscious semantic priming did or did not occur in Experiment 1, was unknown. To determine whether the adopted questionable analysis choices actually lead to false inferences, simulation studies, in which the ground truth is known, are necessary. Such simulation studies show that data patterns similar to those obtained in Experiment 1 after applying these questionable analysis choices are unlikely to reflect genuine unconscious processing (e.g., Sand & Nilsson, 2016; Shanks, 2017). However, it is possible that we still missed the "sweet spot" for measuring priming effects in the absence of awareness, as we did not titrate stimulation parameters (e.g., stimulus timing, contrast) to individual participants (e.g., Hesselmann et al., 2018; Rothkirch & Hesselmann, 2018). Instead, our goal was to provide an empirical demonstration of how common but potentially problematic analysis choices can lead to contradictory and possibly misleading conclusions about unconscious processes.
The identified key issues, including the double t-test approach, post-hoc subject exclusion using non-independent data, low statistical power, and unreliability of effects and measures, collectively contribute to inconsistencies within this field. Although a systematic review is beyond the scope of this paper, it is clear that these issues are prevalent in cognitive neuroscience studies on consciousness that contrast conscious with unconscious processes to reveal the functions and neural correlates of consciousness. This contrastive approach relies on accurate measurement of the scope and limits of unconscious processes, because overestimation of the extent of unconscious processing necessarily results in underestimation of conscious processes. However, most if not all studies that purportedly demonstrate unconscious high-level processing involve one or several of the identified issues, and their findings may thus reflect an undue overestimation of unconscious processes. For example, although statistically invalid, the double t-test approach is the currently dominant approach in the field (for a recent overview, see Meyen et al., 2022), behavioral studies on unconscious priming often involve post-hoc subject exclusion using non-independent data (e.g., reviewed in Shanks, 2017), many neuroimaging studies tend to measure awareness with low statistical power (discussed in Stein et al., 2021), and reliability estimates are rarely if ever provided.
However, there are effective strategies to mitigate these issues and enhance the rigor of unconscious processing research. One effective approach involves adopting direct performance-awareness contrasts, which can replace the problematic double t-test approach and provide a more direct assessment of the relationship between processing and awareness. Additionally, employing equivalence tests or Bayesian tests can elucidate the inherent uncertainty in establishing the absence of awareness. When subject exclusion is necessary, it is imperative to rely on independent data to mitigate biases introduced by using the same data for subject selection and testing. Furthermore, increasing statistical power is crucial. Higher-powered studies not only enhance the likelihood of detecting genuine unconscious effects but also reduce the risk of erroneously concluding the absence of awareness.
Considering these challenges, it is important to acknowledge that for the time being evidence for unconscious processing is not a dichotomous outcome. The strength of evidence for unconscious processing varies depending on the methodological choices made in a study. The weakest form of evidence arises from the combination of the double t-test approach, non-independent subject exclusion, and low statistical power in awareness checks. Gradual improvements can be achieved by implementing more robust methodologies, such as Bayesian tests with a range of priors and direct performance-awareness comparisons.
In summary, addressing methodological pitfalls and embracing more rigorous approaches will contribute to a better understanding of unconscious processing, paving the way for more accurate and reliable insights into the functions and neural mechanisms of consciousness.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. 1. Schematic of the experiments and results for the whole samples. (a) Schematic illustration of the priming experiments, with an example of an animal and an object prime picture, and an animal and an object target word, which were crossed to yield congruent and incongruent conditions. Participants categorized the target word as quickly as possible. The awareness-check blocks were identical, except that the target words were replaced by "XXXXX", and participants categorized the prime picture as accurately as possible. In Experiment 1, a screen with a refresh rate of 60 Hz was used, so that individual frames (including the presentation of the prime) lasted 17 ms; in Experiment 2, a screen with a refresh rate of 75 Hz was used, so that individual frames lasted 13 ms. (b-e) Results for the whole sample from Experiment 1, showing (b) RTs from the non-masked condition, (c) RTs from the masked condition, (d) the priming effect (RTs in the incongruent condition minus RTs in the congruent condition), and (e) prime discrimination sensitivity in the awareness check. (f-i) Results from Experiment 2. Gray circles represent individual participants, black circles the group means, and error bars the associated 95% confidence intervals.
Fig. 2. Results from Experiment 1 after post-hoc selection. (a) RTs from the masked condition for all participants, the post-hoc selected sub-sample, and the excluded sub-sample. (b) The priming effect (difference between incongruent and congruent conditions). (c) Prime discrimination sensitivity and (d) prime discrimination accuracy in the awareness check. Means are represented by black circles for the whole sample, by pink circles for the post-hoc selected sub-sample, and by green circles for the excluded sub-sample. Error bars represent the associated 95% confidence intervals, and light circles represent individual participants. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Iterative post-hoc participant selection based on prime discrimination accuracy. (a) Participants from Experiment 1 were sorted in descending order according to their prime discrimination accuracy and then iteratively excluded, going from the full sample to the five participants with the lowest awareness scores. Here and in sub-plot (d), purple and orange lines represent the sub-sample's mean prime discrimination accuracy. (b, c) This sorting based on awareness scores was then applied to the (b) priming effects in the masked condition and to the (c) priming effects in the non-masked condition. In sub-plots (b, c, e, f), purple and orange lines represent the sub-sample's mean priming effect. (d-f) The same procedures applied to the data from Experiment 2. In all subplots, purple lines refer to analyses including all 100 trials from the awareness check, and orange lines refer to analyses restricted to the first 50 trials from the awareness check. Shaded error bars represent 95% confidence intervals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Reliability of priming and awareness measures. (a) Correlation between the masked-priming effect and prime-discrimination accuracy in Experiment 1. (b) Correlation between the masked-priming effect in even and odd trials in Experiment 1. (c) Correlation between the priming effect in non-masked even and odd trials in Experiment 1. (d) Correlation between prime-discrimination accuracy in even and odd trials in Experiment 1. (e-h) Analogous analyses for Experiment 2. Circles represent individual participants (for Experiment 1, pink circles represent participants selected based on a binomial test against chance, p > .05; green circles represent excluded participants, p < .05), solid black lines the best-fitting regression lines, and dashed lines the associated 95% confidence intervals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Sub-sampling and testing on independent awareness-check data in Experiment 1. (a) Prime discrimination accuracy in odd and even trials for all participants, the sub-sample selected for not being significantly different from chance in odd trials, and the excluded sub-sample. Means are represented by black circles for the whole sample, by pink circles for the selected sub-sample, and by green circles for the excluded sub-sample. Error bars represent the associated 95% confidence intervals, and light circles represent individual participants. (b) Participants were sorted in descending order according to their prime discrimination accuracy in odd trials (purple line) or in the first half of the awareness check (orange line) and then iteratively excluded, going from the full sample to the five participants with the lowest awareness scores. (c) This sorting was then applied to even trials (purple line) or to the second half of the awareness check (orange line). Shaded error bars represent 95% confidence intervals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Equivalence tests and Bayesian statistics to demonstrate absence of awareness. (a, b) Results from the two one-sided tests (TOST) equivalence-test procedure as a function of different smallest effect sizes of interest (SESOIs). Equivalence tests based on the TOST procedure can be used to determine whether an observed effect is surprisingly small (p < .05) under the assumption that a true effect, at least as large as a specific SESOI, exists. The plotted p-values show when measured (a) d' scores and (b) accuracy scores, separately for Experiment 1 (sub-sample) and Experiment 2 (whole sample), were significant given different (arbitrarily chosen) SESOIs. (c, d) Results from Bayesian t-tests as a function of priors (all normal distributions) of different widths (SDs). Bayes factors provide estimates of the probability of the null hypothesis of chance performance (BF01) relative to the alternative hypothesis of performance deviating from chance, given the data. The plotted Bayes factors show when measured (c) d' scores and (d) accuracy scores, separately for Experiment 1 (sub-sample) and Experiment 2 (whole sample), provided support for the null hypothesis, given different (arbitrarily chosen) SDs of the prior distribution. In all plots, y-axis scales are chosen so that values higher up on the y-axes provide more evidence for absence of awareness.

Fig. 7. Direct comparison of priming vs. awareness in Experiment 1. (a) Correlation between raw priming effects in ms and transformed priming effects in d'. Circles represent individual participants (pink circles represent participants selected based on a binomial test against chance, p > .05; green circles represent excluded participants, p < .05), solid black lines the best-fitting regression lines, and dashed lines the associated 95% confidence intervals. (b-d) Mean priming effects, prime discrimination, and the difference between priming effects and prime discrimination after transformation of all data to d', separately for the whole sample and the post-hoc selected sub-sample. Error bars represent 95% confidence intervals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)