Failed Attempts to Improve the Reliability of the Alcohol Visual Probe Task Following Empirical Recommendations

The visual probe task (VPT) is a computerized task used to measure attentional bias to substance-related stimuli. Little research has examined the psychometric properties of the VPT, despite concerns it demonstrates poor test–retest reliability and internal consistency. These issues can reduce confidence in inferences based on VPT performance. As such, we attempted to identify parameters under which the reliability of the alcohol VPT might be improved by applying recent empirical recommendations for outlier handling, bias calculation, and task design from the anxiety literature. We reanalyzed data from 3 previously published studies in our laboratory and 2 newly collected data sets. We compared tasks which presented images on the left/right of the screen with those presenting them above/below, whether participants responded to the location or content of the probe, and whether general alcohol-related images or images personalized to the individual were used. In each VPT we also applied a priori outlier removal (2 and 3 standard deviations and median absolute deviation) and data-driven outlier removal (winsorizing), in addition to calculating trial-level bias scores. Across all studies and tasks, internal consistency and test–retest reliability of attentional bias measures were inadequate. There was no consistent improvement in internal consistency or test–retest reliability as a function of outlier removal methods. We were unable to demonstrate adequate reliability of the alcohol VPT, which further supports observations that these tasks may not yield reliable measures. Future research should focus on improving the reliability of these tasks or abandoning them in favor of more reliable alternatives.

Several theoretical models of addiction suggest that individuals who drink alcohol demonstrate preferential attention to alcohol-related cues in their environment, at the expense of competing cues (Franken, 2003; Robinson & Berridge, 2001). This preferential attention is often referred to as an "attentional bias." Meta-analyses have demonstrated a small but robust link between attentional bias and craving (Field, Munafò, & Franken, 2009), and experimental manipulations of attentional bias have directly influenced alcohol consumption/relapse (Field & Eastwood, 2005; Schoenmakers et al., 2010) and craving (Luehring-Jones, Louis, Dennis-Tiwary, & Erblich, 2017), suggesting a possible causal relationship. However, more recently the clinical relevance of attentional bias has been challenged, with suggestions that weak findings are often overinterpreted and "null" findings ignored (Christiansen, Schoenmakers, & Field, 2015). Despite concerns, researchers continue to devote considerable effort to elucidating the exact role of attentional bias in addiction (and related behaviors such as obesity; Werthmann, Jansen, & Roefs, 2015).
One of the most popular tools used to measure attentional bias is the visual probe task (VPT, also known as the dot-probe task), first developed by MacLeod and colleagues (MacLeod, Mathews, & Tata, 1986). This task presents a pair of images: one alcohol-related and one control image (often a neutral or soft-drink image matched for composition and complexity). These images typically appear on the left- and right-hand sides of the computer screen. Following a defined period, usually between 200 and 2,000 ms, these images disappear and a target probe appears in the spatial location previously occupied by one of these images. Participants have to make a response to the location or content of the probe as quickly as possible. If participants are faster to respond to the probes occurring in the space previously occupied by alcohol-related cues compared with control cues, this is inferred as an attentional bias toward alcohol.
Despite widespread use and acceptance,¹ there is much debate with regard to the reliability of the VPT for substances of abuse. Ataya et al. (2012) examined the internal reliability of several VPTs for alcohol and tobacco conducted in their laboratory and concluded internal consistency was poor (α = .00 to .50; mean .18). This supports more recent claims that the internal consistency of measures of cognitive biases is suboptimal and underreported (Parsons, 2018a). In response, we argued that the poor reliability may be due to specific features of the VPT, one of which was the type of stimuli used in the task. Most studies provide a broad category of alcohol-related cues; however, these images may not represent the typical drinking habits of participants. For example, participants may identify as beer drinkers only; however, during a VPT they would be presented with stimuli depicting a broad range of alcoholic beverages (beer, wine, cider, spirits, etc.). To examine this, we (Christiansen, Mansfield, Duckworth, Field, & Jones, 2015) tailored a VPT to present only pictures that depicted the participants' preferred drink category (e.g., beer-related cues) and demonstrated improved internal consistency compared with a more general category (α = .73 compared with α = .19). We also demonstrated that directly measuring attentional biases using eye-tracking technology increased internal consistency further for personalized images (α = .73), but also general images (α = .51).
As well as internal reliability, test-retest reliability (the consistency of a measure over time) is necessary for valid inferences from psychological tasks. This may be particularly important when attentional bias is measured repeatedly within individuals: for example, in the case of assessing changes in attentional bias that should arise after attentional bias modification interventions. Emery and Simons (2015) demonstrated that the VPT had poor split-half (r = −.19) and test-retest reliability (r = .13). Similarly, in cocaine-using adults (Marks, Pike, Stoops, & Rush, 2014), test-retest reliability is low for reaction time (RT)-based measures (r = .24), but improved if examining eye movements (r = .51). Poor internal consistency and reliability threaten the validity of inferences that can be made using the VPT (Rodebaugh et al., 2016), and a failure to consider reliability might contribute to poor estimations of effect size and challenges to reproducibility (Parsons, 2018b; Zimmerman & Zumbo, 2015). Therefore, continued efforts need to be made to improve the psychometric properties of these tasks.
A recent paper attempted to provide empirical recommendations to improve the reliability of the VPT for anxiety-related images. Price and colleagues (2015) suggest that poor reliability of the VPT may be due (in part) to how outlying RTs are handled when preparing the data for analyses. Typical procedures involve cutoffs based on a valid response window for the population (e.g., RTs faster than 200 ms represent premature responding and slower than 2,000 ms suggest distraction), followed by removal of RTs which fall outside the distribution of the individual's mean (e.g., 2 or 3 SDs). Research suggests that despite these techniques being the most popular method of removing outliers, they do not perform well under certain conditions (Leys, Ley, Klein, Bernard, & Licata, 2013) and there is little consensus across studies (cf., differing procedures are reported in each of these studies using alcohol VPT; Field & Powell, 2007; Miller & Fillmore, 2010; Townshend & Duka, 2001). Price et al. (2015) compared the reliability of bias scores following these outlier removal techniques with data-driven outlier removal in which outliers which fall outside of the observed distribution were rescaled (winsorized; Erceg-Hurn & Mirosevich, 2008). This procedure reduces the impact of outliers but also maintains all data points, increasing power. A further difference was how images were presented, in that standard alcohol VPT images are often presented on the left and right side of the screen, whereas Price et al. (2015) presented them at the top and bottom of the screen. They examined the effect that procedural variables (probe location) may have on RT variance by examining the reliability on trials in which the probe only occurred in one position separately (e.g., bottom). Finally, they examined the reliability of bias scores averaged over tasks (given approximately 2 weeks apart), as an increased number of measures should increase reliability.
To summarize, they found that test-retest reliability was greatest when (a) bias scores were calculated for probes that occurred behind the bottom image only, (b) winsorized outlier removal was used rather than arbitrary a priori cutoffs, and (c) data from repeated VPTs were used, rather than a single task. The main focus of Price et al.'s (2015) investigation was the stability (test-retest) and internal consistency of attentional bias; unfortunately, they did not consider internal consistency within the task(s) by examining bias scores on a picture-pair basis.
A second limitation of current data analytic techniques is the underlying assumption that attentional bias is a stable construct. This assumption is problematic because attentional bias may fluctuate within individuals during the course of the task (Iacoviello et al., 2014; Zvielli, Amir, Goldstein, & Bernstein, 2016; Zvielli, Bernstein, & Koster, 2015). For example, in deprived smokers attentional biases were evident only in phasic bursts within the VPT but were not evident when using the traditional overall ("global") bias score. As such, we also calculated estimates of trial-level bias scores (TL-BS), based on recommendations by Zvielli et al. (2015).
Therefore, the aim of the current article was to apply the empirical recommendations of Price et al. (2015) and Zvielli et al. (2015), and the use of personalized stimuli (Christiansen, Mansfield, et al., 2015) to the alcohol VPT in order to examine whether these procedural and analytical changes led to improvements in internal consistency (within both image pairs and tasks) and test-retest reliability. We also examined cross-sectional associations of attentional bias with alcohol consumption and craving. We examined these associations in social drinkers as these individuals also experience craving, and a previous meta-analysis (Field, Munafò, et al., 2009) has demonstrated a link between attentional bias and craving irrespective of drinking status. First, we reanalyzed existing data from three published studies (Field, Duka, et al., 2009; Schoenmakers, Wiers, & Field, 2008) to provide internal consistency estimates (not previously reported) and examine whether different outlier cutoffs influenced these estimates. Then, in Study 1 we examined the internal consistency and test-retest reliability of a standard VPT and the VPT recommended by Price et al. (2015) using general alcohol-related cues. We hypothesized that internal consistency estimates would be greater for the recommended task compared with the standard task. In Study 2 we examined the internal consistency and test-retest reliability of the recommended VPT, with general and personalized alcohol-related cues and concurrent eye tracking. We hypothesized that personalized cues would lead to greater internal consistency estimates than general cues, and internal consistency would be further improved by eye tracking. In each study we also hypothesized that attentional bias measures computed from winsorized RTs and bottom-only probe trials would provide the greatest internal consistency.

Method

Data Reduction and Analyses
For outlier removal in all studies (preexisting and new) we conducted three different procedures. For the 2 SD procedure we removed all individual RTs that were faster than 200 ms or slower than 2,000 ms and then those that were 2 SDs above or below the individual mean. For the 3 SD procedure we removed all RTs < 200 ms and > 2,000 ms and then those that were 3 SDs above or below the mean. For winsorized outlier removal we rescaled values outside of 1.5 interquartile ranges from the Tukey hinges (25th and 75th percentile) of the full RT distribution of all individuals to the last valid value (Price et al., 2015). We also conducted the median absolute deviation (MAD) method of outlier removal (Leys et al., 2013). The MAD method involves calculating the median of the individual's RT distribution and subtracting this from each RT to create a series of absolute values; the median of these values is then multiplied by 1.4826 to calculate the MAD. The MAD was then multiplied by a value of 3. Upper and lower cutoffs [median ± (MAD × 3)] are then computed, and RTs falling outside them are removed. Note that we did not preregister our decision to include MAD as an outlier removal technique for our new data. Attentional bias scores were created for each picture pair by computing mean RTs on each trial type (congruent and incongruent) then subtracting congruent from incongruent RTs (mean incongruent − mean congruent), so that larger positive scores were indicative of increased attentional bias.
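The outlier procedures and the bias score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in these studies: the function names are hypothetical, quartiles from `statistics.quantiles` stand in for Tukey hinges, and `winsorize` expects the pooled RTs of all individuals as a single list.

```python
from statistics import mean, median, stdev, quantiles

def window(rts, lo=200, hi=2000):
    """Keep only RTs inside the valid response window (ms)."""
    return [rt for rt in rts if lo <= rt <= hi]

def sd_trim(rts, n_sd=2):
    """Apply the response window, then drop RTs more than n_sd SDs
    from the individual's mean (the 2 SD and 3 SD procedures)."""
    kept = window(rts)
    m, s = mean(kept), stdev(kept)
    return [rt for rt in kept if abs(rt - m) <= n_sd * s]

def mad_trim(rts, k=3, scale=1.4826):
    """Drop RTs outside median +/- k * MAD, with MAD scaled by 1.4826."""
    med = median(rts)
    mad = scale * median(abs(rt - med) for rt in rts)
    return [rt for rt in rts if abs(rt - med) <= k * mad]

def winsorize(pooled_rts):
    """Rescale RTs beyond 1.5 IQR from the quartiles to the last valid value.
    Assumption: inclusive quartiles approximate the Tukey hinges."""
    q1, _, q3 = quantiles(pooled_rts, n=4, method="inclusive")
    lo_fence, hi_fence = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    valid = [rt for rt in pooled_rts if lo_fence <= rt <= hi_fence]
    lo_valid, hi_valid = min(valid), max(valid)
    # No RT is discarded: extreme values are pulled in to the nearest valid RT.
    return [min(max(rt, lo_valid), hi_valid) for rt in pooled_rts]

def bias_score(congruent_rts, incongruent_rts):
    """Mean incongruent minus mean congruent RT; positive = bias toward alcohol."""
    return mean(incongruent_rts) - mean(congruent_rts)
```

In this sketch the trimming functions shorten the RT list, whereas `winsorize` returns a list of the same length, which is the property that preserves the number of data points.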
We also computed TL-BS by matching temporally contiguous pairs of congruous and incongruous trials within the VPT for each subject ["RT 1st Incongruent Trial − RT 1st Congruent Trial," "RT 2nd Incongruent Trial − RT 2nd Congruent Trial," and so on]. We conducted TL-BS on winsorized data without removing any pairs of RTs that were more than five trials apart, to ensure the largest number of trials was available for our reliability estimates. This provided us with a maximum of 64 individual bias scores. From these individual bias scores we calculated mean TL-BS positive (mean of all bias scores > 0 ms per participant), mean TL-BS negative (mean of all bias scores < 0 ms per participant), peak TL-BS positive (largest individual bias score > 0 ms), peak TL-BS negative (largest individual bias score < 0 ms), and TL-BS variability (the sum of distances between all individual bias scores/number of scores; Zvielli et al., 2015).² For internal consistency estimates we computed McDonald's omega (ω) because Cronbach's alpha often underestimates internal consistency (Sijtsma, 2009), and many have argued that its use be abandoned (Peters, 2014). For test-retest reliability we computed the intraclass correlation coefficient (ICC) using a two-way random effects model with absolute agreement. In line with Price et al. (2015) we report the single measurement, which is an indicator of the reliability if only one assessment point was used, and also the combined measure, which reflects the internal consistency of bias scores across the time points. We also reported Pearson's correlation between the two time points (a more common measure of test-retest reliability), to allow direct comparisons with previously published studies in this area (e.g., Emery & Simons, 2015; Marks et al., 2014). Across each study we used the total bias score with the greatest internal consistency to assess cross-sectional associations with individual differences in alcohol consumption and craving.
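The TL-BS indices described above can be illustrated with a short Python sketch. The function names and the trial representation are hypothetical, and two assumptions are made: the k-th incongruent trial is paired with the k-th congruent trial with no limit on how far apart they occurred, and "distances between bias scores" is read as the absolute difference between successive scores.

```python
from statistics import mean

def trial_level_bias(trials):
    """trials: list of (trial_type, rt_ms) tuples in presentation order,
    with trial_type either "congruent" or "incongruent".
    Returns one bias score per matched pair (k-th incongruent minus k-th congruent)."""
    congruent = [rt for kind, rt in trials if kind == "congruent"]
    incongruent = [rt for kind, rt in trials if kind == "incongruent"]
    return [i - c for i, c in zip(incongruent, congruent)]

def summarize_tlbs(scores):
    """Mean and peak positive/negative TL-BS, plus TL-BS variability."""
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    # Assumption: distances are taken between successive bias scores.
    distances = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return {
        "mean_positive": mean(positive) if positive else None,
        "mean_negative": mean(negative) if negative else None,
        "peak_positive": max(positive) if positive else None,
        "peak_negative": min(negative) if negative else None,
        "variability": sum(distances) / len(scores) if scores else None,
    }
```

With 64 congruent and 64 incongruent alcohol-control trials, `trial_level_bias` would return at most 64 scores, matching the maximum reported above.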
Finally, we used the cocron R package when making comparisons based on our internal consistency estimates (Diedenhofen & Musch, 2016).
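For readers unfamiliar with the two ICC values reported here, the following Python sketch computes the single-measurement and combined (average) coefficients for a two-way random effects model with absolute agreement from the standard ANOVA mean squares. It is an illustration only; the function name is hypothetical and it is not the software used for the reported estimates.

```python
def icc_two_way_random(data):
    """data: one row per participant, one column per session (e.g., Time 1, Time 2).
    Returns (single-measurement ICC, combined/average-measurement ICC),
    both defined for absolute agreement."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average
```

Because absolute agreement penalizes systematic shifts between sessions, a constant change in bias scores from Time 1 to Time 2 lowers both coefficients even when the rank order of participants is perfectly preserved.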

Analyzing Internal Consistencies of Preexisting Data
To examine the internal consistency of the VPT we reanalyzed the data from three studies published by our laboratory. Two studies examined attentional bias to alcohol-related cues (including Schoenmakers et al., 2008) and one to smoking-related cues (Field, Duka, et al., 2009). Schoenmakers et al. (2008) examined attentional bias following ingestion of a placebo beverage and an alcoholic beverage in heavy drinkers, and we provide internal consistency estimates for both conditions (this also allows comparisons with Ataya et al., 2012, who reported estimates from alcohol "priming" studies, which were also low, < .34). The other alcohol study examined attentional bias before and after attentional bias modification in heavy drinkers: here we provide reliability estimates for the baseline session only as it is reasonable to assume attentional bias modification may influence reliability estimates. Finally, Field, Duka, et al. (2009) examined attentional bias to smoking cues before and after attentional bias modification: again, we examined internal consistencies at baseline only. We decided to include estimates of smoking-related internal consistency as Field and Christiansen (2012) demonstrated that internal consistencies to smoking-related images should be greater due to more homogeneous images, compared with alcohol-related cues. As discussed, there was considerable variability in the task parameters, which allowed us to examine whether internal consistency was greater with a larger number of images (30), using different stimulus presentation durations (500 ms vs. 50 ms; Field, Duka, et al., 2009) and when intoxicated (Schoenmakers et al., 2008).

Overview of Findings From Preexisting Data
Findings for internal consistency for each study, using different outlier removal techniques, are presented in Table 1. To summarize, across the three studies internal consistency estimates did not reach acceptable levels (> .70). Estimates were greater when a larger number of picture pairs were used (30 picture pairs). To further investigate this we also examined internal consistencies from the same data sets when randomly selecting eight and 14 picture pairs; we chose eight and 14 to make direct comparisons with the data in Studies 1 and 2 below (8 picture pairs) and Christiansen, Mansfield, et al. (2015; 14 picture pairs). Estimates were also larger for longer stimulus presentations (500 ms vs. 50 ms). We did not observe evidence that estimates were greater for smoking-related images compared with alcohol-related images.
² Note, we did not preregister TL-BS analyses. These were recommended by a helpful reviewer during peer review. We were unable to calculate internal consistency estimates for TL-BS scores due to the large number of trials that are removed when using < 0 ms and > 0 ms as required.
We note that our estimates of trial-level attentional bias scores (Table 2) are consistent with previous observations from tobacco smokers that attentional bias is not a stable construct and considerable variability in bias scores occurs within the task. Furthermore, mean positive bias scores were generally a greater distance from 0 ms than mean negative bias scores suggesting the presence of attentional bias within the task, but this bias may be obscured if one relies on conventional attentional bias scores.
Finally, there was no evidence that any outlier removal technique led to improved internal consistencies across the studies. We examined whether overall attentional bias was present in each study, using the most reliable outlier removal technique (see Table 1). The presence of a positive bias toward substance cues is inconsistently seen. There were significant biases toward smoking cues irrespective of stimulus presentation duration and following alcohol and placebo intoxication, but only when a smaller number of images were used to calculate reliability and bias estimates.
Therefore, to briefly summarize, our reanalyses of existing data suggest that the internal consistency of the VPT for alcohol- and smoking-related cues is inadequate, despite differing task parameters. These findings support observations by Ataya et al. (2012), who also reported poor internal consistency estimates in VPTs used by their laboratory. Below, we report on two new studies which aimed to include personalized stimuli and different variations of the VPT based on Price et al. (2015). The design, hypotheses, statistical power justification, and analyses were preregistered on Open Science Framework prior to data collection (https://osf.io/gb5fz/).³

Current Data
Participants. Participants in each study were recruited from the University of Liverpool and the local community. In order to take part, participants had to drink alcohol on a regular basis (at least once per week). Participants were excluded if they had a current or previous diagnosis of a substance use disorder, both due to ethical considerations (exposure to substance-related cues could evoke craving, which could be problematic in this population) and because our primary interest was the reliability of these tasks in participants without substance use disorder. The studies were approved by the local ethics committee at the University of Liverpool.
Questionnaires
Timeline Follow-Back (TLFB). Participants completed 1-week retrospective recalls of their alcohol consumption in United Kingdom units (1 unit = 8 g pure alcohol), on a day-by-day basis. They were provided with an easy-to-follow guide of typical alcoholic drinks and their units, to ensure accurate estimations. The TLFB (Sobell & Sobell, 1992) is considered to be reliable over short periods and demonstrates considerable stability over time (Carey, Carey, Maisto, & Henson, 2004).
Approach and Avoidance of Alcohol Questionnaire (AAAQ). The AAAQ (McEvoy, Stritzke, French, Lang, & Ketterman, 2004) is a self-report measure of craving, using a 14-item scale. It has three subscales: Inclined/Indulgent, Obsessed/Compelled, and Resolved/Regulated. It has good psychometric properties; however, studies have suggested a two-factor structure of approach and avoidance dimensions (Klein et al., 2007).
VPT(s). We based the VPTs on those presented in Price et al. (2015). In the standard version of the task the picture pairs were presented on the left and right of the screen followed by a probe (the letter "E" or "F") and participants had to respond to the location of the probe. In the recommended version the picture pairs were presented at the top and bottom of the screen followed by the probe, and, in this case, participants had to respond to the content of the probe (e.g., press the E key if they saw E, press the F key if they saw F). There is an important distinction between responding to the content rather than the location of the probe: simply responding to the location can be interpreted as perceiving the cue on the left or not perceiving the cue on the right, for example. Therefore, responding to the content of the cue should overcome this issue and presumably lead to more reliable bias estimates. We note that the majority of studies now use VPTs which require responding to content rather than location, so this may no longer be "standard"; however, our aim was to compare a task and outlier removal techniques which are empirically recommended to those which have been used previously.
In both tasks trials began with the presentation of a fixation cross (+) for 500 ms. In the standard version this appeared in the direct center of the screen, whereas in the recommended version this appeared in the space at the top of the screen (where the top image would be presented). Following this, the picture pairs would be presented for 500 ms; these images would then be removed from display, and immediately followed by presentation of the probe until a response was made. Each task had 10 (control-control) practice trials, followed by 160 trials, of which 128 presented alcohol-control picture pairs and 32 presented control-control picture pairs. The probe appeared with equal frequency in place of the alcohol and control images in the alcohol-control pairings, and an equal number of times on the left and right/top and bottom depending on the task. Each task took approximately 10 min to complete. Note, there is considerable heterogeneity in previously published studies using the VPT for alcohol; for example, our previous assessment of reliability did not include control-control images (Christiansen, Mansfield, et al., 2015), and had fewer trials (68) but a larger number of picture pairs (14). Other studies have used a larger number of trials (252; Emery & Simons, 2015), included control-control comparisons (Field, Mogg, Zetteler, & Bradley, 2004), and varied stimulus presentation durations (Field et al., 2004). As such, there is no agreed protocol for assessing attentional bias using the VPT.
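The trial counts above imply a simple balanced design, which the following Python sketch makes concrete for the recommended (top/bottom) task. It assumes, purely for illustration, that the eight picture pairs were fully crossed with probe target (alcohol vs. control) and alcohol image position (top vs. bottom), giving four repetitions of each combination; the text reports only the totals and the balancing constraints, so the exact crossing, the function name, and the dictionary fields are hypothetical.

```python
import random
from itertools import product

def build_trials(n_pairs=8, repeats=4, n_control=32, seed=0):
    """Build one shuffled block of 160 experimental trials.
    Assumed crossing: pair x probe target x alcohol position x repeats."""
    trials = []
    for pair, probe_behind, alcohol_pos, _ in product(
            range(n_pairs), ("alcohol", "control"), ("top", "bottom"), range(repeats)):
        other_pos = "bottom" if alcohol_pos == "top" else "top"
        trials.append({
            "kind": "alcohol-control",
            "pair": pair,
            "alcohol_pos": alcohol_pos,
            # The probe replaces either the alcohol image or the control image.
            "probe_pos": alcohol_pos if probe_behind == "alcohol" else other_pos,
            "congruent": probe_behind == "alcohol",
        })
    for i in range(n_control):
        trials.append({
            "kind": "control-control",
            "pair": None,
            "alcohol_pos": None,
            "probe_pos": ("top", "bottom")[i % 2],
            "congruent": None,
        })
    random.Random(seed).shuffle(trials)  # fixed seed so the sketch is reproducible
    return trials
```

Under this assumed crossing each picture pair appears on 16 alcohol-control trials (8 congruent, 8 incongruent), which is why per-pair bias scores rest on relatively few RTs.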
Images. Each task had eight alcohol-related and control picture pairs. General alcohol images were taken from our previous studies (Field, Mogg, Mann, Bennett, & Bradley, 2013; Jones et al., 2012) and depicted images such as a model holding a bottle of beer or a pen to their lips, or a stack of beer crates or books. For the personalized images we used a selection of the images from Christiansen, Mansfield, et al. (2015). We used different control images for the control-control comparisons to prevent habituation to the images. All images were 140 mm × 90 mm. Distance between images was 75 mm in the recommended task and 95 mm in the standard task.

Procedure
Participants attended the laboratory and provided informed consent before completing the TLFB and AAAQ. They then completed the standard VPT and recommended VPT, the order of which was counterbalanced across participants. Following completion of the tasks participants left the laboratory and returned between 7 and 14 days later. Upon their return they completed a second TLFB, AAAQ, standard, and recommended VPT (presentation of VPTs was counterbalanced across time and participants) before being thanked and debriefed. Each session lasted approximately 25 min and participants were given course credits. In the standard task of Study 1 we analyzed RTs for the probe occurring on the left side only, to provide a comparison with bottom-only trials in the recommended version.

Results
Internal consistency and test-retest reliability. Internal consistency and test-retest reliability estimates of alcohol attentional bias scores are shown in Table 3. Across both tasks (standard vs. recommended), procedural variables (below only vs. above and below/left only vs. left and right), and outlier estimation technique (2 SD vs. 3 SD vs. winsorized) the internal consistency was poor. No estimate approached the threshold for acceptable internal consistency (.70), across time points. Combined winsorized data from probes appearing behind the bottom image in the recommended task had the greatest internal consistency, but this was not significantly greater than the second greatest (winsorized above and below: t(55) = 0.046, p = .96).
Within-subject variability. Mean and peak TL-BS measures and ICC estimates are displayed in Table 4. Mean measures (both positive and negative) offered better test-retest reliability than peak estimates and variability, with negative TL-BS mean scores providing the greatest reliability (ICC = .434). Reliability estimates from the recommended task were generally superior to those from the standard version. Furthermore, the estimates for negative mean TL-BS scores were greater than estimates from global bias scores, irrespective of outlier removal strategy.
Associations between trial-level biases and alcohol consumption/craving. Mean negative bias scores on the recommended task had the greatest test-retest reliability. At Time 1 there was no significant association with units consumed (r = .168, p = .173). There were significant associations with both inclined (r = .358, p = .003) and obsessed subscales (r = .250, p = .041), but no significant association with the avoidant subscale (r = .220, p = .074). At Time 2 there were no significant associations with units consumed (r = .001, p = .992) or craving subscales (rs < −.089, ps > .520).

Procedure
Participants attended the laboratory and provided informed consent before completing the TLFB and AAAQ. They then reported their preferred drink out of beer, wine, cider, or vodka (Christiansen, Mansfield, et al., 2015). They then completed a recommended VPT with general alcohol images and a task with alcohol images personalized to their preferred drink (counterbalanced), with concurrent eye tracking. There was no overlap between the general and personalized image sets, to reduce the possibility that participants habituated to the images across sessions. Following this they left the laboratory, and returned between 7 and 14 days later. Upon their return they completed a second TLFB, AAAQ, and two VPTs (general stimuli and personalized stimuli, counterbalanced), with concurrent eye tracking. Each session lasted approximately 25 min and individuals were given course credits for their participation. Eye movements were measured using the ASL D6 (Applied Science Laboratories, Bedford, Massachusetts) eye tracker continuously recording data at 120 Hz.

Data Reduction and Analysis for Eye Movements
We computed gaze dwell time as the total amount of time (ms) that participants fixated on images, with a fixation defined as a stable eye movement within 1° of visual angle for 100 ms or longer (see previous studies: Christiansen, Mansfield, et al., 2015; Jones et al., 2012). Bias scores were calculated by subtracting gaze dwell times on neutral images from alcohol images separately for each picture pair.
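As a small illustration of the dwell-time bias score, the Python sketch below sums fixation durations per image type and takes the difference. It assumes that fixations have already been detected and labeled by the image they landed on; the function name and the input format are hypothetical, and the spatial 1° criterion is taken as already applied upstream.

```python
def dwell_bias(fixations, min_ms=100):
    """fixations: list of (target, duration_ms) for one picture pair,
    where target is "alcohol" or "neutral".
    Returns dwell time on the alcohol image minus dwell time on the neutral image."""
    dwell = {"alcohol": 0, "neutral": 0}
    for target, duration in fixations:
        # Only stable gaze lasting at least min_ms counts as a fixation.
        if duration >= min_ms and target in dwell:
            dwell[target] += duration
    return dwell["alcohol"] - dwell["neutral"]
```

A positive value indicates that gaze was maintained longer on the alcohol image than on its paired neutral image.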

Internal consistency and test-retest reliability of RT data.
Internal consistency and test-retest reliability estimates of alcohol attentional bias scores are shown in Table 5. Across both stimulus sets (personalized vs. general), procedural variables (below only vs. above and below), and outlier estimation technique (2 SD vs. 3 SD vs. winsorized) the internal consistency was poor. Data from personalized cues with 2 SD outliers and above and below probes approached the threshold for acceptable internal consistency (.63); however, this was only at Time 1 and was not significantly greater than the second greatest (3 SD above and below: t(44) = 1.289, p = .204).
Within-subject variability. For TL-BS estimates, see Table 4. As in Study 1, mean estimates had greater test-retest reliability than peak estimates. The estimates which provided the greatest test-retest reliability were from the negative mean bias score using personalized images (ICC = .446). As in Study 1, this estimate was superior to the estimate obtained from global bias measures, irrespective of outlier removal techniques.
Internal consistency and test-retest reliability of eye-movement data. Internal consistency and test-retest reliability estimates of alcohol attentional bias using eye movements are shown in Table 6. As with RT data, the internal consistency estimates were poor; the greatest estimates came from personalized cues using all trials (.570), but this still fell short of the cutoff for acceptability. Test-retest reliability was also poor, with general alcohol bias demonstrating the greatest reliability.
Associations between trial-level biases and alcohol consumption/craving. Mean negative bias scores to personalized cues had the greatest test-retest reliability. At Time 1 there was no significant association between mean negative bias and units consumed (r = .058, p = .701). There were significant associations with both inclined (r = .467, p = .001) and obsessed subscales (r = .345, p = .019), but no significant association with the avoidant subscale (r = .212, p = .157). At Time 2 there were no significant associations with units consumed (r = .181, p = .251) or craving subscales (rs < .199, ps > .207).
Attentional bias was not significantly associated with units consumed at Time 1 (r = .201, p = .180) or craving subscales (rs < .204, ps > .174). Similarly, there was no significant association between attentional bias and units consumed (r = .152, p = .350) at Time 2. However, there was a significant positive association with inclined (r = .340, p = .032) and obsessed subscales (r = .426, p = .006) at Time 2. There was no significant association with the avoidant subscale (r = .054, p = .742).

Discussion
The aim of this series of studies was to attempt to improve the internal consistency and test-retest reliability of the alcohol/smoking VPT by using recently published empirical recommendations. First, we observed that estimates of internal consistency of VPTs in previously published studies were less than acceptable irrespective of outlier removal techniques. Furthermore, we demonstrated limited support for empirical recommendations in improving psychometric properties of the VPT across all studies, as both internal consistency and test-retest reliabilities were consistently poor. Our findings contribute to the growing body of evidence which suggests that assessing attentional bias to alcohol (and smoking) using the VPT is unreliable (Ataya et al., 2012). However, these observations are not limited to substance-related cues. Chapman, Devue, and Grimshaw (2017) reviewed internal consistencies across a number of studies examining threatening images, pain-related images, and fearful faces and demonstrated split-half reliabilities ranging from −.22 to .59. Furthermore, they demonstrated reliabilities were only acceptable when cues were presented for short periods (100 ms), suggesting longer time periods such as those regularly used here (500 ms) and in the wider addiction literature (500 to 2,000 ms) allow attention to be disengaged and reallocated before a probe appears. These findings were corroborated by Waechter, Nelson, Wright, Hyatt, and Oakman (2014); however, they demonstrated direct measures of attention (eye movements) had excellent reliability at longer stimulus durations (5,000 ms).
While personalized stimuli led to greater internal consistency in Study 2, we were unable to replicate previous findings which have demonstrated that using alcohol-related cues based on an individual's preferred drink improves the internal consistency of the VPT to acceptable levels (Christiansen, Mansfield, et al., 2015). We can speculate as to why we did not replicate these findings. It is possible that the larger number of alcohol-neutral picture pairs (14 vs. 8) in Christiansen, Mansfield, et al. (2015) increased the internal reliability estimate, as Cronbach's alpha has been demonstrated to increase as a function of items in the scale (Tavakol & Dennick, 2011); indeed, we also noted that in  reliabilities were close to the acceptable threshold with 30 images (see Table 1) and this declined when a lower number of images was used to estimate internal consistency. Nevertheless, to directly compare across studies we took alpha (.73) from personalized cues from Christiansen, Mansfield, et al. (2015) and used alpha for Study 2 (.61), personalized cues using 2 SD outlier removal, and demonstrated no significant difference between the two, F(59, 45) = 1.44, p = .200.
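The arithmetic behind this comparison can be sketched briefly. The example below is illustrative only and rests on two assumptions that the text does not state explicitly: that the reported F value is a Feldt-type ratio of the two (1 − alpha) terms, and that the Study 2 alpha of .61 was estimated from 8 picture pairs. The function names `feldt_w` and `spearman_brown` are hypothetical helpers, not part of any analysis package used in these studies, and the Spearman-Brown projection further assumes that any added picture pairs would behave like the existing ones.

```python
def feldt_w(alpha_a: float, alpha_b: float) -> float:
    """Ratio (1 - smaller alpha) / (1 - larger alpha).

    Assumed here to be the statistic that is referred to an F distribution
    when two independent alphas are compared; a value of 1 means equal alphas.
    """
    lower, higher = sorted((alpha_a, alpha_b))
    return (1.0 - lower) / (1.0 - higher)


def spearman_brown(alpha: float, length_factor: float) -> float:
    """Projected reliability if the number of items is multiplied by length_factor."""
    return length_factor * alpha / (1.0 + (length_factor - 1.0) * alpha)


# Alphas taken from the text: .61 (Study 2) and .73 (Christiansen, Mansfield, et al., 2015).
w = feldt_w(0.61, 0.73)

# Assumption: .61 came from 8 picture pairs; project to 14 pairs.
projected = spearman_brown(0.61, 14 / 8)

print(f"Ratio of (1 - alpha) terms: {w:.2f}")
print(f"Projected alpha with 14 pairs: {projected:.2f}")
```

Under these assumptions the ratio reproduces the reported F value of 1.44, and lengthening the task from 8 to 14 pairs would by itself project an alpha of about .73, which is in line with the suggestion that the number of picture pairs could account for the difference between studies.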
We also demonstrated poor test-retest reliability across time points in all studies, image types, and outlier removal methods, with the greatest reliability demonstrated using trial-level estimates. These findings are similar to those assessing test-retest reliability of cocaine attentional bias (Marks et al., 2014) and anxiety-related words/pictures (Price et al., 2015). While these findings might be attributable to measurement inadequacy, it is also possible that attentional bias demonstrates low stability/state dependence (Hedge, Powell, & Sumner, 2018). Recent theoretical models suggest that attentional bias is sensitive to immediate momentary evaluations, which, in turn, are sensitive to a myriad of internal and environmental factors which may differ across testing sessions. These observations are supported by Zvielli et al. (2015), who demonstrated phasic bursts of attentional bias within the task, and by our TL-BS, which demonstrated improved (but still not acceptable) test-retest reliability in Studies 1 and 2.
We found limited evidence of significant associations between attentional bias and alcohol consumption or craving, unlike previous studies and meta-analyses (Field, Munafò, et al., 2009; Marks et al., 2014). One explanation for this is that these associations exist but are obscured by poor psychometric properties of the VPT (Rodebaugh et al., 2016). In support of this, Christiansen and Bloor (2014) demonstrated, using the Stroop task in social drinkers, that personalized cues were predictive of alcohol use but general cues were not (cf. equivocal findings in dependent drinkers; Fridrici et al., 2013), and in Study 2 we demonstrated tentative evidence of positive associations with craving when using global bias measures. Trial-level bias estimations also demonstrated that as attentional avoidance of alcohol cues (increased negative bias scores) increased in strength, subjective craving reduced in strength. However, it is reasonably likely that findings throughout the literature are overstated or that there is no meaningful relationship (Christiansen, Schoenmakers, et al., 2015). Christiansen, Mansfield, et al. (2015) did not find any associations with alcohol use/craving when internal consistency was greater than the acceptable threshold (see also Waechter et al., 2014, in social anxiety). Furthermore, a lack of standardized protocol for the VPT allows for researcher degrees of freedom which may artificially inflate associations through "significance chasing" (Ware & Munafò, 2015).
The major implication of these findings is that the poor reliability of the VPT was consistently evident despite numerous changes to stimuli, analyses, and protocol aimed at improving reliability. Given the widespread use of the VPT in the literature, this may have wide-reaching consequences, and it is probable that the VPT is not reliable in nonclinical populations and should not be used as a diagnostic tool (Schmukle, 2005). Furthermore, McNally (2018) suggests that the lack of reliability of the VPT and other attentional measurements is an emerging crisis which may threaten survival of the field. Therefore, a focus on improving reliability is urgently needed to help accurately test theoretical predictions of addiction models (Franken, 2003; Robinson & Berridge, 2001), and also to test whether attentional bias modification using modified VPTs can lead to robust clinically relevant outcomes (Cox, Fadardi, Intriligator, & Klinger, 2014). Until we can develop robust, reliable RT measures of attentional bias, future research should focus on measuring attention directly wherever possible using eye-tracking technology, as this has been demonstrated to show the greatest levels of internal consistency and test-retest reliability in other studies (Christiansen, Mansfield, et al., 2015; Waechter et al., 2014). Eye tracking may provide more reliable measures as it is not dependent on manual RTs, which are distally related to attentional capture and can be confounded by intervening emotional processes and response execution (Armstrong & Olatunji, 2012). Furthermore, it allows researchers to distinguish different stages of attention (early vs. late) and to obtain other potentially useful measures, such as latency and direction of initial fixation (Hardman, Scott, Field, & Jones, 2014).
There are limitations to our studies. First, we did not specifically recruit heavy drinkers. It may be that reliability of the VPT will be greater in heavy drinkers and alcohol-dependent patients, who are thought to demonstrate more robust attentional bias (Schoenmakers et al., 2010; Townshend & Duka, 2001). However, we note that the average alcohol consumption in our studies suggests the majority of our samples were heavy drinkers. Furthermore, the averages are comparable to the data sets we reanalyzed from  and Schoenmakers et al. (2008), who specifically recruited heavy drinkers but did not have acceptable internal consistency. In relation to this, we found limited evidence for the presence of bias using global measures in our newly collected data, and mixed evidence in previous data sets. However, we note that mean positive bias scores were greater than mean negative bias scores using trial-level data across the data sets, suggesting biases may exist at periods during the task but may be obscured when examining global bias scores. Indeed, similar studies have failed to detect a global bias in nondependent drinkers (Groefsema, Engels, Kuntsche, Smit, & Luijten, 2016; Manchery, Yarmush, Luehring-Jones, & Erblich, 2017). Therefore, future research should further examine the utility of trial-level biases (however, others have suggested limited potential for these indices; Kruijt, Field, & Fox, 2016). Similarly, the absence of overall bias in Studies 1 and 2 may also be attributable to the inability of an unreliable VPT to robustly detect these biases rather than to an absence of bias in the current samples. Second, we are unable to provide any estimates for alcohol-dependent patients, and future research should establish the reliability of the VPT in these samples.
To conclude, in a series of studies we attempted to improve the internal consistency and test-retest reliability of the VPT for alcohol-related (and smoking-related) stimuli using previously published recommendations. Across five data sets (3 preexisting and 2 novel) we did not find adequate internal consistency or test-retest reliability, adding to concerns that the VPT is an unreliable measure of attentional bias for substance-related stimuli.