Time preferences are reliable across time-horizons and verbal vs. experiential tasks

Individual differences in delay-discounting correlate with important real-world outcomes, e.g. education, income, drug use, and criminality. As such, delay-discounting has been extensively studied by economists, psychologists and neuroscientists to reveal its behavioral and biological mechanisms in both human and non-human animal models. However, two major methodological differences hinder comparing results across species. Human studies present long time-horizon options verbally, whereas animal studies employ experiential cues and short delays. To bridge these divides, we developed a novel language-free experiential task inspired by animal decision-making studies. We find that subjects’ time-preferences are reliable across both verbal/experiential and second/day differences. When we examined whether discount factors shifted or scaled across the tasks, we found a surprisingly strong effect of temporal context. Taken together, these results indicate that subjects have a stable, but context-dependent, time-preference that can be reliably assessed using different methods, thereby providing a foundation to bridge studies of time-preferences across species.

Intertemporal choices involve a trade-off between a larger outcome received later and a smaller outcome received sooner. Many individual decisions have this temporal structure, such as whether to purchase a cheaper refrigerator, but forgo the ongoing energy savings. Since research has found that intertemporal preferences are predictive of a wide variety of important life outcomes, ranging from SAT scores, graduating from college, and income to anti-social behaviors, e.g. gambling or drug abuse [1,13,24,28,45], they are frequently studied in both humans and animals across multiple disciplines, including marketing, economics, psychology, and neuroscience.

A potential obstacle to understanding the biological basis of intertemporal decision-making is that human studies differ from non-human animal studies in two important ways: long versus short time-horizons and choices that are made based on verbal versus non-verbal (i.e. "experiential") stimuli. In animal studies, the subjects experience the delay between their choice and the reward (sometimes cued with a ramping sound or a diminishing visual stimulus) before they can proceed to the next trial [8,11,73]. Generally, there is nothing for the subject to do during this waiting period. In human studies, subjects usually make a series of choices [...]

Figure 1. A: A novel language-free intertemporal choice task. This is an example sequence of screens that subjects viewed in one trial of the non-verbal task. First, the subject initiates the trial by pressing on the white-bordered circle. During fixation the subject must keep the cursor inside the white circle. The subject hears an amplitude modulated pure tone (the tone frequency is mapped to reward magnitude and the modulation rate is mapped to the delay of the later option). The subject next makes a decision between the sooner (blue circle) and later (yellow circle) options. If the later option is chosen, the subject waits until the delay time finishes, which is indicated by the colored portion of the clock image. Finally, the subject clicks in the middle bottom circle ("reward port") to retrieve their reward. The reward is presented as a stack of coins of a specific size and a coin drop sound accompanies the presentation. B: Stimuli examples in the verbal experiment during decision stage (the bottom row of circles is cropped). C: Timeline of experimental sessions.
[...] trial. At the end of the session, a single long-verbal trial was selected randomly to determine the payment. If the selected trial corresponded to a subject having chosen the later option, she received her reward via an electronic transfer after the delay.

Subjects' time-preferences are reliable across both verbal/experiential and second/day differences

Subjects' impulsivity was estimated by fitting their choices with a hierarchical Bayesian model of hyperbolic discounting with decision noise (Materials and Methods). The model ($M_{6p,4s}$) had 6 population level parameters (discount factor, $k$, and decision noise, $\tau$, for each of the three tasks) and 4 parameters per subject: $k_{NV}$, $k_{SV}$, $k_{LV}$ and $\tau$. The subject level effects are drawn from a normal distribution with mean zero. Subjects' choices were well-fit by the model (Fig. 2 & Fig. S1). Since we did not ex ante have a strong hypothesis about how the subjects' impulsivity measures in one task would translate across tasks, we first examined ranks of impulsivity and found significant correlations across experimental tasks (Table 1). In other words, the most impulsive subject in one task is likely to be the most impulsive subject in another task. This result is robust to different functional forms of discounting and estimation methods (Fig. S2 & Table S3). For example, if we [...] Table 1). We found that $k$, for all tasks, had a log-normal distribution across our subjects (as in [69] and shown in Fig. 3C), hence we present our results in $\log(k)$.

Consistent with existing research, we find that time-preferences are stable in the same task within subjects between the first half of the block and the second half of the block within sessions and also across experimental sessions that take place every two weeks (SI Results) [4,50]. In our verbal experimental sessions the short and long tasks were alternated and the order was counter-balanced across subjects.
We did not find any order [...]

Figure 2. A 50% median split (± 1 standard deviation) of the softmax-hyperbolic fits for more patient (A-C) and less patient (D-F) subjects. The values of $k$ and $\tau$ are the means within each group (decision noise $\tau$ decreases significantly from non-verbal to verbal tasks, bootstrapped mean test, $p < 10^{-4}$). Average psychometric curves obtained from the model fits (lines) versus actual data (circles with error bars) for NV, SV and LV tasks for each delay value, where the x-axis is the reward magnitude and the y-axis is the probability (or proportion for actual choices) of later choice. Error bars are binomial 95% confidence intervals. We excluded the error in the model for visualization.
In our experimental design, the SV task has shared features with both the NV and LV task. First, the SV shares time-horizon with the NV task. Second, the SV and LV are both verbal and were undertaken at the same time. The NV and LV tasks differ in both time-horizon and verbal/non-verbal. The only potential feature that is shared between all tasks is delay-discounting. To test whether the correlation between NV and LV might be accounted for by their shared correlation with the SV task, we performed linear regressions of the discount factors in each task as a function of the other tasks (e.g. $\log(k_{NV}) = \beta_{SV}\log(k_{SV}) + \beta_{LV}\log(k_{LV}) + \beta_0 + \epsilon$). For NV the two predictors explained 63% of the variance ($F(60, 2) = 50.63$, $p < 10^{-9}$). It was found that $\log(k_{SV})$ significantly predicted $\log(k_{NV})$ ($\beta_{SV} = 1.28 \pm 0.15$, $p < 10^{-9}$) but $\log(k_{LV})$ did not ($\beta_{LV} = -0.12 \pm 0.09$, $p = 0.181$).

Figure 3. A, B: Each circle is one subject (N=63). The logs of delay-discounting coefficients in SV task (x-axis) plotted against the logs of delay-discounting coefficients in NV (A) and LV (B) tasks (y-axis). The color of the circles and the colorbar identify the ranks in NV task. Pearson's r is reported on the figure. The error bars are the SD of the estimated coefficients (posterior means). Solid line is the linear fit. Shaded area is the 95% CI of the linear fit. Dashed line is unity. C: Distribution of posterior parameter estimates ($\log(k)$) from the model fit for the three tasks in control experiment 1. Probability density estimates were obtained (Materials and Methods) for posteriors of $\log(k)$ for experimental tasks ($k_{NV} \sim$ 1/sec, $k_{SV} \sim$ 1/sec, $k_{LV} \sim$ 1/day). Comparisons between tasks are reported in Table 4. Note, the units for $k_{SV}$ & $k_{NV}$ (1/sec) would need to be scaled by 86400 sec/day to be directly compared to $k_{LV}$.
For LV we were able to predict 40% of the variance ($F(60, 2) = 19.64$, $p < 10^{-6}$) and found that $\log(k_{SV})$ significantly predicted $\log(k_{LV})$ [...] predictors explained 72% of the variance ($F(60, 2) = 78.93$, $p < 10^{-9}$). Coefficients for both predictors were significant ($\beta_{NV} = 0.435 \pm 0.050$, $p < 10^{-9}$; $\beta_{LV} = -0.223 \pm 0.046$, $p < 10^{-5}$), where $\beta$ = mean ± std. error. We further verified these results by generating 1-predictor reduced models based on the stronger of the 2 predictors for each task and comparing the nested models using the Akaike Information Criterion (AIC) and likelihood ratio tests (LR test) (Table 2).

To test whether the verbal/non-verbal gap or the time-horizons gap accounted for more variation in discounting we used a linear mixed-effects model where we estimated $\log(k)$ as a function of the two gaps (as fixed effects) with subject as a random effect (using the lme4 R package [5,6]). We created two predictors: days was false in NV and SV tasks for offers in seconds and was true in the LV task for offers in days; verbal was true for the SV and LV tasks and false for the NV task. We found that time-horizon ($\beta_{days} = -0.524 \pm 0.235$, $p = 0.026$) but not verbal/non-verbal ($\beta_{verbal} = -0.317 \pm 0.235$, $p = 0.178$) contributed significantly to the variance in $\log(k)$. This result was further supported by comparing the 2-factor model with reduced 1-factor models (i.e. that only contained either time or verbal fixed effects). Dropping the days factor significantly decreased the likelihood, but dropping the verbal factor did not (Table 3).

[...] and/or scaled between tasks (Materials and Methods). We find that subjects in both SV and NV are more impatient than LV, but not different from each other (i.e. significant shifts between SV and LV, NV and LV, Table 4). There is significant scaling between SV and the other two tasks (Table 4). This is likely driven by subgroups that were exceptionally patient in the LV task (Fig. 3B) or impulsive in the NV task (Fig. 3A).

Table 4. Shift and scale of log(k) between tasks. * denotes p < 0.05, one-sided test.

Controlling for visuo-motor confounds

In the main experiment, we held the following features constant across three tasks: the visual display and the use of a mouse to perform the task. However, after observing the strong correlations between the tasks (Fig. 3) we were concerned that the effects could have been driven by the superficial (i.e. visuo-motor) aspects of the tasks. In other words, the visual and response features of the SV and LV tasks may have reminded subjects of the NV task context and nudged them to use a similar strategy across tasks. While this may be interesting in its own right, it would limit the generality of our results. To address this, we ran a control experiment (n=25 subjects) where the NV task was identical to the original NV task, but the SV and LV tasks were run in a more traditional way, with a text display and keypress response (control experiment 1, SI Method & Fig. S6). We replicated the main findings of our original experiment for ranks of $\log(k)$ (Table S5) and correlation between $\log(k)$ in SV and LV tasks (Fig. 4B). The Pearson correlation between NV and SV tasks (Fig. 4A) was lower than expected given the 95% confidence intervals of the resampled correlations of the main experiment and assuming 25 subjects (SI Results). This suggests that some of the correlation between SV and NV tasks in the main experiment may be driven by visuo-motor similarity in experimental designs. We did not find shifts or scaling between the posterior distributions of $\log(k)$ across tasks in this control experiment ([...]

Figure 4. A,B: Control experiment 1 (n=25). The logs of delay-discounting coefficients in SV task (x-axis) plotted against the logs of delay-discounting coefficients in NV (A) and LV (B) tasks (y-axis). The color of the circles and the colorbar identify the ranks in NV task. Each circle is one subject. Pearson's r is reported on the figure. The error bars are the SD of the estimated coefficients. Solid line is the linear fit. Shaded area is the 95% CI of the linear fit. Dashed line is unity. C: Distribution of posterior parameter estimates ($\log(k)$) from the model fit for the three tasks in control experiment 1. Probability density estimates were obtained for posteriors of $\log(k)$ for experimental tasks ($k_{NV} \sim$ 1/sec, $k_{SV} \sim$ 1/sec, $k_{LV} \sim$ 1/day).

Strong effect of temporal context

We described above that the discount factors in the LV task, $k_{LV}$, were almost equivalent (ignoring unexplained variance) to those in the SV task, $k_{SV}$ (Fig. 3B). However, the units of $k_{LV}$ are 1/day and the units of $k_{SV}$ are 1/second. This finding implies that for a specific reward value, if a subject would decrease their subjective utility of that reward by 50% for an increase from 5 to 10 seconds in the SV task, they would also decrease their subjective utility of that reward by 50% for an increase from 5 to 10 days in the LV task. This seems implausible, particularly from a neoclassical economics perspective. However, reward units also change when moving from the SV to the LV task. In our sessions, the exchange rate in SV was 0.05 CNY per coin (since all coins are accumulated and subjects are paid the total profit), whereas in LV, subjects were paid on the basis of a single trial chosen at random using an exchange rate of 4 CNY for each coin. These exchange rates were set to, on average, equalize the possible total profit between the short and long delay tasks. However, even accounting for both the magnitude effect [29,30] and unit conversion (calculations presented in SI Results) the discount rates are still scaled by 4 orders of magnitude from the short to the long time-horizon tasks [53].

One interpretation of this result is that subjects are simply ignoring the units and only focusing on the number. This would be consistent with an emerging body of evidence that numerical value, rather than conversion rate or units, matters to human subjects [17,26]. A second possible interpretation is that subjects normalize the subjective delay of the offers based on context, just as they normalize subjective value based on current context and recent history [39,41,76,78]. A third possibility is that in the short delay tasks (NV and SV) subjects experience the wait for the reward on each trial as quite costly, in comparison to the delayed gratification experienced in the LV task. This "cost of waiting" may share some intersubject variability with delay-discounting but may effectively scale the discount factor in tasks with this feature [55].
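The unit issue can be checked with a short calculation. The sketch below is a minimal illustration, assuming the standard hyperbolic form U = V / (1 + kT), in which the product kT is dimensionless. The discount factor value is hypothetical and is not an estimate from our data.

```python
import math

SECONDS_PER_DAY = 86400  # 24 * 60 * 60


def hyperbolic_utility(value, delay, k):
    """Discounted utility U = V / (1 + k * T); k and T must use matching time units."""
    return value / (1.0 + k * delay)


def per_second_to_per_day(k_per_second):
    """Re-express a discount factor given in 1/sec in units of 1/day."""
    return k_per_second * SECONDS_PER_DAY


# Hypothetical discount factor in a seconds task (1/sec).
k_per_second = 0.05
k_per_day = per_second_to_per_day(k_per_second)

# The same physical delay (10 seconds) expressed in both units gives the same utility.
u_seconds = hyperbolic_utility(8.0, 10.0, k_per_second)
u_days = hyperbolic_utility(8.0, 10.0 / SECONDS_PER_DAY, k_per_day)

# Size of the seconds-to-days conversion in orders of magnitude (a little under 5).
orders_of_magnitude = math.log10(SECONDS_PER_DAY)
```

If subjects converted units consistently, a discount factor near 0.05 per second would correspond to one in the thousands per day. Numerically similar k values in the seconds and days tasks therefore imply a rescaling of nearly five orders of magnitude before any adjustment for reward magnitude.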

In an attempt to disentangle these possibilities, we ran a control experiment (n=16 subjects) using two verbal discounting tasks (control experiment 2, SI Method). In one task, the offers were in days (DV). In the other, the offers were in weeks (WV). This way, we could directly test whether subjects would discount the same for 1 day as 1 week (i.e. ignore units) or 7 days as 1 week (i.e. convert units). We found strong evidence for the latter ([...]

Having ruled out the possibility that subjects ignore units of time, we test our second potential explanation: that subjects make decisions based on a subjective delay that is context dependent. We reasoned that if choices are context dependent then it may take some number of trials in each task before the context is set.

Consistent with this reasoning, we found a small but significant adaptation effect in early trials: subjects are more likely to choose the later option in the first trials of the SV task (Fig. 5B,C). It seems that, at first, seconds in the current task are interpreted as being smaller than days in the preceding task, but within several trials days are forgotten and time preferences adapt to a new time-horizon of seconds. [...]

Using three tasks, we set out to test whether the same delay-discounting process is employed regardless of the verbal/non-verbal nature of the task and the time-horizon. We found significant correlations between subjects' discount factors in the three tasks, providing evidence that there are common cognitive (and presumably underlying neural) mechanisms driving the decisions in the three tasks. In particular, the strong correlation between the short time-horizon non-verbal and verbal tasks (r = 0.79, Fig. 3A) provides the first evidence for generalizability of the non-verbal task, suggesting that this task can be applied to both human and animal research for direct comparison of cognitive and neural mechanisms underlying delay-discounting. However, the correlation between the short-delay/non-verbal task and the long-delay/verbal task is lower (r = 0.36). Taken together, our results suggest that animal models of delay-discounting may have more in common with short time-scale consumer behavior such as impulse purchases and "paying-not-to-wait" in mobile gaming [22], and that caution is warranted when drawing conclusions about the broader applicability of these models to long time-horizon real-world decisions, such as buying insurance or saving for retirement.

Stability of preferences

The question of stability is of central importance to applying in-lab studies to real-world behavior. There are several concepts of stability that our study addresses: first, test/re-test stability; second, stability across the verbal/non-verbal gap; third, stability across the second/day gap. Consistent with previous studies [4,40,50], we found high within-task reliability. Choices in the same task did not differ when made at the beginning or the end of the session nor when they were made in sessions held on different days even 2 weeks apart (SI Results).

To our knowledge, there are no studies comparing stability across the verbal/non-verbal gap for delay-discounting. The closest literature that we are aware of finds that value encoding (the convexity of the [...] domain [81]. It may be that unlike time or value, probability is processed differently in verbal vs. non-verbal settings [33].

There are two aspects to the time-horizon gap that may contribute independently to differences in subjects' preferences between our short and long tasks. First, there is the difference in order of magnitudes of the delays. Second, there is a difference in the experience of the delay, in that all delays are experienced in the short tasks, but only one delay is experienced in the long task.

Our control study comparing discounting of days vs. weeks eliminated the second factor since only one delay was experienced for both days and weeks tasks. We found almost perfect correspondence between the choices in the two tasks (Fig. 5A): subjects discounted 7 days as much as they discounted one week. However, days and weeks are only separated by one order of magnitude, while seconds vs. days are five orders of magnitude apart. So while the days/weeks experiment provides some evidence that the magnitude of the delays does not contribute substantially to variance in choice, larger differences (e.g. comparing hours vs. weeks) may produce an effect. The evidence from the literature on this issue is mixed. On the one hand, some have found that measures of discount factors on month-long delays are not predictive of discount factors for year-long horizons (a difference of one order of magnitude) [44,74], but others have found consistent discounting for the same ranges [36]. Other studies that compared the population distributions of discount factors for short (up to 28 days) to long (years) delays (2 orders of magnitude) found no differences in subjects' discount factors [2,21]. Some of these discrepancies can be attributed to the framing of choice options: standard larger later vs. smaller sooner compared to a negative framework [44], where subjects want to be paid more if they have to worry longer about some negative events in the future.

Several previous studies have compared discounting in experienced delay tasks (as in our short tasks) with tasks where delays were hypothetical or just one was experienced [36,40,53,64]. For example, Lane et al. [40] also used a within-subject design to examine short vs. long delays (e.g. similar to our short-verbal and long-verbal tasks) and found similar correlations (r ∼ 0.5 ± 0.1) with a smaller sample size (n=16). Consistent with our findings, they found (but did not discuss) a 5 order of magnitude scaling factor between subjects' discounting of seconds and days, suggesting that this scaling is a general phenomenon.

Cost of waiting vs. discounting future gains

It may seem surprising that human subjects would discount later rewards, i.e. choose immediate rewards, in a task where delays are in seconds. After all, subjects cannot consume earnings immediately. Yet, this result is consistent with earlier work that suggests individuals derive utility from receiving money irrespective of when it is consumed [48,49,63]. In our design, a pleasing (as reported by subjects) 'slot machine' sound accompanied the presentation of the coins in the short-delay tasks. This sound can be interpreted as an instantaneous secondary reinforcer [38]. Further, this result is consistent with studies which find that humans exhibit discount rates comparable to other species when consuming liquid rewards [35]. On the other hand, this would not be surprising for those who develop (or study) "pay-not-to-wait" video games [22], which exploit players' impulsivity to acquire virtual goods with no actual economic value.

Using a seconds time-horizon may lead one to question if we can measure delay-discounting or if we are capturing the cost of waiting [52]. Waiting or doing nothing "builds up anxiety and stress in an individual due both to the sense of waste and the uncertainty involved in a waiting situation" [54]. These different interpretations (Paglieri [55] described the delayed option being framed as 'waiting' in seconds compared to 'postponing' in days time-horizon) may lead one to question what we can learn from comparing within-subject behavior across tasks. Although it is not known how time is perceived, e.g. subjects could overestimate the duration of the short delay, which will lead to greater discounting, we argue that the significant correlations observed indicate there are some shared biological mechanisms underlying each of the three delay-discounting tasks, which could explain why our inability to resist a candy in a seconds time-horizon self-control task predicts our ability to complete college and other long time-horizon behaviors [13,19,28,51] (but see [77]).

Subjective scaling of time

The range of rates of discounting we observed in the long-verbal task was consistent with that observed in other studies. For example, in a population of more than 23,000 subjects the log of the discount factors ranged from -8.75 to 1.4 ([69], compare with Fig. 3B). This implies that, in our short tasks, subjects are discounting extremely steeply, i.e. they are discounting the rewards per second about the same amount that they discounted the reward per day. This discrepancy has been reported before [40,53]. We consider three (non-mutually exclusive) explanations for this scaling. First, subjects may ignore units. However, by testing overlapping time-horizons of days and weeks we confirmed that subjects can pay attention to units. Second, it may be that the cost of waiting [14,53,55] (discussed above) compared to the cost of postponing is, coincidentally, the same as the number of seconds in a day.

We feel this coincidence is unlikely, and thus favor the third explanation: temporal context. When making decisions about seconds, subjects 'wait' for seconds and when making decisions about days subjects 'postpone reward' for days [55]. Although our experiments were not designed to test whether the strong effect of temporal context was due to normalizing, existence of extra costs for waiting in real time, or both, we did find some evidence for the former (Fig. 5C). Consistent with this idea, several studies have found that there are both systematic and individual level biases that influence how objective time is mapped to subjective time for both short and long delays [80,83]. Thus, subjects may both normalize delays to a reference point and introduce a waiting cost at the individual level that will lead short delays to seem as costly as the long ones.
Using posted flyers, we initially recruited 35 students but added 32 more to increase statistical power (the power analysis indicates that the total of 63 participants is adequate to detect a medium to strong correlation across subjects, SI Results).

The study was approved by the IRB of NYU Shanghai. The subjects were between 18-23 years old, 34 [...]

The experiments were constructed to match the design of tasks used for rodent behavior in Prof. Erlich's lab. For the temporal discounting task, the value of the later option is mapped to the frequency of the pure tone (frequency ∝ reward magnitude) and the delay is mapped to the amplitude modulation (modulation period ∝ delay). The immediate option was the same on all trials for a session and was unrelated to the sound.

Through experiential learning, subjects learned the map from visual and sound attributes to values and delays. This was accomplished via 6 learning stages (0, 1, 2, 3, 4, 5) that build up to the final non-verbal task (NV) that was used to estimate subjects' discount-factors. Briefly, the stages were designed to (0) learn that a mouse-click in the middle bottom 'reward-port' produced coins (that subjects knew would be exchanged for money), (1) learn to initiate a trial by a mouse-click in a highlighted port, (2) learn 'fixation': to keep the mouse-cursor in the highlighted port, (3) associate a mouse-click in the blue port with the sooner option (a reward of a fixed 4 coin magnitude that is received instantly), (4) associate varying tone frequencies with varying reward at the yellow port, and (5) associate varying amplitude modulation frequencies with varying delays at the yellow port. On each trial of stages 3, 4 & 5 there was either a blue port or a yellow port (but not both). The exact values for reward and delay parameters experienced in the learning stages correspond to values that are used throughout the experiment. After selecting the yellow port (i.e. the delayed option), a countdown clock appeared on the screen and the subject had to wait for the delay which had been indicated by the amplitude modulation of the sound for that trial. Any violation (i.e. a mouse-click in an incorrect port or moving the mouse-cursor during fixation) was indicated by flashing black circles over the entire "poke" wall [...] circles presented on the screen, in place of using sounds. Second, in the non-verbal and verbal short delay sessions, subjects continued to accumulate coins (following experiential learning stages) and the total earned was paid via electronic payment at the end of each experimental session.
In the long-verbal sessions, a single trial was randomly selected for payment (a method of payment commonly used in human studies with long delays [17]) and shown at the end of the session. The associated payment is made now or later depending on the subject's choice in the selected trial.

For model-based analysis we use hierarchical Bayesian analysis (HBA; brms 2.0.1 [10,12]), which allows for pooling data across subjects, recognizing individual differences and finding full posterior distributions, rather than point estimates of parameters. The means of HBA posteriors of the individual discount-factors for each task are almost identical to the individual fits done for each experimental task separately using maximum likelihood estimation through fmincon in Matlab (SI Method, Fig. S1 & Fig. S2). We further validated the HBA method by simulating choices from a population of 'agents' with known parameters and demonstrating that we could recover those parameters given the same number of choices per agent as in our actual dataset. The first non-verbal session data was excluded from model-fitting due to a higher proportion of first-order violations than in the following two non-verbal sessions (see further discussion in SI Results).

[...] (estimated using Eq. 1) into a probability of later choice for each subject. The two functions below rely on the four parameters ($k_{i,s}$: ($k_{i,NV}$, $k_{i,SV}$, $k_{i,LV}$), the discounting factor per subject and task, and $\tau_i$, the individual decision noise).

Hyperbolic utility model:

$$U = \frac{V}{1 + k_{i,s} T} \quad (1)$$

where $V$ is the current value of the delayed asset and $T$ is the delay time.

Softmax rule:

$$P(L) = \frac{e^{U_L/\tau_i}}{e^{U_L/\tau_i} + e^{U_S/\tau_i}}$$

where $L$ is the later offer, $S$ is the sooner offer and $\tau_i$ is the individual decision noise.
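To make the two functions concrete, here is a minimal Python sketch, assuming the standard hyperbolic form U = V / (1 + kT) and a two-option softmax in which the decision noise acts as a temperature. The function names and the parameter values are illustrative assumptions, not fitted estimates. The sooner option is treated as having zero delay.

```python
import math


def hyperbolic_utility(value, delay, k):
    """Subjective utility of a reward `value` received after `delay`."""
    return value / (1.0 + k * delay)


def p_choose_later(later_value, delay, sooner_value, k, tau):
    """Softmax probability of choosing the later offer over an immediate one."""
    u_later = hyperbolic_utility(later_value, delay, k)
    u_sooner = hyperbolic_utility(sooner_value, 0.0, k)
    # For two options the softmax reduces to a logistic function of the
    # utility difference, scaled by the decision noise tau.
    return 1.0 / (1.0 + math.exp(-(u_later - u_sooner) / tau))


# Illustrative offer: 8 coins after 10 time units vs. 4 coins now.
# With k = 0.1 the later offer is worth 8 / (1 + 1) = 4, so the agent is indifferent.
p_indifferent = p_choose_later(later_value=8, delay=10, sooner_value=4, k=0.1, tau=1.0)

# A perfectly patient agent (k = 0) strongly prefers the later offer.
p_patient = p_choose_later(later_value=8, delay=10, sooner_value=4, k=0.0, tau=1.0)
```

Larger values of tau flatten the choice curve toward 0.5, while smaller values make choices closer to deterministic.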

For plotting posteriors of $\log(k)$ we calculated probability density estimates (for smoothing) using the ksdensity function in Matlab. The estimate is based on a normal kernel function, and is evaluated at 100 equally spaced points, $x_i$, that cover the range of the data in $x$.

To test for differences across tasks we examined the HBA fits using the brms::hypothesis function. This function allows us to directly test the posterior probability that the $\log(k)$ is shifted and/or scaled between treatments. This function returns an "evidence ratio" which tells us how much we should favor the hypothesis over the inverse (e.g. $\frac{P(a>b)}{P(a<b)}$) and we used Bayesian confidence intervals to set a threshold (p < 0.05) to assist frequentists in assessing statistical significance.
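For a one-sided hypothesis, the evidence ratio is simply a ratio of posterior probabilities, which can be approximated by counting posterior draws. The sketch below is a plain-Python illustration of that idea, not the brms implementation. The function name and the toy draws are assumptions made for the example.

```python
def evidence_ratio(samples_a, samples_b):
    """Approximate P(a > b) / P(a < b) from paired posterior draws."""
    pairs = list(zip(samples_a, samples_b))
    n_greater = sum(1 for a, b in pairs if a > b)
    n_less = sum(1 for a, b in pairs if a < b)
    if n_less == 0:
        # Every draw supports a > b: the ratio is unbounded.
        return float("inf")
    return n_greater / n_less


# Toy posterior draws (hypothetical values, for illustration only).
draws_a = [1.0, 2.0, 3.0, 4.0]
draws_b = [0.0, 3.0, 1.0, 2.0]

# Three of the four paired draws have a > b and one has a < b.
ratio = evidence_ratio(draws_a, draws_b)
```

A ratio well above 1 favors the hypothesis a > b, and a ratio near 1 indicates that the posterior does not distinguish the two directions.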

The bootstrapped (mean, median and variance) tests are done by sampling with replacement and [...] Rstan [32] for Bayesian nonlinear multilevel modeling [10], shinystan [27] was used to diagnose and develop the brms models. Package lme4 was used for linear mixed-effects modeling [6].

Data Availability

Software for running the task, as well as the data and analysis code for regenerating our results, are available at github. [...]

[...] reward. There is no consensus on which functional form of delay-discounting best describes human behavior.

Although the exponential model [67] of time discounting has a straightforward economic meaning (a constant probability of loss of reward per waiting time), the hyperbolic model [47] seems to more accurately describe how individuals discount future rewards, in particular preference reversals [7]. To be sure that our results and main conclusions did not depend on the method (e.g. hierarchical Bayesian vs. maximum likelihood estimation of individual subject parameters) or functional form (e.g. exponential vs. hyperbolic), we validated our results with several methods.

Exponential model:

$$U = V e^{-k_i T} \quad (S1)$$

Hyperbolic model:

$$U = \frac{V}{1 + k_i T}$$

where $V$ is the current value of the delayed asset, $T$ is the delay time and $k_i$ is the individual discounting factor.
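The link between functional form and preference reversals can be illustrated numerically. The sketch below assumes the standard exponential and hyperbolic forms and uses hypothetical offers (4 coins vs. 8 coins ten time units later) with a hypothetical discount factor. It checks whether adding a common front-end delay to both offers flips the preference.

```python
import math


def exponential_utility(value, delay, k):
    """Exponential discounting: U = V * exp(-k * T)."""
    return value * math.exp(-k * delay)


def hyperbolic_utility(value, delay, k):
    """Hyperbolic discounting: U = V / (1 + k * T)."""
    return value / (1.0 + k * delay)


def prefers_later(utility, k, front_delay):
    """True if 8 coins at front_delay + 10 beats 4 coins at front_delay."""
    return utility(8.0, front_delay + 10.0, k) > utility(4.0, front_delay, k)


K = 0.2  # hypothetical discount factor, same for both models

# Hyperbolic agent: prefers sooner when it is immediate, later when both are far away.
hyperbolic_now = prefers_later(hyperbolic_utility, K, front_delay=0.0)
hyperbolic_far = prefers_later(hyperbolic_utility, K, front_delay=100.0)

# Exponential agent: the utility ratio does not depend on the front-end delay,
# so the preference is the same at both horizons.
exponential_now = prefers_later(exponential_utility, K, front_delay=0.0)
exponential_far = prefers_later(exponential_utility, K, front_delay=100.0)
```

Under the exponential form the ratio of the two utilities is 2·exp(-10k) regardless of the front-end delay, which is why no reversal can occur. Under the hyperbolic form the ratio grows with the front-end delay.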

We considered both a shift-invariant softmax rule and a scale-invariant matching rule to transform the subjective utilities of the sooner and later offers into a probability of choosing the later offer.

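Softmax:

$$P(L_i) = \frac{1}{1 + e^{-(U(L_i) - U(S_i))/\tau_i}} \tag{S3}$$

Matching rule:

where $L_i$ is the later option, $S_i$ is the sooner option, $U(\cdot)$ is the subjective utility of an offer, $P(L_i)$ is the probability of choosing the later offer, and $\tau_i$ is the individual decision rule noise, or temperature.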
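The shift-invariance of the softmax rule can be checked numerically. The sketch below is a hedged Python illustration using the standard two-option (logistic) softmax with hyperbolic utilities. The offer (8 coins after 14 seconds versus 4 coins immediately) uses values that appear in the task, k = 0.07 is the example coefficient quoted for one subject, and the temperature of 1.0 is an assumption.

```python
import math

def hyperbolic_value(v, delay, k):
    """Hyperbolic subjective utility of amount v after the given delay."""
    return v / (1.0 + k * delay)

def p_later_softmax(u_later, u_sooner, tau):
    """Two-option softmax: probability of choosing the later offer."""
    return 1.0 / (1.0 + math.exp(-(u_later - u_sooner) / tau))

u_later = hyperbolic_value(8.0, 14.0, 0.07)   # 8 coins after 14 seconds
u_sooner = 4.0                                # 4 coins immediately
tau = 1.0                                     # assumed decision noise

p = p_later_softmax(u_later, u_sooner, tau)
p_shifted = p_later_softmax(u_later + 3.0, u_sooner + 3.0, tau)  # add a constant
p_scaled = p_later_softmax(2.0 * u_later, 2.0 * u_sooner, tau)   # multiply by 2

print(f"P(later) = {p:.4f}, shifted = {p_shifted:.4f}, scaled = {p_scaled:.4f}")
```

Adding a constant to both utilities leaves the choice probability unchanged, while rescaling both utilities changes it. This is the sense in which the softmax rule is shift-invariant but not scale-invariant.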

Using maximum likelihood estimation we fit each subject's choices to four baseline models: 1) hyperbolic utility with softmax, 2) exponential utility with softmax, 3) hyperbolic utility with matching rule and 4) exponential utility with matching rule. We also considered models that account for utility curvature, i.e. $V$ is replaced by $V^{\alpha_i}$, and models that account for trial number and cumulative waiting time.

In the models that account for trial number or cumulative wait time the individual discounting factor, $k_i$, consists of a constant component $k_{ci}$ and a time-dependent component: where $tr$ is the trial number, $tw_i$ is the individual's total waiting time in seconds (which exists only for the short delays task) and $\theta_i$ is a scaling parameter.

We first estimated subjects' time-preferences individually (since discounting factors differ among people) for each experimental task with maximum likelihood estimation (MLE) and used leave-one-trial-out cross-validation for model comparison. Based on the Bayesian information criterion (BIC, Table S1) and the number of subjects that were well described by the model (Table S2), the softmax-hyperbolic model was selected as the best nonlinear model, and we used this for the Bayesian hierarchical modeling.

Since the estimation procedure was identical for all MLE fits (Matlab code available on github repository) we describe it using the softmax-hyperbolic model as an example. This is a 2-parameter model to estimate given set of parameters, the model predicts that trial 1 will result in 80% chance of the subject choosing later option, and the subject, in fact, chose the later, the trial would be assigned a likelihood of 0.8 (if the subject chose sooner, the trial would have a likelihood of 0.2). Finally, we perform a leave-one-out cross-validation for each subject-task to avoid overfitting. We leave one trial out and use the rest of the trials in the experimental task to predict this trial. We repeat this procedure for each trial.

Figure S1 shows an example of the model fit and estimated parameters for one of the subjects. For this particular subject, fitting the softmax-hyperbolic model in the non-verbal task resulted in a delay-discounting coefficient k = 0.07 (Fig. S1). We can readily observe that although first-order violations were present during the non-verbal task, in the verbal task they are eliminated. Although the unit-free discount rates seem fairly stable, the decision noise $\tau_i$ gets smaller from non-verbal to verbal tasks.
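The per-trial likelihood and leave-one-trial-out procedure described above can be sketched as follows. This is a simplified Python illustration, not the authors' Matlab code: it replaces the numerical optimizer with a coarse grid search, the grids and the simulated subject (k = 0.07, temperature 1.0) are assumptions, and the reward and delay values are the ones listed for the non-verbal task.

```python
import math
import random

def p_later(later, delay, sooner, log_k, tau):
    """Softmax-hyperbolic probability of choosing the later option."""
    u_later = later / (1.0 + math.exp(log_k) * delay)
    return 1.0 / (1.0 + math.exp(-(u_later - sooner) / tau))

def neg_log_lik(trials, log_k, tau):
    """Negative log-likelihood; a trial is (later, delay, sooner, chose_later)."""
    total = 0.0
    for later, delay, sooner, chose_later in trials:
        p = min(max(p_later(later, delay, sooner, log_k, tau), 1e-9), 1.0 - 1e-9)
        total -= math.log(p if chose_later else 1.0 - p)
    return total

# Assumed search grids (a stand-in for a continuous optimizer).
LOG_K_GRID = [-6.0 + 0.25 * i for i in range(25)]  # -6.0 .. 0.0
TAU_GRID = [0.25 * i for i in range(1, 13)]        # 0.25 .. 3.0

def fit(trials):
    """Grid-search maximum likelihood estimate of (log_k, tau)."""
    candidates = ((lk, t) for lk in LOG_K_GRID for t in TAU_GRID)
    return min(candidates, key=lambda params: neg_log_lik(trials, *params))

def loo_log_lik(trials):
    """Leave-one-trial-out: fit on the other trials, score the held-out trial."""
    total = 0.0
    for i, held_out in enumerate(trials):
        log_k, tau = fit(trials[:i] + trials[i + 1:])
        total -= neg_log_lik([held_out], log_k, tau)
    return total

# Simulate one hypothetical subject on offers drawn from the task's values.
rng = random.Random(1)
true_log_k, true_tau = math.log(0.07), 1.0
trials = []
for _ in range(40):
    later = rng.choice([1, 2, 5, 8, 10])
    delay = rng.choice([3, 6.5, 14, 30, 64])
    chose_later = rng.random() < p_later(later, delay, 4.0, true_log_k, true_tau)
    trials.append((later, delay, 4.0, chose_later))

log_k_hat, tau_hat = fit(trials)
loo = loo_log_lik(trials)
print(f"log(k) = {log_k_hat:.2f}, tau = {tau_hat:.2f}, LOO log-lik = {loo:.2f}")
```

The held-out log-likelihood can then be compared across candidate models, since a model that only fits noise in the training trials will predict the left-out trials poorly.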
where later reward is the later reward, sooner reward is the sooner reward; logk is the natural logarithm of the discounting parameter k and noise (τ) is the decision noise (as in Eq. S2 and S3, respectively).

The fits from the HBA model are almost identical to the individual fits done for each experimental task separately using the softmax-hyperbolic model and the 'fmincon' function in Matlab (Fig. S2). The individual log(k) values also agree with the range of delay-discounting values reported in a large cohort GWAS delay-discounting study [69]. The rank correlation values for individual fits in Table S2 correspond to the ones in the main text both in magnitude and significance.

For nonparametric analysis we calculated a coarse model-free measure: the percentage of trials in which the later option was chosen in each of the three tasks (percent 'yellow' choice). There was significant heterogeneity between subjects' responses (NV: mean = 0.45, std = 0.19; SV: mean = 0.59, std = 0.23; LV: mean = 0.59, std = 0.26; our experiments considered both larger and smaller later options, with the percentage of smaller later options larger for non-verbal than for verbal tasks, see further discussion on first-order violations).

Nevertheless, if we rank the subjects by the fraction of trials in which they chose the later option in each task, there is a strong correlation between subjects' ranks across tasks (SV vs. NV, Spearman r = 0.71 (p < 0.01); SV vs. LV, r = 0.49 (p < 0.01); NV vs. LV, r = 0.30 (p < 0.05)), regardless of gender or nationality (see 'Significance Tests of Demographic and Psychological Categories' below).
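The rank correlation used here can be computed directly from the per-subject fractions of later choices. The following is a minimal pure-Python sketch of the Spearman correlation (Pearson correlation of average ranks). The six subjects and their fractions are made-up illustrative values, not data from the study.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            out[idx] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical fraction of later ('yellow') choices per subject in two tasks.
frac_later_nv = [0.30, 0.45, 0.20, 0.60, 0.75, 0.50]
frac_later_sv = [0.40, 0.65, 0.35, 0.55, 0.90, 0.70]
print(f"Spearman r (NV vs. SV) = {spearman(frac_later_nv, frac_later_sv):.2f}")
```

Because only the ordering of subjects enters the calculation, this measure is insensitive to any overall shift or rescaling of choice fractions between tasks.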

The notion of first-order stochastic dominance is usually defined for gambles [75,82]. Compliance with first-order stochastic dominance means that, in principle, the observed behavior can be adequately modeled with a utility-function style analysis. Utility-function analysis is controversial, in part, when subjects show inconsistent behavior, and it requires significant variation within subjects. By design, our experiments considered both larger and smaller later options, generating significant variation to identify utility function parameters.

Overall, in the non-verbal task there were 25% smaller later options, whereas in the verbal experiment there were 10% smaller later options. Given that the smaller later option is always strictly worse than larger

Wilcoxon rank sum test, p < 0.1).

We used the Barratt Impulsiveness Scale (BIS-11; [57]) as a standard measure of impulsivity. This test is reported to often correlate with biological, psychological, and behavioral characteristics. The mean total score for our student sample was 61.79 (std = 9.53), which is consistent with other reports in the literature (e.g., [71]). The BIS-11 did not correlate significantly with the estimated delay-discounting coefficients (BIS vs.

Given that we are estimating subjects' discount-factors using a finite number of trials per task, even if subjects' discount-factors were identical in different tasks, we would not expect the rank correlations to be perfect. In order to estimate the expected maximum correlation we could observe, we simulated "consistent" subjects' choices using the hyperbolic model with softmax rule, assuming that there is a single delay-discounting parameter, $k_i$ (mean across tasks), for each subject. We did this 100,000 times and computed a distribution of pairwise rank correlations (Fig. S3). These simulations revealed that the decision noise contributed a very small amount of variance (Table S3).

Next, we combined the 7-vectors for each subject into a 21-vector and performed a leave-one-subject/task-out cross-validation. For each subject-task, we trained a linear model to predict the choices from one task based on the other two tasks for all other subjects. Then, for the left-out subject we predicted each task from the other two. We can predict 70% of the subjects' 21-vectors this way (Fig. S4). We conclude that, for most subjects, there is a shared scaling effect that allows each task's choices to be predicted by the other two.

In our main experimental tasks we used two types of delays: in seconds and in days, where 1 day = 86400 seconds. We also used three types of exchange rates: for the non-verbal task 1 coin = 0.1 RMB; for verbal short delay 1 coin = 0.05 RMB; for long delay 1 coin = 4 RMB. Humans tend to discount large rewards less steeply than small rewards, i.e. discounting rates tend to increase as amounts decrease [29,30]. We re-calculated the

We ran a power analysis to find the total sample size required to determine whether a correlation coefficient differs from zero. For an expected correlation of r = 0.5 and 80% power (the ability of a test to detect an effect, if the effect actually exists, [9,16]) the required sample size is N = 29; for a medium-size correlation of r = 0.3, N = 84.
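The sample sizes from the power analysis can be approximated with the Fisher z-transformation. The sketch below is an assumption-laden Python illustration: the source does not state which method or software was used, so a two-sided test at alpha = 0.05 and the standard Fisher approximation are assumed, and the function name `n_for_correlation` is hypothetical. Under these assumptions the approximation gives values close to the reported N = 29 and N = 84.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r (two-sided),
    using the Fisher z-transformation: N = ((z_alpha + z_power) / atanh(r))^2 + 3."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3.0

for r in (0.5, 0.3):
    print(f"r = {r}: approximate N = {n_for_correlation(r):.1f}")
```

Small differences from exact power calculations are expected, because the Fisher transformation is only an approximation to the sampling distribution of r.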

Learning Stages Analysis

Of 67 subjects, 64 passed the learning stages (SI Movies) for our novel language-free task. One of these 64 did not complete all sessions of the study, so the analysis below is done for 63 subjects.

There were 6 learning stages (0, 1, 2, 3, 4, 5). The first four stages were respectively designed to 0) learn the reward port, 1) learn the initiation port, 2) learn fixation, and 3) associate the blue colored port with the sooner option (a reward of a fixed 4 coin magnitude that is received instantly). Four trials without violations in a row are required to pass these learning stages. In stage 4, subjects are primed to the sound frequency to learn the variability of reward magnitudes (1, 2, 5, 8, or 10 coins): first, the lower and upper bounds, then, in ascending and descending order and, finally, in random order. In the final stage 5, subjects heard the AM of a sound during fixation that is now mapped to the delay (3, 6.5, 14, 30, or 64 seconds) of the later option. The order of the stimuli presented was the same as in the previous stage. For the last two stages, 4 trials without violations in a row are required during the random order of stimuli presentation to pass. Learning stages were shorter for the 2nd and the 3rd non-verbal sessions, since only random rewards and delays were presented for 'forced'

Subjects experienced difficulty with learning stage 2. This is the only stage where the average performance was less than 40%, compared to more than 70% performance in other stages (Table S4). Subjects on average also spent significantly more time (measured in seconds) on learning stage 2 than on the next learning stage 3 (Wilcoxon signed-rank test, p < 0.01), although only 4 trials without violations are required to pass this learning stage and learning stage 3 includes more steps. During learning stage 2 subjects had to learn fixation. Fixation was specifically designed to drive subjects' attention away from the computer mouse (since no movement is allowed outside of the circle during fixation) and bring focus to other senses. During fixation in learning stages 4 and 5 (as well as in the decision stages) subjects hear a sound that corresponds to the reward magnitude and delay.

There were three major patterns of violations: 1) the subject had difficulty passing stage 2 (because of fixation violations), but later stages were completed quickly; 2) the subject was able to pass stage 2, by having 4 correct answers in a row, but during stage 4 encountered problems with fixation violations again; 3) the subject was able to proceed to stage 5 almost without violations, but was stopped by several fixation violations at stage 5. Figure S5 shows the second pattern of correct choices vs. violations.

In total, 25 (29 started, 4 withdrew) undergraduate students from NYU Shanghai participated in 5 experimental sessions (3 non-verbal and 2 verbal sessions, in this sequence, that were scheduled bi-weekly).

The study requirements in order to meet the IRB protocol conditions remained the same as in the main experiment (Materials and Methods). In each session, subjects completed a series of intertemporal choices.

Across sessions, 160 trials were conducted in each of the following tasks mimicking the main experiment: i) non-verbal (NV), ii) verbal short delay (SV; 3 to 64 seconds), and iii) verbal long delay (LV; 3 to 64 days). In each trial, irrespective of the task, subjects made a decision between the sooner and the later options. The NV task was exactly the same as in the main experiment. All subjects passed the learning stages. The SV and LV tasks differed from the main experiment in exactly two ways: 1) the stimuli presentation didn't include a display of circles of different colors; instead, two choices were presented on the left or on the right side (counterbalanced) of the screen (Fig. S6); 2) the subjects didn't have to click on the circles using the mouse; instead they used a keyboard to indicate an 'L' or 'R' choice. Everything else stayed the same as in the main experiment, i.e. the last two sessions included an alternating set of verbal tasks: SV-LV-SV-LV (or LV-SV-LV-SV, for a random half of subjects), the payment was done differently for SV and LV (randomly picked trial for payment in LV, Materials and Methods), etc. The purpose of this control experiment is to confirm that the significant correlation between non-verbal and verbal tasks we report in Results is not an artifact of our main experimental design: in the main experiment, subjects experience the same visual display and motor responses in the non-verbal and verbal tasks, and this design similarity might drive the correlation between time-preferences in these tasks. Instead, in this control experiment the verbal tasks are made as similar as possible (keeping our experiment structure) to typical intertemporal choice tasks used in human subjects.

Individual delay-discounting fits were estimated using the HBA softmax-hyperbolic model using the same procedure as in the main experiment. No-Circles (NC) data were analyzed as is, keeping delay units in seconds and in days, whereas days-weeks (DW) data were analyzed after converting delays in weeks to days. Among estimated delay-discounting coefficients (Fig. 4C), there is no significant difference in means of log(k) between tasks (bootstrapped mean (and median) tests, SV vs. LV and NV vs. SV, all p > 0.8). Thus, we find that, similar to the main experiment, there is no common shift across tasks and individual effects per task explain a significant amount of variance. For DW, based on Fig. 5A we conclude that the day delay task and the week delay task are likely to be perceived by subjects as a single delay task with different units in it. Subjects do pay attention to units and individual differences between delay-discounting coefficients do matter, while the differences between the tasks do not.
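A bootstrapped mean test of the kind referred to above can be sketched as follows. This is a hedged Python illustration, not the authors' procedure: the details of their test are not given in this section, so the sketch assumes paired per-subject log(k) values, resamples subjects with replacement, and reports a two-sided p-value for the mean difference. The function name and the synthetic data are hypothetical.

```python
import random

def bootstrap_mean_test(x, y, n_boot=10000, seed=0):
    """Two-sided bootstrap p-value for a difference in means of paired samples.
    Subjects (pairs) are resampled with replacement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    below = above = 0
    for _ in range(n_boot):
        mean_diff = sum(rng.choice(diffs) for _ in range(n)) / n
        below += mean_diff <= 0
        above += mean_diff >= 0
    return min(1.0, 2.0 * min(below, above) / n_boot)

# Synthetic log(k) values for 25 hypothetical subjects in two tasks.
rng = random.Random(42)
logk_sv = [rng.gauss(-3.0, 1.0) for _ in range(25)]
logk_lv = [v + rng.gauss(0.0, 0.5) for v in logk_sv]  # no systematic shift

print(f"bootstrap p-value (SV vs. LV) = {bootstrap_mean_test(logk_sv, logk_lv):.3f}")
```

A large p-value means the resampled mean differences straddle zero, which is the pattern reported for the comparisons between tasks.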

Even with a smaller subject pool (25 subjects) for the NC control experiment the correlation of ranks of log(k) between SV and NV tasks stays strong, while the Pearson correlation becomes a bit smaller (Fig. 4A,B; Table S5). To determine whether the correlations observed were within the range expected by chance, we repeatedly (10,000 times) randomly sampled 25 of the original 63 subjects (from Fig. ) and computed the

The correlation between log(k) and ranks of log(k) for the DW experiment is almost perfect (Fig. 4D; Pearson r = .97; Spearman r = .95, all p < 0.01). This suggests that subjects are making choices in day and week delay tasks by converting these delays to a common unit.

We don't find any order effects for the NC control experiment (bootstrapped mean test, SV-LV-SV-LV vs. LV-SV-LV-SV order for SV and LV log(k), respectively, all p > 0.6) or for the DW control experiment (bootstrapped mean test, DV-WV-DV-WV vs. WV-DV-WV-DV order for DV and WV log(k), respectively, all p > 0.2). To confirm the absence of an order effect we also ran an order model with the DW data, where (log(k) ∼ (order|subjid)). The comparison based on the 10-fold cross-validation criterion (using the 'kfold' function in the 'brms' package) between the order and main models is in favor of the latter (order = 3467.59 > 3261.15 = main), since lower is better.