Robust priors for regularized regression

Induction benefits from useful priors. Penalized regression approaches, like ridge regression, shrink weights toward zero but zero association is usually not a sensible prior. Inspired by simple and robust decision heuristics humans use, we constructed non-zero priors for penalized regression models that provide robust and interpretable solutions across several tasks. Our approach enables estimates from a constrained model to serve as a prior for a more general model, yielding a principled way to interpolate between models of differing complexity. We successfully applied this approach to a number of decision and classification problems, as well as analyzing simulated brain imaging data. Models with robust priors had excellent worst-case performance. Solutions followed from the form of the heuristic that was used to derive the prior. These new algorithms can serve applications in data analysis and machine learning, as well as help in understanding how people transition from novice to expert performance.


Introduction
Inference from data is most successful when it involves a helpful inductive bias or prior belief. Regularized regression approaches, such as ridge regression, incorporate a penalty term that complements the fit term by providing a constraint on the solution, akin to how Occam's razor favors solutions that both fit the observed data and are simple. By incorporating such constraints or prior beliefs, the hope is that models will better predict future outcomes.
What makes a good prior belief or inductive bias? In the case of ridge regression, the norm of the regression coefficients is shrunk toward zero (Hoerl & Kennard, 1970;Tikhonov, 1943) to control model complexity and reduce overfitting. However, in many domains, zero is not a reasonable a priori guess for the true association between variables. For example, it would be strange to a priori predict the quality of a new home would be unaffected by the experience of the workers, quality of materials, reputation of the architect, etc. Because the world is somewhat predictable, a prior centered on the origin (Fig. 1) is inappropriate.
If not zero, where does one turn for a useful prior? One answer is to look to human behavior. Humans use an assortment of clever strategies for learning and decision-making that perform well even in conditions of low knowledge. Simple heuristics that are fast and frugal (Czerlinski, Gigerenzer, & Goldstein, 1999) excel when training examples are scarce (Parpart, Jones, & Love, 2018). People can also shift to more complex strategies when resources are available (Rieskamp & Otto, 2006). With increasing experience and expertise, humans often acquire a sophisticated understanding of domains.
Although heuristics are efficient and robust models in their own right, we propose they are a useful starting point or prior for more complete characterizations of domains. Advantages of heuristics include their ecological validity (Czerlinski et al., 1999; Mata et al., 2012) and robustness across decision problems. Their weakness is insensitivity to aspects of the data due to their rigid inductive bias (Geman, Bienenstock, & Doursat, 1992; Parpart et al., 2018). This weakness is ameliorated when heuristics function as priors within more complex models because priors can be overcome by additional data, much like how human experts develop more complex and nuanced knowledge with increasing experience in a domain. When data are abundant, the encompassing model would master the subtleties of the domain, whereas when data are scarce the heuristic prior would help guide predictions and increase robustness. Because the heuristics themselves are interpretable models, the solution of the encompassing model could be understood in terms of deviations from the heuristic prior.

Fig. 1. […] Using priors based on a heuristic (i.e., a constrained model) can increase robustness and interpretability. Eq. (1) is shown at the bottom of panel B. The other equation simply drops the prior vector, equivalent to standard notation for ridge regression, here termed the Zero prior model.

Fig. 2. TAL and TTB decision heuristics. (A) A hypothetical decision: choosing between a rural (−1) or urban (+1) home based on cues ordered by cue validity: low pollution, low price, and proximity to museums. Each cue is coded as −1 when favoring the left option (rural), +1 for the right option (urban), and 0 when the two options are equal on that cue. TAL sums cue values, choosing the rural home. TTB chooses based on the best cue (measured by $\hat{v}$, see Eq. (4)) that distinguishes the options. Here, TTB would choose the urban home based solely on proximity to museums. (B) The covariance of the cues with the criterion (urban or rural home), which $\hat{v}$ measures. (C) The covariance of the cues with one another; TAL and TTB heuristics disregard this information. (D-F) Illustrations of OLS, TAL and TTB. Here, OLS strikes a balance for correlated cues; low price and proximity to museums are (negatively) correlated. Thus, low pollution presents a higher weight $\hat{\beta}_1$. TAL and TTB equate the absolute value of all weights. TTB additionally ranks and scales predictors according to their predictive value $\hat{v}$ in a non-compensatory way (multiplying cues by powers of 2).
Fig. 3. Accuracy and normalized entropy for the 20 datasets in Application I. Training set size was fixed at 50. (A) Test set accuracy across penalty values for the Obesity dataset. At low penalties, the models all agree with OLS (unpenalized). Under strong penalties, the Zero prior model (standard ridge regression) converges to chance performance as weights shrink toward the zero vector. TAL-prior and TTB-prior models converge toward their respective heuristics and are robust. (B) Test set accuracy for all 20 datasets for best and worst performing penalty values for each model (see SI). The OLS permuted prior model is a penalized regression model with a permuted OLS solution as prior. Heuristic prior models are most robust. (C) Normalized entropy (Eq. (17)) for the Professors' Salaries dataset, which measures how compensatory the weights are. The TAL-prior model becomes maximally compensatory as penalty increases, unlike the TTB-prior model. (D) Normalized entropy for all 20 datasets across all penalty values, which orders as TAL-prior, Zero prior, TTB-prior. For B and D, violins represent density estimates for the respective metric. Each dot is one of 20 datasets. For A and C, shaded areas represent 1 standard deviation.

Robust priors based on decision-making heuristics
We used two well-known heuristics, tallying (TAL) and take-the-best (TTB) (Czerlinski et al., 1999), as priors in regularized regression models. These heuristics predict which of two options is preferable. For example, TAL or TTB could predict whether a rural or urban home is preferable based on several cues ( Fig. 2A). Each cue is ternary valued and indicates whether the left (−1) or right (+1) option is preferred on that dimension, with 0 for a tie. TAL is a simple majority voting rule whereas TTB bases its decision on the single most predictive cue that can discriminate between the alternatives (both heuristics explained below; also, see Fig. 2).
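The two decision rules can be illustrated with a minimal sketch. The cue values and validities below are hypothetical numbers chosen to mirror the rural versus urban example (they are not taken from Fig. 2), and the function names are ours.

```python
def sign(value):
    """Return -1, 0, or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def tally(cues, validities):
    """TAL: majority vote over cues, each oriented by the sign of its validity."""
    total = sum(sign(v) * x for x, v in zip(cues, validities))
    return sign(total)  # 0 means no evidence either way (guess)


def take_the_best(cues, validities):
    """TTB: decide on the most valid cue that discriminates between the options."""
    by_validity = sorted(range(len(cues)),
                         key=lambda j: abs(validities[j]), reverse=True)
    for j in by_validity:
        if cues[j] != 0:  # first cue that discriminates decides alone
            return sign(validities[j]) * cues[j]
    return 0  # no cue discriminates (guess)


# Hypothetical coding: -1 favors the left (rural) option, +1 the right (urban).
cues = [-1, -1, +1]            # low pollution, low price, proximity to museums
validities = [0.6, 0.7, 0.9]   # assumed values; museums is the most valid cue

tal_choice = tally(cues, validities)          # two of three cues favor rural
ttb_choice = take_the_best(cues, validities)  # the single best cue favors urban
```

With these assumed numbers the two heuristics disagree, which is the situation the example in the text describes: TAL follows the majority of cues, whereas TTB follows the single best discriminating cue.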
To use a heuristic as a prior, we use a two-step model-fitting procedure (cf. Zou, 2006). In step 1, we fit the heuristic to the training data. The resulting point estimate for the weight vector provides the penalty term (weighted by the penalty parameter $\lambda$) within a regularized regression model in step 2. The penalty term shrinks regression coefficients toward the heuristic solution, as opposed to the zero vector as in ridge regression (Fig. 1). Increasing $\lambda$ increases the strength of the prior, eventually pushing the regression solution to fully agree with the heuristic (cf. Parpart et al., 2018).
Our approach integrates heuristics with full-information (regression) models in a principled way that applies to a broad class of heuristics. The approach is to subtract a carefully constructed vector $\beta_0$ inside the penalty term of the well-known $L_2$ cost function used in ridge regression. The cost function for standard ridge regression is

$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta - \beta_0\|_2^2 \tag{1}$$

where $\|\cdot\|_2$ is the Euclidean ($L_2$) norm, $y$ is the dependent variable $[y_1, \dots, y_n]^\top$, $X$ is an $n \times m$ matrix with one column for each of the $m$ predictor variables, $\beta_0$ is a column vector of zeros $\vec{0} = [0_1, \dots, 0_m]^\top$, $\beta$ is a vector of estimated regression coefficients $[\beta_1, \dots, \beta_m]^\top$, and $\lambda \geq 0$ is a tunable penalty parameter.

Footnote 1. Note that implicit priors are also found in regularized regression with all other norms ($L_p$) too, including LASSO regression ($L_1$). Thus, our insight generalizes to all other norms as well.

The first term inside the arg min of Eq. (1) promotes goodness-of-fit in the model, whereas the second term, known as the penalty term, promotes smaller weights $\beta$. As $\lambda$ increases, the weights tend to $\beta_0$ in the limit (Fig. 1). (The derivation of the optimal weights $\beta^*$ is included in the SI.) However, when $\lambda = 0$, the model is equivalent to ordinary least squares (OLS) regression. OLS regression estimates coefficients without the penalty term:

$$\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 \tag{2}$$

Normally, $\beta_0$ is not included in Eq. (1); it is implicit in more standard specifications of ridge regression, where the penalty term is written simply as $\|\beta\|_2^2$. Nevertheless, instead of a zero vector, one can generalize ridge regression with alternative constructions of $\beta_0$ (Fig. 1). As argued above, choosing this vector intelligently might improve learning of $\hat{\beta}$ by imposing a more sensible inductive bias (Geman et al., 1992).

Although the decision-making literature has traditionally proposed that humans use certain classes of heuristics due to cognitive limitations (Bobadilla-Suarez & Love, 2018; Kahneman, Slovic, & Tversky, 1974; Simon, 1978), heuristics have also been justified from their ecological validity (Czerlinski et al., 1999; Dawes, 1979; Parpart et al., 2018). That is, the inductive biases they embody agree with the statistical structure of many natural environments, thus leading to better performance. Taking inspiration from the TAL and TTB heuristics, both in their success in describing human decision making (Bobadilla-Suarez & Love, 2018; Bröder, 2003; Otworowska, Blokpoel, Sweers, Wareham, & van Rooij, 2018) and in application to real world statistical problems (Czerlinski et al., 1999; Parpart et al., 2018), we propose a construction of $\beta_0$ based on these heuristics.
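As a hedged illustration of ridge regression with a nonzero prior, the sketch below uses the standard closed-form minimizer of a squared-error fit term plus a squared Euclidean penalty centered on a prior vector. Setting the gradient to zero gives (XᵀX + λI)β = Xᵀy + λβ₀. This is textbook algebra rather than the derivation in the SI, and the data, the prior vector, and the function name are made up for the example.

```python
import numpy as np


def ridge_with_prior(X, y, beta0, lam):
    """Minimize ||y - X b||^2 + lam * ||b - beta0||^2 in closed form."""
    m = X.shape[1]
    lhs = X.T @ X + lam * np.eye(m)
    rhs = X.T @ y + lam * beta0
    return np.linalg.solve(lhs, rhs)


# Toy data (assumed, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)
beta0 = np.array([1.0, -1.0, 1.0])  # hypothetical nonzero prior

beta_unpenalized = ridge_with_prior(X, y, beta0, lam=0.0)  # equals OLS
beta_strong = ridge_with_prior(X, y, beta0, lam=1e8)       # close to beta0
beta_zero_prior = ridge_with_prior(X, y, np.zeros(3), lam=1e8)  # close to 0
```

The two extremes make the interpolation concrete: with no penalty the estimate coincides with OLS, and with a very strong penalty it collapses onto whichever prior vector was supplied.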
Below we discuss how to construct priors from TAL and TTB. We then report three applications. In the first application, we compared generalization performance (test set accuracy) and interpretability of model solutions on 20 classical datasets previously used in the decision-making literature (Czerlinski et al., 1999;Parpart et al., 2018). There, the decision-making problem was to choose the better item within a pair (see Fig. 2A and below). In the second application, we evaluated our approach within a classification paradigm in which a single item is assigned to one of two classes (e.g., friend or foe?). In the final application, we demonstrated the generality and benefits of our approach by analyzing simulated brain imaging data where the prior is derived from a technique (Mumford, Turner, Ashby, & Poldrack, 2012) that seeks to minimize collinearity amongst predictors in a manner that parallels how we derive heuristic priors.

TAL and TTB heuristics
TAL and TTB do not adapt their form or complexity in light of the data. For example, TAL is an equal-weights algorithm that uses only the signs of the coefficients (Czerlinski et al., 1999; Dawes, 1979): the estimated weights $\hat{\beta}_j$ are constrained to be either −1 or 1 (Fig. 2E).
The Tallying decision rule (TAL) is defined as

$$\hat{y} = \operatorname{sign}\Big(\sum_{j=1}^{m} \operatorname{sign}(\hat{v}_j)\, x_j\Big) \tag{3}$$

where

$$\hat{v}_j = \frac{R_j - W_j}{R_j + W_j}. \tag{4}$$

A cue's estimated cue validity $\hat{v}_j$ is defined as the difference between the numbers of correct predictions $R_j$ and incorrect predictions $W_j$, divided by the total number of predictions across all observations $R_j + W_j$ (Martignon, Hoffrage, et al., 1999). Observations that do not present a prediction (i.e., $x_j = 0$) are ignored. Notably, cue validities depend only on the relationship between each cue and the outcome, and not on covariance between cues. Thus, the definition of $\hat{v}$ is what makes heuristics insensitive to cue covariance. When assessing model performance, the validities $\hat{v}$ are estimated for each training set.
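The cue validity computation described above can be sketched in a few lines. The example assumes ternary cue values (−1, 0, +1) and outcomes coded −1 or +1; returning 0.0 for a cue that never makes a prediction is our assumption, since the text does not specify that case.

```python
def cue_validity(cue_values, outcomes):
    """(correct - incorrect) / (correct + incorrect), ignoring cue values of 0."""
    correct = sum(1 for x, y in zip(cue_values, outcomes) if x != 0 and x == y)
    incorrect = sum(1 for x, y in zip(cue_values, outcomes) if x != 0 and x == -y)
    predictions = correct + incorrect
    if predictions == 0:
        return 0.0  # assumption: a cue that never discriminates gets validity 0
    return (correct - incorrect) / predictions


# Five hypothetical paired comparisons for one cue.
cue = [1, -1, 0, 1, -1]
outcome = [1, -1, 1, -1, -1]

# Three correct predictions, one incorrect, one ignored tie.
validity = cue_validity(cue, outcome)
```

Because each cue is scored against the outcome on its own, nothing in this computation looks at how cues covary with one another.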
The Take-the-Best (TTB) decision rule is

$$\hat{y} = \operatorname{sign}(\hat{v}_{j^*})\, x_{j^*}, \qquad j^* = \arg\max_{j \,:\, x_j \neq 0} |\hat{v}_j|.$$

Whereas TAL sums the signs of the predictors to determine its response, TTB relies on the top predictor that differentiates the two options. When there is no evidence for either option, both TAL and TTB choose randomly (with probability $p = 0.5$ for each option). This occurs for TAL when Eq. (3) yields 0 and for TTB when every $x_j$ equals 0.

We now define $\beta_0$ based on the TAL heuristic, referred to as $\beta_{\mathrm{TAL}}$. First, we determine a scalar coefficient $\hat{c}$ shared across all predictor variables:

$$\hat{c} = \arg\min_{c} \; \|y - X (c\, \hat{s})\|_2^2$$

This equation is the same as Eq. (2) except that the vector $\beta$ has been replaced by the product of a scalar $c$ and a column vector $\hat{s}$, which has cue directionalities $\hat{s} = \operatorname{sign}(\hat{v})$. Using this scalar, we define:

$$\beta_{\mathrm{TAL}} = \hat{c}\, \hat{s}. \tag{8}$$

To construct $\beta_{\mathrm{TTB}}$, we build from the intuition that TTB is equivalent to a noncompensatory weight vector such as $2^{\hat{r}}$, where $\hat{r}$ is a vector of ascending ranks for the absolute cue validities, $\hat{r}_j = |\{ j' : |\hat{v}_{j'}| < |\hat{v}_j| \}|$. Paralleling the definition of $\beta_{\mathrm{TAL}}$, for $\beta_{\mathrm{TTB}}$ we also determine a shared scalar:

$$\hat{c} = \arg\min_{c} \; \|y - X_{\mathrm{TTB}} (c\, \hat{s})\|_2^2, \qquad \beta_{\mathrm{TTB}} = \hat{c}\, \hat{s}. \tag{9}$$

However, we have a new design matrix in Eq. (9), defined by

$$X_{\mathrm{TTB}} = X \hat{D} \tag{11}$$

with $\hat{D}$ being a diagonal matrix

$$\hat{D} = \operatorname{diag}\big(b^{\hat{r}_1}, \dots, b^{\hat{r}_m}\big), \qquad b := 2.$$

This transformation has the effect of encoding cue validity directly into the design matrix by scaling each regressor according to a geometric progression. In order for $\beta_{\mathrm{TTB}}$ to function appropriately as a prior $\beta_0$, the original design matrix $X$ is replaced with $X_{\mathrm{TTB}}$ in Eq. (1). When instead $b := 1$, we recover the TAL prior. Note also that this entire procedure is nearly equivalent to working with the original design matrix and taking $\beta_0$ to be proportional to $\hat{D}\hat{s}$ (the vector of signed exponentiated ranks, $\hat{s}_j b^{\hat{r}_j}$) rather than to $\hat{s}$, except that the weights are differentially penalized according to their ranks.

With these priors defined, we can now formally specify two regularized regression models. The TAL-prior model is defined by Eq. (1) with $\beta_0 = \beta_{\mathrm{TAL}}$. The TTB-prior model is defined by Eq. (1) with $\beta_0 = \beta_{\mathrm{TTB}}$ and with $X$ replaced by $X_{\mathrm{TTB}}$. In contrast to OLS, the use of the common scalar for all cues in the prior for both TAL-prior and TTB-prior highlights that both heuristics are insensitive to cue covariance information (see Fig. 2E,F). For TAL-prior, the common scalar reflects the fact that TAL is a (fully) compensatory strategy, whereas the design matrix in TTB-prior, $X_{\mathrm{TTB}}$, reflects the fact that TTB is a non-compensatory strategy. Later, we will evaluate how these differing priors affect the nature of penalized regression solutions.
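The construction of the two heuristic priors can be sketched as follows, under our reading of the text: cue directions are the signs of the cue validities, a single scalar shared by all cues is fit by least squares, and for TTB each column of the design matrix is first scaled by a power of 2 given by the ascending rank of its absolute validity. All function names and the small dataset are hypothetical.

```python
import numpy as np


def cue_validities(X, y):
    """Per-cue (correct - incorrect) / (correct + incorrect), ignoring zeros."""
    agree = (X * y[:, None] > 0).sum(axis=0)
    disagree = (X * y[:, None] < 0).sum(axis=0)
    total = agree + disagree
    return np.where(total > 0, (agree - disagree) / np.maximum(total, 1), 0.0)


def ascending_ranks(v):
    """Rank of cue j = number of cues with strictly smaller absolute validity."""
    a = np.abs(v)
    return (a[None, :] < a[:, None]).sum(axis=1)


def shared_scalar(X, y, s):
    """Least-squares scalar c minimizing ||y - X (c s)||^2."""
    z = X @ s
    denom = float(z @ z)
    return float(z @ y) / denom if denom > 0 else 0.0


def tal_prior(X, y):
    s = np.sign(cue_validities(X, y))
    return shared_scalar(X, y, s) * s


def ttb_prior(X, y, base=2.0):
    """Return the prior vector and the rank-scaled design matrix it belongs to."""
    v = cue_validities(X, y)
    s = np.sign(v)
    X_scaled = X @ np.diag(base ** ascending_ranks(v).astype(float))
    return shared_scalar(X_scaled, y, s) * s, X_scaled


# Small hand-made example with ternary cues and outcomes in {-1, +1}.
X = np.array([[1, 1, -1], [1, -1, 0], [-1, 1, 1], [1, 0, 1],
              [-1, -1, 1], [-1, 0, -1], [1, 1, 1], [-1, -1, -1]], dtype=float)
y = np.array([1, 1, -1, 1, -1, -1, 1, -1], dtype=float)

beta_tal = tal_prior(X, y)
beta_ttb, X_ttb = ttb_prior(X, y)
```

In this sketch, calling `ttb_prior` with `base=1.0` leaves the design matrix unchanged and reproduces the TAL prior, which matches the remark that a base of 1 recovers TAL.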

Logistic ridge regression
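The first two applications reported here use logistic ridge regression (Le Cessie & Van Houwelingen, 1992; Schaefer, Roi, & Wolfe, 1984; van Wieringen, 2015). To estimate weights for penalized logistic regression, $\hat{\beta}(\lambda)$, we first obtain a scale parameter $\hat{c}$ for an unpenalized logistic regression via maximum likelihood, where as above the weight vector is constrained to be proportional to the cue directionalities $\hat{s}$:

$$\hat{c} = \arg\max_{c} \; \mathcal{L}(c\, \hat{s}).$$

The likelihood for logistic regression is as usual:

$$\mathcal{L}(\beta) = \prod_{i=1}^{n} \sigma(X_{i\cdot}\beta)^{y_i} \big(1 - \sigma(X_{i\cdot}\beta)\big)^{1 - y_i}, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $X_{i\cdot}$ denotes the $i$th row of $X$. We then define $\beta_0$ as

$$\beta_0 = \hat{c}\, \hat{s}.$$

We then insert $\beta_0$ into our final objective function for regularized logistic regression:

$$\hat{\beta}(\lambda) = \arg\max_{\beta} \; \log \mathcal{L}(\beta) - \lambda \|\beta - \beta_0\|_2^2.$$

See Supplementary Information (SI) for an approximation of $\hat{\beta}(\lambda)$ for regularized logistic regression via the Newton-Raphson iterative algorithm.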
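A minimal sketch of penalized logistic regression that shrinks toward a prior vector is given below. It uses plain Newton-Raphson updates on the log-likelihood minus a squared Euclidean penalty centered on the prior. It is our own illustration, not the SI algorithm: the scaling of the penalty, the starting point, the stopping rule, and the toy data are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_ridge_with_prior(X, y01, beta0, lam, n_iter=50, tol=1e-10):
    """Maximize log-likelihood(b) - lam * ||b - beta0||^2 by Newton-Raphson."""
    beta = beta0.astype(float).copy()  # start at the prior
    eye = np.eye(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y01 - p) - 2.0 * lam * (beta - beta0)
        # Negative Hessian of the penalized objective (positive definite if lam > 0).
        neg_hess = X.T @ (X * (p * (1.0 - p))[:, None]) + 2.0 * lam * eye
        step = np.linalg.solve(neg_hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def penalized_gradient(X, y01, beta, beta0, lam):
    return X.T @ (y01 - sigmoid(X @ beta)) - 2.0 * lam * (beta - beta0)


# Toy ternary cues and 0/1 outcomes (assumed, for illustration only).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(60, 3)).astype(float)
y01 = (rng.random(60) < sigmoid(X @ np.array([1.5, -1.0, 0.5]))).astype(float)
beta0 = np.array([0.5, -0.5, 0.5])  # hypothetical heuristic prior

beta_moderate = logistic_ridge_with_prior(X, y01, beta0, lam=1.0)
beta_strong = logistic_ridge_with_prior(X, y01, beta0, lam=1e6)
```

Starting the iterations at the prior is a convenience here: under a strong penalty the solution is already close to the starting point, and the penalty keeps the Newton system well conditioned.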

Application I: Heuristic decision making
Regularized regression models with heuristic priors were evaluated on the 20 datasets that have been previously used to compare heuristics with ordinary least squares (OLS) regression (Czerlinski et al., 1999;Katsikopoulos, Schooler, & Hertwig, 2010;Parpart et al., 2018). For each of the 20 problems, the cues for the two options on each trial were binary valued (see Methods below for more details), which leads to ternary-valued inputs according to our coding scheme (see Fig. 2A).

Methods
The preprocessed data were retrieved from an Open Science Framework (OSF) repository (Parpart, Jones, & Love, 2019), which was used to evaluate the half-ridge and COR models by Parpart et al. (2018). In accord with previous research, cue attributes were dichotomized by median split (Czerlinski et al., 1999; Parpart et al., 2018).
The data were transformed into a format appropriate for decision-making problems where all pairwise comparisons between observations were encoded as the signed differences in (binary) attributes (possible values: −1, 0, and +1). The decision in our coding scheme is −1 for the left choice and +1 for the right choice, which was mapped to 0 and 1 for logistic regression: e.g., in the homestead example (Fig. 2A), rural is coded as −1 (or 0 for logistic models) and urban is coded as +1. This is common procedure in the decision-making literature (Czerlinski et al., 1999; Katsikopoulos et al., 2010; Parpart et al., 2018). Formally, this consists of training pairs $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \{-1, 0, 1\}^m$. Training sets consisted of 50 training pairs from which the priors were learned. All results were computed for 1000 iterations (i.e., different partitions into training and test sets) for all penalty values.
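The pairwise coding can be sketched as below. The example assumes each object has binary (0/1) attributes and a numeric criterion, takes the first object of each pair as the left option, and skips pairs that tie on the criterion; these choices, like the function name, are our assumptions rather than details stated in the text.

```python
from itertools import combinations


def pairwise_comparisons(attributes, criterion):
    """Encode every pair of objects as signed cue differences (right - left)."""
    X, y = [], []
    for left, right in combinations(range(len(attributes)), 2):
        if criterion[left] == criterion[right]:
            continue  # assumption: pairs tied on the criterion are dropped
        X.append([r - l for l, r in zip(attributes[left], attributes[right])])
        y.append(1 if criterion[right] > criterion[left] else -1)
    return X, y


# Three hypothetical objects with two binary attributes each.
attributes = [[1, 0], [0, 0], [1, 1]]
criterion = [5, 2, 9]

X_pairs, y_pairs = pairwise_comparisons(attributes, criterion)
y_logistic = [(label + 1) // 2 for label in y_pairs]  # map -1/+1 to 0/1
```

Each row of `X_pairs` is ternary: +1 when only the right option has the attribute, −1 when only the left option has it, and 0 when the two options match.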
As the penalty parameter increases, the penalized regression models with the TAL and TTB priors converge to their corresponding heuristics (Fig. 3A). As a sanity check, the TAL-prior and TTB-prior models were validated on simulated data in the SI (Figures S1 and S2, respectively) by tracking their agreement with OLS predictions. Effectively, agreement with OLS is higher for low penalty values and agreement with TAL-prior or TTB-prior is higher for high penalty values. Individual plots for each of the twenty datasets are also included in the SI (Figures S3 and S4).
Although regression models are interpretable in that each feature's importance follows from its weight, the heuristic penalty terms make clear how the prior shapes the solution and how the solution differs from the prior, which itself is an interpretable solution. To evaluate how the form of the solution changes as a function of the prior, we calculated normalized Shannon entropy, $\tilde{H}$, defined in Eq. (17), where the base of the rank-based cue scaling is $b := 1$ for TAL-prior and $b := 2$ for TTB-prior, and $\|\cdot\|_1$ is the $L_1$ norm, such that $\tilde{H} \in [0, 1]$ for any number ($m$) of predictors. Eq. (17) provides an intuitive measure of how compensatory a solution is. The measure will peak at 1 when the predictive force of the weights is uniform, as in the TAL heuristic.
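The idea behind the entropy measure can be illustrated with a generic normalized Shannon entropy of relative absolute weights. This sketch is an assumption on our part: it normalizes absolute weights by their sum and divides the entropy by the logarithm of the number of predictors, and it omits the rank-based scaling that the paper's Eq. (17) applies for the TTB-prior model.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of |w| / sum(|w|), divided by log(m) so it lies in [0, 1]."""
    m = len(weights)
    total = sum(abs(w) for w in weights)
    if m < 2 or total == 0:
        return 0.0  # assumption: degenerate inputs are scored as 0
    shares = [abs(w) / total for w in weights]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(m)


uniform = normalized_entropy([0.4, -0.4, 0.4])     # TAL-like, equal magnitudes
single_cue = normalized_entropy([1.0, 0.0, 0.0])   # one cue carries everything
geometric = normalized_entropy([4.0, 2.0, 1.0])    # TTB-like, powers of 2
```

Equal-magnitude weights give the maximum value of 1, a single dominant cue gives 0, and geometrically decaying weights fall in between, which is the sense in which the measure tracks how compensatory a solution is.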

Results
As predicted, these models are robust across the range of $\lambda$ values because they converge to a reasonable estimate (i.e., a sensible heuristic). In contrast, while ridge regression performs well overall, its performance suffers at higher penalty values as its weights are pulled toward the zero vector. The robustness of the penalized regression models with heuristic priors held across the 20 datasets (Fig. 3B). Notice that regularization using any nonzero prior is not sufficient for robustness: an ad hoc nonzero prior (OLS Permuted Prior) was not robust. The OLS permuted prior model is a penalized regression model with a permuted OLS solution as prior (i.e., where the weights from the OLS solution have been permuted).
We confirmed that $\tilde{H}$ for the TAL-prior model would converge to 1 with increasing penalty $\lambda$, in contrast to the TTB-prior model. We also predicted ridge regression's $\tilde{H}$ would be somewhat lower than TAL-prior's. That is, convergence to a zero weight vector for standard ridge regression is nonsensical and effectively resisted in the optimization, providing more heterogeneous weights than otherwise expected. These predictions held (Figs. 3C,D).
The results presented here hold under an alternative training scheme where we also evaluate OLS as a prior itself (Splitting training data in the SI). OLS performs worse as a prior on the majority of datasets ( Figure S7 in the SI) and also shows higher variance overall ( Figure S8 in the SI).
In these 20 decision problems, models using priors based on TAL and TTB were robust across the entire range of prior strengths. These penalized regression models shrunk to a reasonable prior based on a simple heuristic that discards covariance information amongst predictive cues. The forms of the solutions were interpretable and followed from the priors.

Application II: Breast Cancer classification
In this application, we conducted the same analyses as in Application I, but for a classification problem as opposed to a forced choice between two options. We applied the models to the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI data repository (Blake, Keogh, & Merz, 1995). In this task, models predicted whether an item was cancerous or not based on binary features (see Methods for more details). The predictors were discrete as in Application I, though the identical approach would apply to continuous predictors or to a mixture of discrete and continuous predictors.

Fig. 4. (A) […] The key finding is that models with heuristic priors are most robust. (B) Normalized entropy (Eq. (17)) averaged across the range of penalty values reflects how compensatory a model's predictions are, led by TAL-prior, followed by the Zero prior, and finally the TTB-prior model. Each dot represents one of the tested penalty values averaged over 1000 train-test splits. The gray violins represent the respective density estimates in both panels.

Methods
The data comprise nine cues ($m = 9$) that describe characteristics of the cell nuclei present in digitized images of fine needle aspirate (FNA) of breast masses (Blake et al., 1995). Data points with missing cue values were removed, resulting in a total of $n = 478$ observations. All variables were binarized by median split.
In an analogous fashion to how we constructed $X$ in Application I, here we transformed the original data by median splits. For each cue, if the value was equal to the median, it received a value of 0, if it was above the median it was equal to +1 and if it was below the median it was equal to −1. Formally, this also produces training pairs $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \{-1, 0, 1\}^m$. Training sets consisted of 100 training pairs from which the priors were learned. However, we did not construct a matrix of pairwise comparisons of observations as before. The dependent variable was binary, $y_i \in \{-1, +1\}$, coding for malignant and benign tumors, respectively. This preprocessing of the data is closer to the way regression models are calculated for everyday applications. Both the mean test accuracy (Fig. 4A and Figure S5 in the SI) and the mean normalized entropy (Fig. 4B and Figure S6 in the SI) were averaged over 1000 iterations for each penalty value.
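The median-split coding described above amounts to taking the sign of each value's deviation from its column median. A minimal sketch, with made-up values and a hypothetical function name:

```python
from statistics import median


def median_split_ternary(column):
    """Code each value as +1 above the median, -1 below it, and 0 at the median."""
    med = median(column)
    return [(value > med) - (value < med) for value in column]


odd_length = median_split_ternary([1, 2, 3, 4, 5])   # median 3 is coded 0
even_length = median_split_ternary([1, 1, 2, 3])     # median 1.5, no zeros
```

Zeros arise only when an observed value coincides with the median, so columns with an even number of distinct values can end up purely binary.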

Results
The results were in accord with Application I. The models with a heuristic prior were robust across the range of $\lambda$ values (Fig. 4A). As in Application I, the priors shaped the form of the solution in the predicted manner (Fig. 4B) with the TAL-prior model having the most compensatory solutions.

Application III: Estimation in brain imaging analyses
In Applications I and II, the task was to generalize from training items to make decisions about test items. In Application III, the objective was to estimate the weights themselves. We considered simulated functional magnetic resonance imaging (fMRI) time series that allowed for comparing estimates to ground truth.
Brain imaging datasets are challenging to analyze because they measure the brain's hemodynamic response, which is a temporally and spatially autocorrelated, high-dimensional, noisy, and time-lagged signal. The signal is composed of thousands of voxels (voluminous pixels) with coordinates in space ($i$, $j$, $k$) and time ($T$ time-points). Correlations across space and time due to psychological (e.g., Visscher, Kahana, and Sekuler (2009)), neurovascular (Boynton, Engel, & Heeger, 2012) and physical (Smith et al., 1999) effects complicate the independence and linearity assumptions used to model the signal in each voxel. Furthermore, the observed blood-oxygen-level dependent (BOLD) signal is only indirectly related to the outcome variable of interest (neural activity), via the hemodynamic response function (HRF), which is normally modeled as a double gamma function.
In task fMRI, the BOLD time series for a voxel is modeled by weighting events, such as a sequence of pictures (e.g., dog, truck, face, etc.) presented to a study participant. In addition to nuisance regressors, one typically estimates a beta weight for each event (convolved with the HRF). We refer to this standard method as least squares all (LSA), which is unpenalized and plays a role analogous to OLS in Applications I and II.
However, for the reasons discussed above, collinearity in the time series can compromise parameter estimation (Mumford, Poline, & Poldrack, 2015), particularly in rapid event designs (e.g., trial duration of one or two seconds). One proposed solution, which we refer to as least squares separate (LSS), is to estimate a separate model for each event rather than a single model for all events (Rissman, Gazzaley, & D'esposito, 2004). Each model estimates one beta weight for the target event (i.e., trial) and a second shared beta weight for all other events (Turner, 2010). In practice, LSS produces better (less variable) estimates by being less sensitive to collinearity in the time series (Mumford et al., 2012).
We view LSS as analogous to the heuristics considered in Applications I and II. The TAL and TTB heuristics are insensitive to cue covariance. Specifically, cue validity and cue direction are estimated individually for each predictor. Moreover, we implemented these heuristics in a regression framework with a single beta weight (e.g., $\hat{c}$) to derive a prior. In both models, simplification is achieved by forcing multiple predictors to share a single regression weight. Analogously, each LSS model forces all but the target event (out of potentially hundreds of events) to share a common beta weight.
As with TAL and TTB, we predicted that LSS would provide an effective prior for a penalized regression model because it provides a reasonable and robust starting point to move from when the data warrant. We predicted that a penalized regression model with an LSS prior would outperform both LSS (high $\lambda$) and the LSA approach ($\lambda = 0$).

Methods
To build a continuum of models between LSA and LSS, we include the weights derived from LSS as a target (i.e., prior) in the penalty term within a regularized LSA model. Thus, the weights from the LSS-prior model are estimated with the following objective:

$$\hat{\beta}_{\mathrm{LSS\text{-}prior}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta - \hat{\beta}_{\mathrm{LSS}}\|_2^2 \tag{19}$$

Paralleling our treatment of the decision heuristics as priors, Eq. (19) specifies a continuum of models ranging from LSA ($\lambda = 0$) to LSS ($\lambda \to \infty$). For all models, $y$ is the activation time series for a single voxel; with spatial indices (i.e., coordinates in brain space) its notation is $y_{ijk}$. Both LSA and LSS are known as massive univariate GLMs, since they model each voxel independently. For a given voxel, LSA estimates weights as

$$\hat{\beta}_{\mathrm{LSA}} = (X^\top X)^{-1} X^\top y$$

where $y$ is the BOLD response time series for a voxel and $X$ is the $T \times N$ design matrix with number of columns equal to the number of trials $N$, with only one event per trial. (Each trial is an event for LSA, but this changes for LSS.) This means that a column in $X$ models a single event in the experiment. The number of brain scans or time-points $T$ is usually larger than the number of trials (events) $N$ ($T > N$) because more than one brain scan is acquired per trial. Quite commonly, a regressor models an event (such as stimulus presentation) with a boxcar function that models the duration of the stimulus, convolved with a double gamma HRF (Boynton et al., 2012). We will not focus here on how the regressors that model the BOLD signal are constructed. Instead, we focus on the GLMs that receive those regressors as input.
The LSS model differs from the LSA model in that the matrix $X$ is replaced with a set of matrices $X_1, \dots, X_N$, which results in one GLM per trial:

$$\hat{\beta}_{\mathrm{LSS},t} = \mathbf{c}\, (X_t^\top X_t)^{-1} X_t^\top y, \qquad t = 1, \dots, N.$$

Each $X_t$ has dimensions $T \times 2$, where $T$ is the same as before. Each weight $\hat{\beta}_{\mathrm{LSS},t}$ is selected as the first coefficient from its respective GLM, via multiplication by $\mathbf{c} = [1 \;\; 0]$. Each $X_t$ is constructed as mentioned above, with the first predictor variable modeling a single experimental trial of interest (i.e., the $t$th trial) and the second predictor (position $[0 \;\; 1]$) being a nuisance variable modeling all other trials in the experiment (i.e., all $N - 1$ trials excluding trial $t$). The LSS-prior model in the main text uses $\hat{\beta}_{\mathrm{LSS}}$ as $\beta_0$.
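The three estimators for a single voxel can be sketched as follows. The sketch makes several assumptions that are ours: the nuisance regressor in each LSS model is taken to be the sum of the regressors of all other trials, the penalized estimate uses the standard closed form for a squared penalty centered on a prior, and the design matrix is random toy data rather than HRF-convolved regressors.

```python
import numpy as np


def lsa(X, y):
    """Least squares all: one GLM with one regressor per trial."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def lss(X, y):
    """Least squares separate: one two-regressor GLM per trial."""
    n_trials = X.shape[1]
    all_trials = X.sum(axis=1)
    betas = np.empty(n_trials)
    for t in range(n_trials):
        target = X[:, t]
        nuisance = all_trials - target  # assumption: sum of all other trials
        X_t = np.column_stack([target, nuisance])
        betas[t] = np.linalg.lstsq(X_t, y, rcond=None)[0][0]  # keep first weight
    return betas


def lss_prior(X, y, lam):
    """Penalized LSA that shrinks toward the LSS estimates."""
    beta0 = lss(X, y)
    lhs = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(lhs, X.T @ y + lam * beta0)


# Toy voxel: 30 time-points, 5 trials (assumed, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.5, -0.5, 2.0, 0.0]) + rng.normal(size=30)

beta_lsa = lsa(X, y)
beta_lss = lss(X, y)
beta_mid = lss_prior(X, y, lam=5.0)
```

Sweeping `lam` from 0 upward traces the continuum described in the text, from the LSA estimates at one end to the LSS estimates at the other.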

Simulated fMRI data
There were 1000 simulations performed for each of 9 different designs (see below) with varying levels of signal-to-noise ratio (SNR) and interstimulus intervals (ISI; time between events). The simulations were performed on modified code from the rsatoolbox , which can be consulted at: https://github.com/bobaseb/rsa_toolbox_lss/tree/develop/LSS_project Each simulation consisted of a cluster (i.e., region of interest) of task-sensitive signal voxels with observed data generated for all trials by weights ∈ R × × × , where = and each spatial dimension = 7. The weights were embedded in an array tripled along each spatial dimension ∈ R ×3 ×3 ×3 (i.e., the simulated brain). The weights for non-task-sensitive voxels in (i.e., those not in ) were set to zero. Scanner noise ∈ R ×3 ×3 ×3 had entries drawn i.i.d. from a centered normal distribution  (0, 2 ), where 2 = 10 000, and was added to to generate the observed signal: Thus, for a single voxel described by a set of spatial coordinates , , , we have data across time in ∈ R ×3 ×3 ×3 , represented as . For observations , the subset corresponding to voxels that are task-sensitive is denoted as . Notice the use of to generate simulated data instead of . In fact, there is no straightforward way to construct weights (embedded in ) to multiply with the set of matrices . To simulate spatiotemporal correlations in the data, the scanner noise was smoothed along its four axes for each run (two runs total, see below), using a Gaussian spatiotemporal smoothing kernel with full width at half maximum (FWHM) equal to 4 mm for the three spatial dimensions and 4.5 s for the temporal dimension. (Voxel size was set in millimeters at the default value of 3 × 3 × 3.75 in the rsatoolbox.) For each simulation, each coordinate of the effect center ( , , , as defined in the rsatoolbox) -where the signal voxels were placed inside the simulated brain -was uniformly sampled between 1 and 11 inclusive. 
Two separate runs ( = 2) were simulated on each of the 1000 iterations and each run had 20 repetitions of each of two stimulus types ( = 20 s and = 2). Simulating more than one run and stimulus type contributes to the ecological validity of the simulation, especially for studies that focus on classification (MVPA) where one run is used for training and another for testing. Repetition time (TR; duration for obtaining one full brain scan) was set to 1 s and event duration (ED, the duration of a stimulus on the screen in the MRI scanner) was set to 1.5 s.
A trial's duration is given by + . There are also ⌈ ∕3⌉ null epochs, randomly interspersed with the trials, where no stimulus is shown, each with a duration of + seconds. This kind of experimental design is common because it further helps reduce collinearity between trials and aid in the estimation of . Thus, for each run ⪆ 4 3 ×( + )+ , where is a temporal slack after the last trial that allows the BOLD signal enough time to decay. The exact number of time-points depends on the HRF model that was used (Boynton et al., 2012). This information is encoded in the design matrix . To sample the data-generating weights with correlations between (task-sensitive) voxels, we did the following: For each of the ∕ = 20 trials per stimulus, we sampled from  ( , ). Each entry 1 , … , 3 in mean vectors 1 and 2 (for each stimulus, respectively) was i.i.d., drawn from a normal distribution  (0, 2 ) for three levels of SNR ( 2 ∈ {10, 15, 20}). These were sampled for each iteration (a thousand iterations total) in each of the nine designs (Fig. 5) but kept constant across runs. The covariance matrix , with dimensions 3 × 3 , induces the correlations between task-sensitive voxels and was kept constant across runs but resampled on different iterations. It was drawn from a scaled Wishart distribution ( , )∕ with degrees of freedom = 3 . The symmetric positive definite matrix was constructed with ones on the diagonal and 0.7 for all off-diagonal values, representing a high degree of correlation between task-sensitive voxels. As presented in Fig. 5, the 3 × 3 design of the simulations had three levels of ISI ∈ {2, 3, 4} (in seconds) and three levels of SNR (as mentioned above).
After sampling all × 3 weights for a run, we have the object ∈ R × 3 . This matrix of weights (trials by voxels) was permuted along the temporal dimension and was arbitrarily mapped to the spatial coordinates of -and by implication, of too -such that → → , before applying Eq. (22).
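The weight-sampling procedure described above can be sketched as follows. This is a minimal illustration rather than the simulation code used for the reported results: the function name, the number of task-sensitive voxels, and the choice of degrees of freedom equal to the matrix dimension are assumptions made for the example, and the Wishart draw is built from outer products of Gaussian vectors rather than from a dedicated library routine.

```python
import numpy as np

def sample_run_weights(n_voxels=8, n_trials_per_stim=20, signal_var=10.0,
                       off_diag=0.7, seed=0):
    """Sample trial-by-voxel weights for two stimulus types (illustrative).

    n_voxels is a hypothetical count of task-sensitive voxels.
    """
    rng = np.random.default_rng(seed)

    # Scale matrix: ones on the diagonal, 0.7 everywhere else.
    scale = np.full((n_voxels, n_voxels), off_diag)
    np.fill_diagonal(scale, 1.0)

    # Scaled Wishart draw: sum of df outer products of N(0, scale) vectors,
    # divided by df (df = n_voxels is an assumption of this sketch).
    df = n_voxels
    g = rng.multivariate_normal(np.zeros(n_voxels), scale, size=df)
    cov = g.T @ g / df

    # One mean vector per stimulus type, entries i.i.d. N(0, signal_var).
    means = rng.normal(0.0, np.sqrt(signal_var), size=(2, n_voxels))

    # Trial-wise weights: multivariate normal around each stimulus mean.
    weights = np.vstack([
        rng.multivariate_normal(m, cov, size=n_trials_per_stim) for m in means
    ])
    labels = np.repeat([0, 1], n_trials_per_stim)

    # Shuffle the trial order (permutation along the temporal dimension).
    order = rng.permutation(len(labels))
    return weights[order], labels[order], cov
```

Calling `sample_run_weights(signal_var=15.0)` would correspond to one of the three SNR levels; in a full simulation the mean vectors and covariance matrix would be held fixed across the two runs of an iteration, which this single-run sketch does not show.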

Model scoring
We evaluated the models with the root mean squared error (RMSE) of each model's estimated weights (i.e., for the LSA, LSS, LSA-prior and LSS-prior models) with respect to the ground-truth weights, averaged across all the weights for task-sensitive voxels and the 1000 iterations of simulated data.
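As a minimal sketch of this scoring step, the function below computes a pooled RMSE between estimated and ground-truth weights. The array layout (iterations by trials by voxels) and the choice to pool all squared errors before taking the square root are assumptions of the example; averaging per-iteration RMSE values would be an alternative reading.

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean squared error pooled over every entry of the two arrays.

    Both inputs are assumed to share one shape, for example
    (iterations, trials, task-sensitive voxels).
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))
```

The same function would be applied separately to the estimates of each model so that their scores can be compared on identical simulated data.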

Results
Our main prediction held (see Fig. 5). Across a range of task conditions, our penalized regression approach with an LSS prior outperformed both LSS (equivalent to a very large penalty) and LSA (equivalent to a penalty of zero) for intermediate penalty values. LSS provided an effective prior for our penalized regression. Replicating previous work, RMSE was lower for LSS than LSA, akin to less-is-more effects in which decision heuristics can outperform OLS (e.g., TTB in Fig. 3A).

General discussion
We looked toward human decision making to identify an effective prior for regularized regression and found that decision heuristics which disregard cue covariance information offer a number of advantages, such as robustness and interpretability. These heuristics offered a sensible starting point compared to the usual way of defining the prior in most ridge regression applications (i.e., as the zero vector).
Here we have presented three different types of applications across more than twenty datasets, germane to the fields of decision making, fMRI analysis, and statistical modeling. We have validated the utility of using heuristics like TAL and TTB to construct the prior, as well as using other algorithms which lack a normative foundation and parallel the operation of heuristics, like LSS in the case of fMRI time series modeling.
Three main benefits of no-covariance priors are worth highlighting. First, predictions using a no-covariance prior are likely to be at least as good as, if not better than, predictions from a vector of zeros as coefficients. Examples of the TTB-prior model outperforming other models are seen in Fig. 3A and in the lower RMSE values obtained in Application III.
Second, catastrophic failure of the model is avoided for extremely high penalty values, whereas in normal ridge regression, convergence to the zero vector for high penalty values results in essentially random guessing for the comparison and classification tasks presented here. Convergence toward very small weights may also create implementation issues on digital computers, which have limited precision. For example, differences in how floating point numbers are represented in supporting software libraries could reduce the reproducibility of results.
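The limiting behavior described here can be illustrated with a small sketch. It assumes the standard squared-error ridge objective with a nonzero prior, ||y - Xb||^2 + penalty * ||b - prior||^2, which has a closed-form minimizer; the function and argument names are hypothetical, and the models used for the comparison and classification tasks may rely on a different objective.

```python
import numpy as np

def ridge_with_prior(X, y, prior, penalty):
    """Minimize ||y - X b||^2 + penalty * ||b - prior||^2 in closed form.

    With penalty = 0 this is the OLS solution (when X'X is invertible);
    as penalty grows, the estimate is pulled toward `prior`, not toward zero.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior = np.asarray(prior, dtype=float)
    n_features = X.shape[1]
    lhs = X.T @ X + penalty * np.eye(n_features)
    rhs = X.T @ y + penalty * prior
    return np.linalg.solve(lhs, rhs)
```

Setting `prior` to a vector of zeros recovers ordinary ridge regression, whose estimate collapses toward zero for large penalties. With a heuristic-derived `prior`, the same large penalty returns approximately the heuristic's weights instead.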
Third, this class of priors has theoretical significance. On the one hand, the model class introduced here further integrates the notions of heuristic decision making and full information algorithms along one continuum of models (as in Parpart et al., 2018). Choosing heuristic priors that contrast compensatoriness of the environment, like TAL and TTB do, helps us interpret both the solutions of our models and the environment itself in an easier way than is possible with OLS or the Zero prior model. Likewise, the solution of the encompassing model could be understood in terms of deviations from the heuristic prior. Other informative comparisons could be made to the OLS solution, including how it diverges from the heuristic prior. Finally, our framework provides a way to simulate fMRI data with LSS weights, previously not possible due to the arbitrariness of defining weights for the LSS nuisance variables (see Application III).
The theoretical contribution of this model class is worth emphasizing since it also provides a lens on why heuristics are useful in the first place. The priors offered by heuristics confer robustness; unlike the Zero prior, they embody a sensible inductive bias. This dovetails with why heuristics can operate defeasibly. Speculatively, humans and other cognitive agents may have evolved to implement these priors as a rule. In the spirit of Occam's razor, humans also show a bias toward simple solutions for many decision-making tasks (Gigerenzer, Todd, & the ABC Research Group, 1999; Kahneman et al., 1974). With expertise (i.e., acquiring more data), the solutions can change (Hornsby & Love, 2014), but initially, very general strategies like assuming independence among covariates have been documented (Bröder, 2000; Gigerenzer & Goldstein, 1996). Of course, this is only one notion of expertise. Other notions could include less effort during inference or rule application, finding appropriate features of a domain, and ease of searching for or creating new strategies. Furthermore, experts are not even guaranteed to perform better than statistical techniques (cf. Meehl, 1954).
Instead of being all-or-none, heuristic use may move along a continuum (Newell, 2005) as a function of prior strength and experience. Indeed, heuristic use in human decision making is not without its caveats (Newell, Weston, & Shanks, 2003), nor is the supposed frugality of heuristics beyond question (Bobadilla-Suarez & Love, 2018; Dougherty, Franco-Watkins, & Thomas, 2008). What is clear is that no heuristic will be best in all environments (cf. no free lunch theorem). Instead, each heuristic is best suited to certain environments and can be seen as embodying a prior that reflects beliefs about the environment.
Of course, this non-universality raises the critical question of how one chooses which heuristic to use. This question closely mirrors the inductive challenge of choosing a prior for a Bayesian model. A general solution to choosing the best heuristic is computationally intractable (Rich et al., 2021), though effective solutions have been offered (Rieskamp & Otto, 2006; Scheibehenne, Rieskamp, & Wagenmakers, 2013). Intuitively, if one believed, for whatever reason, that an environment was governed by numerous additive factors, then a heuristic like TAL would be a good strategy to adopt. The problem of strategy selection closely relates to the problem of meta-learning, or learning to learn, in which one determines how to choose hyperparameters, architectures, general strategies, etc. that will perform well in a task (Schweighofer & Doya, 2003). With enough data, one can test which heuristic performs best on a sub-sample. However, in the low data regime this might not be possible. Our results show that the differences between TTB and TAL used as a prior may not be large, but future work should explore this angle.
With reference to models of human decision making, this class of algorithms has further potential. Referring back to the roots of regularized regression, Tikhonov (1943) initially constructed this type of regularization in a more general form, in which the scalar penalty is replaced with a matrix. This enables the implementation of different penalty values for different directions in weight space. Admittedly, choosing this matrix would require knowledge of the data. Our results suggest there might be some advantage in this kind of stepwise approach, where one model's output provides another model's prior. From a psychological point of view, this would enable modeling attention through the scaling of dimensions (Nosofsky, 1986). Although empirical studies show humans usually employ attention solely along individual dimensions (i.e., the diagonal of this matrix; Jones, Love, & Maddox, 2005; Kruschke, 1993), other applications (like our fMRI example) could benefit from this generality (Bobadilla-Suarez, Ahlheim, Mehrotra, Panos, & Love, 2020). Generalizations of such regression algorithms include adding a matrix that puts weights on observations themselves (van Wieringen, 2015) or even using heuristic regularizers for more complex models like neural networks. As in all modeling endeavors, the researcher should make clear how the model is intended to be interpreted (Jones & Love, 2011). For instance, a penalized regression approach could be proposed and evaluated as a normative account of what should be done, a high-level description of what people actually do, or an algorithmic account of the processes people engage in. We suggest further work on expertise (e.g., transitioning from novice to expert) could engage with any of the mentioned modeling strategies.
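For concreteness, one common way to write a matrix-valued penalty with a nonzero prior is sketched below. The notation is an assumption made for illustration (weights \beta, prior \beta_0, scalar penalty \theta, symmetric positive semi-definite penalty matrix \Gamma) and need not match the symbols used in the equations of this article.

```latex
\hat{\beta}
  = \arg\min_{\beta}\;
    \lVert y - X\beta \rVert_2^2
    + (\beta - \beta_0)^{\top} \Gamma \, (\beta - \beta_0),
\qquad
\hat{\beta}
  = \left( X^{\top} X + \Gamma \right)^{-1}
    \left( X^{\top} y + \Gamma \beta_0 \right).
```

Setting \Gamma = \theta I recovers the scalar-penalty case, a diagonal \Gamma applies a separate penalty to each weight, and off-diagonal entries penalize deviations from the prior along oblique directions in weight space.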
Furthermore, the models presented here provide only point estimates of the weights, but there is no obstacle to extending them to the Bayesian setting to obtain the full posterior distribution as well. In fact, it is well known that ridge regression, like LASSO regression, has a Bayesian interpretation (Friedman, Hastie, & Tibshirani, 2001; Parpart et al., 2018; Tibshirani, 1996). Our two-step approach engages in a double counting of the data (cf. Zou, 2006), which could suffer from bias and undue confidence in predictions. A Bayesian formulation could address this potential issue, providing new insights on why our two-step approach works and in which environments. This exciting future direction could expand the reach of our approach by placing it on a normative footing, enabling inquiry into the models' confidence.
In conclusion, we find that priors motivated by decision heuristics are valuable both methodologically and theoretically. Assuming independence among predictor variables offers a reasonable prior or starting point in most situations. These priors are themselves data-informed models that perform robustly when the penalty value (i.e., prior) is overly strong. Although ridge regression may not routinely suffer from extreme penalty values in practice, use of the TAL and TTB priors does not appear to have any significant downside and may be judged a more sensible choice, and perhaps more akin to how people learn, than ridge regression's null vector prior. Linking insights across fields as disparate as decision making and advanced methodologies for fMRI data analysis, we are confident that these robust priors for regularized regression will find even further utility in other fields, surpassing the theoretical contributions that we have hinted at here.
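The Bayesian reading mentioned above can be made explicit under standard assumptions. The following sketch assumes a Gaussian likelihood with noise variance \sigma^2 and an isotropic Gaussian prior with variance \tau^2 centered on the prior weights \beta_0; these symbols are illustrative and are not taken from the article's equations.

```latex
y \mid \beta \sim \mathcal{N}\!\left( X\beta,\; \sigma^2 I \right),
\qquad
\beta \sim \mathcal{N}\!\left( \beta_0,\; \tau^2 I \right)
\;\;\Longrightarrow\;\;
\beta \mid y \sim \mathcal{N}\!\left( m,\; V \right),
\\[4pt]
V = \left( \tfrac{1}{\sigma^2} X^{\top} X + \tfrac{1}{\tau^2} I \right)^{-1},
\qquad
m = V \left( \tfrac{1}{\sigma^2} X^{\top} y + \tfrac{1}{\tau^2} \beta_0 \right)
  = \arg\min_{\beta}\;
    \lVert y - X\beta \rVert_2^2
    + \tfrac{\sigma^2}{\tau^2} \lVert \beta - \beta_0 \rVert_2^2 .
```

Under these assumptions the penalty value plays the role of the ratio \sigma^2 / \tau^2, so a strong penalty corresponds to a tight prior around \beta_0, and the posterior covariance V supplies the uncertainty that a point estimate lacks. This sketch treats \beta_0 as fixed and therefore does not address the double counting that arises when \beta_0 is itself estimated from the same data.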

CRediT authorship contribution statement
Sebastian Bobadilla-Suarez: Coded all the analyses and simulations, derived the objective function for Equation 16 along with its Newton-Raphson estimate, Wrote the initial draft, Interpreted the results and provided critical comments on the manuscript. Matt Jones: Interpreted the results and provided critical comments on the manuscript. Bradley C. Love: Developed the study concept and derived the objective function for Equation 1, Interpreted the results and provided critical comments on the manuscript.