Can language representation models think in bets?

In recent years, transformer-based language representation models (LRMs) have achieved state-of-the-art results on difficult natural language understanding problems, such as question answering and text summarization. As these models are integrated into real-world applications, evaluating their ability to make rational decisions is an important research agenda, with practical ramifications. This article investigates LRMs’ rational decision-making ability through a carefully designed set of decision-making benchmarks and experiments. Inspired by classic work in cognitive science, we model the decision-making problem as a bet. We then investigate an LRM’s ability to choose outcomes that have optimal, or at minimum, positive expected gain. Through a robust body of experiments on four established LRMs, we show that a model is able to ‘think in bets’ if it is first fine-tuned on bet questions with an identical structure. Modifying the bet question’s structure, while still retaining its fundamental characteristics, decreases an LRM’s performance by more than 25%, on average, although absolute performance remains well above random. LRMs are also found to be more rational when selecting outcomes with non-negative expected gain, rather than optimal or strictly positive expected gain. Our results suggest that LRMs could potentially be applied to tasks that rely on cognitive decision-making skills, but that more research is necessary before these models can robustly make rational decisions.


Introduction
Transformer neural network-based language representation models (LRMs), such as the Bidirectional Encoder Representations from Transformers (BERT) [1] and the Generative Pre-Trained Transformer (GPT) series of models [2,3], have led to impressive advances in natural language understanding. They have significantly advanced state-of-the-art performance on a variety of natural language tasks, ranging from information extraction [4] and semantic role labeling [5], to text summarization [6], cross-lingual and multi-lingual understanding [7], and question answering [8]. Variants of such models [9,10,11,12] currently underlie the most successful systems on competition leaderboards hosted by the Allen Institute for Artificial Intelligence for several important benchmarks [13], [14]. Domain-specific versions of these models have also achieved impressive performance in their respective domains (e.g., scientific literature, patents and intellectual property, and biology). Representative examples include Patent-BERT [15], DistilBERT [16], BioBERT [17], DocBERT [18], K-BERT [19], and SciBERT [20]. More recently, they have also been applied in multi-modal settings involving both text, and visual modalities, such as video [21], [22].
Owing in part to the close connection between language and cognition [23,24], a growing body of research is seeking to deduce the cognitive abilities (or lack thereof) of LRMs [25], [26]. There are both theoretical and practical reasons for this interest. The latter is important because these models are continuing to be integrated into, or otherwise used for, real-world applications and architectures in multiple enterprises and domains [27,28,29,30]. The former is also important and may be attributed to the empirical success and rapid advancement of these models. Particularly, as these models continue to get larger, and are proving to be capable in ways that had not been initially conceived [31], there is rising interest in their fundamental properties, such as the dependence of their performance on size and number of parameters [32], their robustness [33], including susceptibility to various flavors of adversarial attacks [34], and the amount of knowledge 'encoded' into their learned representations [35]. This line of research is detailed further in Section 2, where we discuss related work.
Along these lines, it is unclear if LRMs can be trusted to make approximately rational decisions, even when the outcomes are defined but are uncertain. In both behavioral and decision science, a number of experiments over the years have sought to test this ability in humans [36,37,38,39]. Classic work by Kahneman and Tversky showed, for instance, that people tend to exhibit loss aversion (prospect theory), and are not completely rational [40], [41]. Despite this sensitivity to loss, however, prospect theory still predicts that an ordinary person would still choose gain over loss, if the gain substantially outweighs the loss in an equi-probable bet. Furthermore, people would not willingly choose to lose, if there was no possibility of gain attached to the choice. It is an open question whether LRMs take the same 'commonsense' approach to making decisions, or (in contrast) they are prone to making decisions that would be considered extremely irrational in humans, such as actively pursuing loss (or to a lesser extent, zero gain) when a clear alternative is available with minimal risk.
This article proposes a detailed set of research questions for empirically investigating the rational decision-making abilities of established transformer neural network-based LRMs. Inspired by the longstanding behavioral science research mentioned above, the problem of making rational decisions is framed as one of testing a given model's ability to think in bets and choose the outcome with the maximum expected gain. Specific contributions are enumerated below:
1. We propose a novel set of research questions for understanding LRMs' rational decision-making abilities. To the best of our knowledge, this is the first attempt to quantify LRMs' ability to make rational decisions by experimentally probing their capacity to 'think in bets.'
2. We construct and present a novel set of decision-making and preference elicitation benchmarks for empirically investigating our research questions in a robust manner. The benchmarks are designed to actively preempt common issues with such models, such as superficial pattern matching ('shortcut learning') and dataset bias, which has also been found to be severe in other deep learning applications, such as computer vision [42,43,44].
3. We present a detailed and replicable methodology underlying the experimental study for investigating the presented research questions. Our methodology aims to control for several important factors that could serve as explanations for subsequently observed findings. Where applicable, we present new metrics and evaluation protocols to address such issues. At least one of these metrics, called Belief Conditioned Accuracy (BCA), is a novel metric that is specifically designed to quantify an LRM's rationality on a decision-making problem, after conditioning on a theoretically important predictor.
4. We conduct an extensive experimental study investigating each of the proposed research questions. Each study is accompanied by a detailed set of statistics, and appropriate baselines.
We conduct the study on four transformer-based LRMs that are established models already being incorporated in several industrial products and services. As noted earlier, we also emphasize robustness (through multiple evaluation metrics and statistics) in the experimental design itself. Using multiple evaluation metrics, we find that our core conclusions are largely consistent with one another.
The rest of this article is structured as follows. Section 2 discusses relevant related work, while Section 3 enumerates the specific research questions that fall within the scope of this article. Section 4 details the materials and methods underlying the study, and Section 5 follows with the key results and findings for each of the research questions in Section 3. Section 6 contextualizes these results with a broader discussion. Section 7 provides some guidance on promising avenues for future research before concluding the article.

Related Work
This article is primarily influenced by two broad bodies of research: fundamental research on transformer-based LRMs, and experimental studies investigating the properties of these LRMs. The latter area is especially relevant to our goals. We discuss each of these in turn below, with specific focus on work that is best related to our research objectives. Additionally, our experimental methodology, and the manner of benchmark construction, are influenced significantly by decision-making experiments in the behavioral sciences [41,36,39].
However, these experiments were mainly conducted on humans (and at times, in domain-specific settings such as science [37]). This article seeks to follow similar principles in benchmark construction, but applies the benchmarks on LRMs. Because the article considers LRMs as the primary objects of study, we begin by describing some key models that are also employed in the empirical study in this article, followed by detailing other recent work on understanding the properties of these models, and transformer-based models more generally.

Transformer-based Language Representation Models (LRMs)
As we noted in the previous section, language representation models (LRMs), which are also called neural language models, have achieved great success on a variety of natural language understanding tasks over the last half decade. An early and influential LRM is Bidirectional Encoder Representations from Transformers (BERT), which uses a novel attention mechanism to obtain rich 'pretrained' representations of language from a large corpus of text. Pre-training is followed by task-specific 'fine-tuning' that allows it to get state-of-the-art performance (at the time) on specific tasks, such as question answering, without requiring full re-training from scratch. In experiments, BERT was found to obtain a score of 80.5% on the General Language Understanding Evaluation (GLUE) benchmark [45], an improvement of 7.7% over the previous best performing model. Similarly, it achieved an F1 score of 93.1% on the Stanford Question Answering Dataset (SQuAD) v1.1 benchmark [46] and 83.1% on SQuAD v2.0 [47], with improvements of 1.5% and 5.1% compared to the previous leading model, respectively. Moreover, the pre-trained BERT model was released publicly and is amenable to fine-tuning on other tasks. We draw on this capability in the proposed work.
A more mature version of BERT is the Robustly Optimized BERT Pretraining Approach (RoBERTa) [48]. RoBERTa is structurally the same as BERT.
However, RoBERTa improves the training process on some key fronts, such as a bigger batch size, longer sequences, and longer training. RoBERTa also removes the next sentence prediction objective and introduces a dynamic masking strategy. As a result, compared to BERT's published GLUE and SQuAD performance, RoBERTa shows significant improvements and obtains new state-of-the-art results on four of the GLUE tasks. Like BERT, the pre-trained RoBERTa model was released and is amenable to being fine-tuned.
A more advanced version is the Decoding-enhanced BERT with disentangled attention (DeBERTa) [9]. DeBERTa is structurally similar to BERT and RoBERTa, but introduces several novel techniques to improve performance even further. First, DeBERTa uses a disentangled attention mechanism, where two vectors are used to represent the content and relative position of each word, and correspondingly, disentangled matrices are used to compute the attention weights. Second, during pre-training, DeBERTa uses an enhanced mask decoder to incorporate absolute positions in the decoding layer when predicting the masked tokens. These novel methods improve DeBERTa's performance on benchmark tasks. Compared to the RoBERTa-Large model, DeBERTa improves on SQuAD v2.0 by 2.3%, on Multi-Genre Natural Language Inference (MNLI) [49] by 0.9%, and on Large-scale ReAding Comprehension Dataset From Examinations (RACE) [50] by 3.6%. The pre-trained DeBERTa is also publicly available and amenable to being fine-tuned.
Another version is the Transformers for Longer Sequences (BigBird) model [51]. BigBird deploys a sparse attention mechanism to reduce the quadratic dependency on sequence length to a linear dependency. The quadratic dependency on sequence length (in terms of memory) is one of the core limitations of transformer-based models, and is a consequence of the full attention mechanism. As a result of its novel sparse attention mechanism, BigBird can handle up to 8x longer sequences using similar hardware, while demonstrating impressive improvements on question answering benchmarks.
Although the four models above are cited as representative examples, LRMs have continued to advance and become ever larger even in the last two years.
Examples of recent LRMs include the Generative Pre-trained Transformer 3 (GPT-3) [3], the Language Models for Dialog Applications (LaMDA) [52], and the Scaling Language Modeling with Pathways (PaLM) [53]. However, due to their track record over the last five years, and their more manageable size, BERT-based models have tended to be incorporated in real-world systems, including within Google's own search engine [54]. Therefore, for the experimental studies in this article, we use the four models detailed earlier (BERT, RoBERTa, DeBERTa, and BigBird). Our benchmarks and methodology are applicable to other models that are capable of natural-language question answering, but we leave an investigation of the bigger models for future research.

Understanding the Properties of LRMs through Experimental Studies
Owing to the success of such LRMs, a recent line of work has emerged on understanding their properties using rigorous empirical methodologies, many inspired by research first conducted in the behavioral sciences. Prior work following BERT, for instance, has proposed approaches to better study the knowledge encoded within these deep transformer-based LRMs. Examples include 'fill-in-the-gap' probes for understanding the masked language model facility in LRMs [55,56], probing of other classifiers that take different BERT representations as their feature-inputs [57,58], deeper analysis of the self-attention weights in the LRMs [59,60], and even a check-list style approach for comprehensively evaluating the linguistic abilities of a BERT-based model [61]. Detailed evidence suggests that BERT-based models seem to be encoding a 'hierarchy' of linguistic features, with surface features at the bottom, syntactic features in the middle and semantic features at the top [62]. Using massive corpora of pre-training text, the model is able to learn such a hierarchy implicitly without requiring explicit training labels.
While some work has found that information can be recovered from BERT's token representation [63], the model still has trouble 'understanding' concepts that are relatively natural to humans, such as negation and basic numeracy [25].
Like many other machine learning models, the model can also be overly confident in some of its inputs, and is susceptible to problems of both generalization and adversarial attacks [34], [64], [65], [66], [43]. Furthermore, several experiments have demonstrated that, although BERT effectively encodes information about entity types, relations, semantic roles, as well as proto-roles, it can lose some of its robustness in the face of basic named entity replacements [26].
This article contributes to this line of work by specifically investigating preference elicitation and rational decision-making abilities of such LRMs. We address the latter by posing bet questions to an LRM, and assessing its ability to think probabilistically, in terms of expected gains or losses, both when it is exposed to example bets (and allowed to 'fine-tune' on them) and when it is not. To the best of our knowledge, this is the first study to propose such an investigation, although there has been recent work on applying LRMs to the sequential decision-making problem [67], which is reminiscent of planning (rather than behavioral decision-making, as studied herein). More recently, language models have also been playing a growing and prominent role in cognitive science research [68,69]. This work is intended to add to this growing body of research.

Research Questions (RQs)
This article proposes to investigate three specific research questions (RQs), the first of which investigates the preferences of neural language models when provided with pairs of items that are of 'high' or 'low' value, while the other two investigate their ability to make rational decisions under different assumptions and experimental conditions.
1. RQ1 (Preference Elicitation): Can LRMs be trained to prefer a high-value item (e.g., a diamond) over a low-value item (e.g., a plastic pen), where value is understood in commonsense economic terms?
2. RQ2 (Thinking in Bets Without Task-Specific Fine-tuning): Are LRMs able to rationally bet on outcomes with higher expected gain without first being fine-tuned on such bet questions?
3. RQ3 (Thinking in Bets With Task-Specific Fine-tuning): Are LRMs able to rationally bet on outcomes with higher expected gain after being fine-tuned on such bet questions?
While the first question is not (in itself) central to the goals of this work, it is an important prerequisite for investigating the other two questions. This is because the concept of rationality is linked to an agent's belief, or expectation, about value. For example, if an agent believes that an object A is more valuable than B, then given a bet with equi-probable outcomes (such as a coin toss), it is rational for the agent to 'bet' on the outcome with A as the prize. In other words, the agent's preference of A over B influences our judgment of whether it subsequently makes a rational bet or not. Hence, understanding whether (and under what conditions) the language model's preferences align with our own is an important research question to address prior to investigating the model's ability to bet on rational outcomes. In exploring this question, we also compare the trained model's ability to that of a 'default' LRM that has not been trained specifically for preference elicitation (but that still performs well on general commonsense question answering tasks) to quantify the effect of training itself on preference elicitation.
The second question directly considers the issue of whether LRMs are able to think in bets without first being fine-tuned on bet questions. As discussed subsequently, we refer to this as 'task-specific fine-tuning' wherein a model has been fine-tuned on a training set of questions that mirrors the purpose of the RQ, compared with the 'default' version, as mentioned above. However, we also investigate whether an LRM that has been fine-tuned for the preference elicitation task (RQ1) and performs well on it, is able to 'naturally' think in bets, even without first being fine-tuned on bet questions.
The third question investigates whether the model is able to think more effectively in bets, both in absolute terms and relative to RQ2, once it undergoes such task-specific fine-tuning. Although not stated directly in the question itself, we also investigate the LRMs' generalization on bet questions that are structurally similar but have a different surface form compared to the questions that were used for task-specific fine-tuning. By structurally similar, we mean that the mathematical form of the bet, including the number of outcomes, and the probabilities associated with the outcomes, remains the same. An example would be a pair of bets, one of which involves tossing a coin, and the other of which involves randomly picking a card (from the standard 52-card deck). An outcome is then associated with whether the coin comes up heads or tails (for the former), or whether the randomly picked card is black or red (for the latter).
Assuming that the bet-wager and outcomes are identical in the two cases, the bets described above are structurally similar, but with different natural language descriptions associated with them. The third question attempts to quantify whether, and to what extent, the LRMs' decision-making ability erodes when the surface form of the question changes to one that it has not seen during fine-tuning.
As noted in Section 2, our research questions are heavily inspired by similar experiments in decision science and psychology, many of a classic nature [40]; namely, we seek to understand a language model's rationality, and its ability to seek outcomes with maximum expected gain, by studying its preferences on a carefully designed set of prompts. It is important to design the prompts to control for a range of problems that are known to occur with such language models, including their sensitivity to format [70], and their propensity to achieve high performance through advanced, but ultimately superficial, statistical pattern matching [42]. Hence, the experimental design, benchmark construction, and evaluation methodology are critical elements of the study, and are extensively detailed in the following section.

Language Representation Models (LRMs)
As discussed in Section 2, many of the recent advances in language representation models (LRMs) are based on transformer neural networks [71]. In some instances in the literature, these are referred to as language representation learning models, or even neural language models. We adopt the uniform terminology of language representation models in this article, with the understanding that we are primarily interested in the recent neural models.
LRMs, such as BERT [1] and the GPT [2] series of models, have been found to generalize on an impressive range of language understanding tasks, including machine translation and question answering [72,73,74,75]. In the remainder of this article, we uniformly use the term LRM to refer to the models that are used to answer 'multiple choice' prompts or questions by selecting one answer from a set of candidate choices.
To be applied to specific NLP problems, these models, which are pre-trained on a large corpus of text before they are publicly released, are typically also fine-tuned on an additional smaller data set to optimize them for the task at hand. For example, if BERT were to be applied to the problem of Named Entity Recognition or NER (automatically extracting named entities, such as people, places and organizations, from text), the pre-trained version would have to be fine-tuned on a 'training' set of clearly defined NER inputs and outputs. Fine-tuning takes a much smaller amount of time compared to pre-training. This makes pre-trained LRMs a powerful asset in the NLP literature because they can be used as a 'base' model for a wide range of tasks and data sets, a facility that we rely upon for the decision-making experiments herein.
Owing to its training on a large body of text, the pre-trained model can be fine-tuned to 'score' a natural language sentence based on its likelihood of being a plausibly constructed sentence. The higher the score, the more plausible the sentence. Impressively, the score correlates not just with real-world syntactic usage but also plausible semantics, depending on the background corpus on which the model was pre-trained. For instance, if the model was pre-trained on a general corpus, such as Wikipedia or Google Books, nonsensical sentences would tend to be given much lower scores by the model. However, in some cases, the pre-training corpus is domain-specific, such as with the BioBERT pre-trained model [17] or social media-based pre-training [76]. Such models will normatively assign higher scores to biological and social media sentences, respectively. We only use models in this article that were pre-trained on general corpora, such as Wikipedia and news articles.
Although there are many viable transformer models (and their variants) available at the time of writing, we selected four models for our studies, first introduced and discussed in Section 2: BERT [1], RoBERTa [48], DeBERTa [9] and BigBird [51]. We emphasize again that RoBERTa is fundamentally similar to BERT, but is often treated separately because of its (much) higher performance over the original BERT release owing to its robust optimization, and other important engineering innovations [48]. Our rationale for selecting these four LRMs is that they are established models that have been rolled out in a range of commercial and outward-facing products, including the Google search engine [54] and Amazon Web Services [77]. Many technical and domain-specific variants of these models have also been developed and deployed, including SciBERT [78], BioBERT [17] and AlBERTa [79].
An important commonality between these four models is that they all have pre-trained versions publicly available, but can also be fine-tuned on additional question answering data sets. Since the training data set used for fine-tuning depends on the experiment and research hypothesis, we specify the data set used for fine-tuning when discussing the experimental methodology for the corresponding research hypotheses. Next, we describe the specific manner in which each of these LRMs can be applied to the Multiple Choice Question Answering (MCQA) problem, which is of central interest in this article.

Multiple-Choice Question Answering (MCQA) using LRMs
In this section, we introduce some basic formalism on MCQA instances, and on the specific methodology that we use to obtain an LRM's prediction for a given instance. An MCQA instance formally consists of two elements: a 'question' prompt $q$ and a set $C$ of $n$ 'answer' choices $\{c_1, c_2, \ldots, c_n\}$. We assume, without loss of generality, that each of the choices and $q$ is represented as a string. Furthermore, it is usually assumed that exactly one of the choices in $C$ is 'correct'. Given a set of MCQA instances (referred to as an MCQA benchmark), the goal of a question-answering system, such as an appropriately fine-tuned LRM, is to predict the correct choice for the prompt.
One approach by which an LRM can be made to answer an MCQA instance is as follows. First, the question prompt $q$ is concatenated with each of the choices $c_i$ in turn. This yields $n$ question-answer pairs, where a pair $p_i = \mathrm{concatenate}(q, c_i)$. Next, each of the $n$ pairs is fed into the model in turn (i.e., independent of one another) during the fine-tuning phase, when the correct answer can be revealed to the model. Specifically, if the $c_i$ used to form $p_i$ is the correct choice, $p_i$ is labeled as 1 (otherwise it is labeled as 0). Given such a 'training' set, the model is fine-tuned to minimize the cross-entropy loss as is standard for MCQA problems.
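The pair construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_labeled_pairs` is hypothetical, and joining the prompt and the choice with a single space is an assumption (the text does not specify the separator, and a real tokenizer would typically insert its own special tokens).

```python
from typing import List, Tuple


def build_labeled_pairs(
    question: str, choices: List[str], correct_index: int
) -> List[Tuple[str, int]]:
    """Form p_i = concatenate(q, c_i) and label it 1 if c_i is correct, else 0."""
    if not 0 <= correct_index < len(choices):
        raise ValueError("correct_index must point at one of the choices")
    # Assumption: a single space joins the prompt and the choice.
    return [
        (f"{question} {choice}", int(i == correct_index))
        for i, choice in enumerate(choices)
    ]


pairs = build_labeled_pairs(
    "This statement is true:",
    [
        "airplane is more expensive than pen",
        "pen is more expensive than airplane",
        "airplane and pen have the same value",
    ],
    correct_index=0,
)
for text, label in pairs:
    print(label, text)
```

Each resulting pair is an independent training example, which is what allows the model to be scored on one choice at a time.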
The fine-tuned model can then be evaluated using a similar methodology on unseen MCQA instances, for which it needs to predict the correct answers. First, we convert such a 'test' MCQA instance to a similar input structure, as used during fine-tuning, by concatenating each choice $c_i$ to the question prompt $q$ (to obtain pair $p_i$). Next, the model is provided with each $p_i$ independently, and outputs a score for each such pair. The score is assumed to be proportional to the model's belief in that pair being labeled as 1 in the underlying ground-truth.
Because the model's score is not necessarily normalized, we use the sigmoid function to normalize each score to the range [0,1]. Although just the highest-scoring choice could be selected as the model's prediction, there is an alternative mechanism available for selecting (and evaluating) the predictions in a decision-making context, as later discussed. For this reason, it is more appropriate to say that, depending on the specific experiment and research question, a predicting function is applied to the $n$ normalized scores (corresponding to the $n$ choices in an MCQA instance) to yield the model's prediction for that instance.
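As a rough sketch of the normalization step, the snippet below applies the logistic sigmoid to each raw score independently. The raw scores are made-up values for illustration, not outputs of any of the four LRMs, and the helper names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a real-valued score into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_scores(raw_scores: List[float]) -> List[float]:
    """Normalize each choice's score on its own; the results need not sum to 1."""
    return [sigmoid(s) for s in raw_scores]


raw = [2.3, -1.1, 0.0]  # hypothetical unnormalized scores for three choices
normalized = normalize_scores(raw)
print(normalized)
```

Because each score is normalized independently (unlike a softmax over the choices), several choices can receive a high normalized score at the same time, which is what makes a threshold-based predicting function meaningful.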
As a concrete example, consider the MCQA instance in Table 1. The fine-tuned LRM would be given each question-answer pair in turn. If the predicting function is simply to select the choice for which the model outputs the highest score, and this choice happens to be 'This statement is true: airplane is more expensive than pen', then the model would (correctly) select the choice 'airplane is more expensive than pen' as its prediction, given the prompt 'This statement is true:'.

Prompt | Choice | Question-Answer Pair
This statement is true: | airplane is more expensive than pen | This statement is true: airplane is more expensive than pen
This statement is true: | pen is more expensive than airplane | This statement is true: pen is more expensive than airplane
This statement is true: | airplane and pen have the same value | This statement is true: airplane and pen have the same value

Table 1: An example of a Multiple-Choice Question Answering (MCQA) instance (prompt and choice-set) and the concatenated pairing of the choices in turn that is used to evaluate the LRMs in this paper.
We place no constraints at present on the predicting function: it may select zero, one, or more than one choice as the prediction for an MCQA instance.
In the next section, we describe two plausible choices for the function, one of which is to just select the highest-scoring choice.
An alternative approach to fine-tuning a model for question answering is to concatenate the question prompt $q$ with all of the choices $(c_1, c_2, \ldots, c_n)$ together, along with a separator between each of the choices, i.e., $(1, 2, \ldots, n)$.
This yields a single 'complete' multiple choice question, which is denoted as $mcq = \mathrm{concatenate}(q, 1, c_1, 2, c_2, \ldots, n, c_n)$. Instead of 1 or 0, the expected output (or label) of the model should be the string of the correct choice, denoted as $c_T$. The actual fine-tuning process is similar in that the model generates string output given the $mcq$, and the cross-entropy loss between $c_T$ and this string output is used in the optimization. However, while this can work well for 'generative' question answering problems, models such as BERT are better suited for discriminative problems (of which the MCQA problem is one) and typically adopt the first approach. While language models, such as UnifiedQA, have also been applied to generative QA problems [12], we leave a generative evaluation of a model's decision-making ability for future work and assume the first or 'discriminative' approach in the rest of the paper.
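For contrast with the discriminative setup, the 'complete' question used by the generative alternative could be assembled roughly as below. The function name `build_mcq` and the use of a single space between elements are assumptions made for illustration; the text only states that the numbers 1 to n act as separators.

```python
from typing import List


def build_mcq(question: str, choices: List[str], joiner: str = " ") -> str:
    """Form mcq = concatenate(q, 1, c_1, 2, c_2, ..., n, c_n) as a single string."""
    parts = [question]
    for number, choice in enumerate(choices, start=1):
        parts.append(str(number))  # the numeric separator preceding each choice
        parts.append(choice)
    return joiner.join(parts)  # assumption: elements are joined by one space


choices = [
    "airplane is more expensive than pen",
    "pen is more expensive than airplane",
    "airplane and pen have the same value",
]
mcq = build_mcq("This statement is true:", choices)
target = choices[0]  # c_T: the string of the correct choice is the expected output
print(mcq)
print(target)
```

Here the label is a string rather than 0 or 1, which is why this formulation suits generative models and is set aside in the rest of the paper.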

Predicting Functions: Standard Method and Threshold Method
An obvious choice for the predicting function that we had mentioned briefly earlier, applicable when exactly one 'correct' prediction is desired from the model, is to select the choice $c_i$ corresponding to the highest normalized score.
We call this function the standard method, as it is the method favored in much of the QA literature where exactly one choice is correct and all other choices are incorrect. An example presenting the method in action was earlier presented in the context of Table 1.
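A minimal sketch of the standard method is given below. The scores are hypothetical, and breaking ties in favor of the lowest index is an assumption, since the text does not say how ties are handled.

```python
from typing import Sequence


def standard_method(normalized_scores: Sequence[float]) -> int:
    """Return the index of the choice with the highest normalized score."""
    if not normalized_scores:
        raise ValueError("at least one choice is required")
    # max() returns the first maximal element, so ties go to the lowest index
    # (an assumption made for this sketch).
    return max(range(len(normalized_scores)), key=normalized_scores.__getitem__)


scores = [0.91, 0.12, 0.40]  # hypothetical normalized scores for Table 1's choices
print(standard_method(scores))
```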
However, in the decision-making benchmarks that are considered in this article, the assumptions about correctness are more nuanced. For example, while our benchmark construction (subsequently described in Section 4.3) always guarantees that there is a single 'optimal' answer, it is not always the case that all other answers are equally sub-optimal. Some choices provided with a bet question may be associated with positive expected gain (even though they may not be associated with the highest positive expected gain, which would be the case for the optimal answer) while others may be associated with zero, or even negative, expected gain. An important empirical objective in this paper is to determine whether the model is able to understand these differences, especially when it selects sub-optimal choices.
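To make the distinction between optimal, positive, zero and negative expected gain concrete, the sketch below computes the expected gain of each choice in a made-up equi-probable bet. The payoffs, choice names, and helper functions are all hypothetical and are not drawn from the benchmarks in this article.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

Outcome = Tuple[Fraction, int]  # (probability, payoff)


def expected_gain(outcomes: List[Outcome]) -> Fraction:
    """Expected gain of a choice: the sum of probability * payoff over its outcomes."""
    if sum(p for p, _ in outcomes) != 1:
        raise ValueError("outcome probabilities must sum to 1")
    return sum((p * payoff for p, payoff in outcomes), Fraction(0))


def categorize(gains: Dict[str, Fraction]) -> Dict[str, str]:
    """Label each choice by how its expected gain compares with the best choice and zero."""
    best = max(gains.values())
    labels = {}
    for name, gain in gains.items():
        if gain == best:
            labels[name] = "optimal"
        elif gain > 0:
            labels[name] = "positive but sub-optimal"
        elif gain == 0:
            labels[name] = "zero"
        else:
            labels[name] = "negative"
    return labels


half = Fraction(1, 2)
bet = {  # hypothetical payoffs for a fair coin toss
    "choice A": [(half, 10), (half, -4)],
    "choice B": [(half, 6), (half, -4)],
    "do not bet": [(Fraction(1), 0)],
    "choice D": [(half, 2), (half, -8)],
}
gains = {name: expected_gain(outcomes) for name, outcomes in bet.items()}
labels = categorize(gains)
for name in bet:
    print(name, gains[name], labels[name])
```

Exact fractions are used so that comparisons with zero are not affected by floating-point rounding.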
Hence, to more holistically evaluate the model's decision-making ability, we also consider a second predicting function called the threshold method. As the name suggests, instead of simply selecting the choice with the highest score, this method involves selecting all choices (as predictions) that lie above a threshold. In principle, this methodology is similar to that employed in a wide range of practical machine learning applications that require the careful selection of a threshold in order to optimize a non-trivial quality metric (such as F1-score) on a multi-label problem [80]. Similarly, when describing the benchmark construction in Section 4.3, we systematically consider how to evaluate the quality of a model when using this method, but for now, it suffices to say that there is more than one reasonable way to construct a 'ground truth' against which to evaluate an LRM's (multi-label) predictions when using the threshold method. We define such a 'binary' ground-truth as stating which of the $2^n$ possible power-set predictions should be considered as correct or incorrect, on the basis of which an accuracy metric can always be computed for a model. For example, one choice of ground-truth might consider as correct any combination of choices (that a model selects, using the threshold method) that yields positive expected gain, while another (less conservative) ground-truth may only decide to test for consistency (i.e., it may only penalize a set of selected choices that are directly contradictory, such as 'bet on heads' and 'do not bet'). The rationale and specific rules governing ground-truth construction for a corresponding benchmark will be provided in Section 4.3.
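The threshold method and one possible binary ground-truth can be sketched as follows. The reading of 'positive expected gain' used here (a non-empty selection in which every selected choice has strictly positive expected gain) is only one plausible interpretation and is an assumption of this sketch; the actual rules are those given in Section 4.3. All scores and gains are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

Judge = Callable[[FrozenSet[int]], bool]


def threshold_method(scores: Sequence[float], threshold: float) -> FrozenSet[int]:
    """Select every choice whose normalized score lies strictly above the threshold."""
    return frozenset(i for i, s in enumerate(scores) if s > threshold)


def positive_gain_ground_truth(gains: Sequence[float]) -> Judge:
    """Build a judge over the 2^n possible selections (assumed interpretation)."""
    def is_correct(selected: FrozenSet[int]) -> bool:
        return len(selected) > 0 and all(gains[i] > 0 for i in selected)
    return is_correct


def accuracy(predictions: List[FrozenSet[int]], judges: List[Judge]) -> float:
    """Fraction of instances whose selected set is judged correct."""
    return sum(judge(pred) for pred, judge in zip(predictions, judges)) / len(predictions)


scores = [0.91, 0.62, 0.40, 0.05]  # hypothetical normalized scores
gains = [3, 1, 0, -3]              # hypothetical expected gains per choice
judge = positive_gain_ground_truth(gains)

strict = threshold_method(scores, 0.5)   # selects choices 0 and 1
lenient = threshold_method(scores, 0.3)  # also selects the zero-gain choice 2
print(sorted(strict), judge(strict))
print(sorted(lenient), judge(lenient))
print(accuracy([strict, lenient], [judge, judge]))
```

Swapping in a different judge (for example, one that only penalizes contradictory selections) changes the accuracy without changing the model's scores, which is why the threshold has to be tuned per ground-truth.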
Because there is more than one way to judge the quality of such multi-label predictions, an 'optimal' choice for this threshold is not a fixed value, and depends not only on the manner (including the choice of ground-truth) in which the model's performance is being judged, but also on the benchmark itself. Furthermore, a threshold that works well for one model (and for a given quality metric) may be sub-optimal under a different experimental condition. For all of these reasons, instances in each of our benchmarks are always partitioned into a train, development and test set. Where applicable, the train set is used for fine-tuning, and without exception, the test set is always used for evaluating all models under a given experimental condition to ensure fair comparison.
Similarly, when using the threshold method as a predicting function, the development set is always used to determine the 'optimal' threshold, given an LRM, and choice of ground-truth against which the LRM's predictions will be judged.
To discover such a threshold, which is technically a hyperparameter that takes values in [0,1], we first do a simple grid search in that range using increments of 0.01. In the event that more than one threshold value achieves the maximum performance on the development set, we select the median of the values that achieve this maximum as the expected optimal threshold for that experiment.
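The grid search and tie-breaking rule described above can be sketched as follows. This is an illustrative sketch rather than our exact implementation: `dev_accuracy` is a hypothetical callable that returns the development-set accuracy for a given threshold, and the behavior of `statistics.median` on an even number of tied values (averaging the two middle values) is an assumption, since the text does not specify that case.

```python
from statistics import median
from typing import Callable


def select_threshold(dev_accuracy: Callable[[float], float]) -> float:
    """Grid-search thresholds in [0, 1] in steps of 0.01 and return the median
    of all thresholds that achieve the maximum development-set accuracy."""
    grid = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
    accuracies = [dev_accuracy(t) for t in grid]
    best = max(accuracies)
    tied = [t for t, acc in zip(grid, accuracies) if acc == best]
    return median(tied)


# Toy (hypothetical) accuracy curve: maximal for thresholds 0.30 to 0.40.
def toy_accuracy(t: float) -> float:
    return 1.0 if 30 <= round(t * 100) <= 40 else 0.5


print(select_threshold(toy_accuracy))  # 0.35
```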

Fine-Tuning LRMs for MCQA
As described earlier, the pre-trained versions of the LRMs need to be finetuned on appropriate 'training' sets before they can be applied on tasks with a specific structure. MCQA is an example of such a task (and is the primary focus of this article) but other common examples in the NLP literature include named entity recognition [81] and information extraction [4]. The four pretrained LRMs that we fine-tuned for the MCQA experiments in this paper are BERT, RoBERTa, DeBERTa and BigBird, all of which were introduced earlier.
Note that these LRMs have variants in the HuggingFace repository that we used for accessing and fine-tuning the models. The specific variants that we used are BERT BASE [82], RoBERTa BASE [83], DeBERTa BASE [84], and [...]
Although we could directly fine-tune each of these pre-trained models on the train set of the MCQA benchmark that we construct, one of our empirical goals is to understand whether such models, if fine-tuned on a 'general purpose' MCQA benchmark, are able to exhibit reasonable decision-making ability as a natural consequence of such fine-tuning. One such benchmark that is widely used in the community is the Situations With Adversarial Generations (SWAG) data set [73]. SWAG is a commonsense benchmark that contains MCQA instances on grounded commonsense inference and physically grounded reasoning.
By fine-tuning each of the four LRMs on SWAG, we can test whether good performance on such commonsense tasks necessarily entails good performance on decision-making and preference elicitation problems expressed using everyday language and objects. Since these SWAG-based fine-tuned models form a natural basis of comparison to models that are further fine-tuned to handle our decision-making benchmarks, we refer to them as the default models. For example, the default BERT model is used to refer to the BERT model that has been fine-tuned on the training partition of the SWAG benchmark.
For the fine-tuning itself, we use a batch size of eight and fine-tune each of the four models for three epochs (and a total of 27,500 steps) each on the 73,546 MCQA instances in the SWAG training set, using a learning rate of 5e-5. Following the fine-tuning, we verified that, on the SWAG validation set, the accuracy of the default BERT, RoBERTa, DeBERTa and BigBird model is 77%, 79%, 85% and 81%, respectively. These results are consistent with previously published results and confirm that the models are indeed able to achieve good performance after fine-tuning. Finally, we uniformly use a batch size of 32 and a learning rate of 5e-5 for any other (i.e., non-SWAG) fine-tuning, described below, in our experiments.
Because the default model can be further fine-tuned on another MCQA benchmark with a similar structure, we use it as the 'initialization' for fine-tuning an LRM on the train set of our own benchmark (which depends on the specific research question). We mnemonically refer to such a model as a task-specific fine-tuned LRM. Since each of our benchmarks is always partitioned into train, development and test sets, we use the train set for the actual fine-tuning, and we stop the fine-tuning once the model has achieved an accuracy of 90% on the development set. A key point that we emphasize here is that, unlike the default model (of which there is only one unique model per LRM, since it is never fine-tuned on a decision-making benchmark), the task-specific fine-tuned model depends on the benchmark that was used for further fine-tuning. Hence, there is a unique task-specific fine-tuned model per LRM and benchmark. Finally, when evaluating a (default or task-specific fine-tuned) model using the threshold method, the development set is 're-used' for determining the expected optimal threshold for that model using the grid-search procedure described earlier.
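The stopping rule for task-specific fine-tuning can be sketched as a simple loop. The sketch below is only illustrative: `train_step` and `dev_accuracy` are hypothetical callables standing in for one optimizer step and one development-set evaluation, and the evaluation interval (`eval_every`) and the step cap (`max_steps`) are assumptions that are not specified in the text.

```python
from typing import Callable


def fine_tune_until(train_step: Callable[[], None],
                    dev_accuracy: Callable[[], float],
                    target: float = 0.90,
                    eval_every: int = 100,
                    max_steps: int = 10_000) -> int:
    """Run training steps until development-set accuracy reaches the target.

    Returns the number of steps taken (max_steps if the target is never reached).
    """
    for step in range(1, max_steps + 1):
        train_step()
        # Evaluate only periodically, and stop as soon as the target is met.
        if step % eval_every == 0 and dev_accuracy() >= target:
            return step
    return max_steps


# Toy (hypothetical) model whose accuracy grows linearly with training steps.
state = {"steps": 0}


def toy_train_step() -> None:
    state["steps"] += 1


def toy_dev_accuracy() -> float:
    return min(1.0, state["steps"] / 500)


print(fine_tune_until(toy_train_step, toy_dev_accuracy))  # 500
```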

High-Value and Low-Value Sets of Items
We manually create two sets of 'high-value' and 'low-value' items to facilitate our experiments. These items are tabulated in Table 2. We created these sets with the intent that any regular person would be able to distinguish these items fairly easily, especially if asked to do so in terms of (difference in value in) dollar amounts. We recognize that 'value' can be understood in different ways, and can even be contextually dependent e.g., a 'low-value' item can always become high-value in the right set of circumstances, and vice versa. We adopt a commonsense, everyday view here, with value best thought of in economic terms. This is a standard premise in the decision-making literature, going back to the pioneering behavioral psychology experiments devised by Kahneman and Tversky [86]. Also, it is the distinction (or relative difference) between high and low value items that we control for in our experiments, not the absolute value of these items. Furthermore, in our experimental design, we take into account the potential concern that an LRM may not understand 'value' in quite the way described above. We present questions using several different templates, including making the economic aspect of value explicit in one of these templates.
This construction is discussed further in the next section.

Value Questions
Given the high-value items and low-value items described in the previous section, RQ1 seeks to determine whether an LRM prefers (at least, on average) a high-value item over a low-value item. One manner in which we can do this is by prompting an LRM to choose the more 'valuable' item from a given pair of items (one of which could be high-value and the other of which could be low-value). More precisely, we could design a template, defined as a partial instance with placeholders that is converted into an actual instance by being appropriately instantiated.
These four templates are listed in Table 3, along with an example instantiation using a high-value and low-value item combination from [...] depending on the template, that 'h' is more valuable or expensive than 'l').
Note that accuracy, measured using this ground-truth, can theoretically be different from the accuracy measured using the standard method. One reason (although not the only one) is that the maximum-scoring answer may have a score that is below the (empirically determined) threshold, in which case no answer is selected and the empty set is returned. Note that, because the three ground-truths successively rely on weaker notions of what counts as correct, the expected performance of a system that randomly selects from the eight possible sets correspondingly increases. Specifically, across each of the test sets used for evaluating RQ1 (regardless of template), expected random performance using the Normal, Weak Normal and Weak ground-truth is 12.5%, 25%, and 62.5%, respectively. Expected random performance using the standard method is 33%, which is (as expected) different from the Normal ground-truth random performance, for the reason discussed earlier.

Bet Questions
While RQ1 investigates LRMs' preference for high-value items over low-value items, RQ2 aims to go a step further by evaluating whether LRMs are able to use this knowledge, whether directly or indirectly, to think in bets. Similarly, RQ3 aims to evaluate whether the models, if fine-tuned on bet questions, are able to successfully think in bets that they have not seen before.
Both RQ2 and RQ3 require presenting to the model a bet question, with different uncertain outcomes (each of which is associated with an expected gain), to investigate the degree to which the model is able to make a rational bet. Although both research questions evaluate the model on bet questions presented during testing, the key difference between them, discussed subsequently, is that RQ2 does not use bet questions for fine-tuning, while RQ3 does.
We design the bet questions using a similar template-based methodology as the value questions. These templates contain placeholders for high- and low-value items (that can be instantiated using the items in Table 2), but to expand and diversify the manner in which bets are presented to the model, we constructed questions using different bet modalities, mnemonically designated as Coin, Card and Dice, and demonstrated in Table 4. Importantly, all three modalities involve equi-probable outcomes and are expressed using natural (rather than mathematical) language. We refer to an instantiated bet template as a bet question.
[...] problematic for some language models, and that it may need special treatment [88,89]. To control for this, we rely on a construction that assumes (approximately) balanced expectation, i.e., the wager is an amount that lies roughly between the high- and low-value items, but is skewed slightly toward the lower end to allow for some outcomes to have positive expected gain associated with them. More technically, if we denote the value of a prototypical high- and low-value item as H and L respectively, the wagered amount X is assumed to obey the inequality L <= X <= 0.5 * (H − L).

With the above assumption in place, we calculate expected gain for the different bet questions (reproduced in the Appendix), with ground-truths constructed accordingly. In all cases, this construction guarantees that there is always (exactly) one optimal-expectation outcome. However, depending on the question, there may be outcomes that have positive expected gain, but are not necessarily optimal. Furthermore, because we allow the model to choose not to bet, the optimal-expectation outcome is never negative; however, in some cases, not betting is optimal, with zero gain, since all other outcomes are associated with negative expected gain. As discussed in the next section, we evaluate the model under several different scenarios and metrics to gain a more comprehensive and robust understanding of its decision making, rather than always expecting it to choose the one outcome that is strictly optimal.
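To make the notions of optimal, positive and non-negative expected gain concrete, the following Python sketch computes the expected gain of each choice when all outcomes are equi-probable. The payoff structure in the example (paying a wager X to play, receiving an item of value H or L on a win and nothing otherwise) and the dollar amounts are purely hypothetical assumptions chosen for illustration; they do not reproduce the templates in Table 4 or the calculations in the Appendix.

```python
from typing import Dict, List


def expected_gain(payoffs: List[float]) -> float:
    """Expected gain of a choice whose outcomes are equi-probable."""
    return sum(payoffs) / len(payoffs)


def analyse_bet(choices: Dict[str, List[float]]) -> Dict[str, object]:
    """Compute the expected gain of every choice and the sets of interest."""
    gains = {name: expected_gain(p) for name, p in choices.items()}
    return {
        "gains": gains,
        "optimal": max(gains, key=gains.get),
        "positive": {c for c, g in gains.items() if g > 0},
        "non_negative": {c for c, g in gains.items() if g >= 0},
    }


# Hypothetical values that satisfy L <= X <= 0.5 * (H - L).
H, L, X = 1000.0, 10.0, 50.0
result = analyse_bet({
    "bet on heads": [H - X, -X],  # assumed: win the high-value item on heads
    "bet on tails": [L - X, -X],  # assumed: win the low-value item on tails
    "do not bet": [0.0],          # not betting always has zero gain
})
print(result["gains"])         # heads: 450.0, tails: -45.0, no bet: 0.0
print(result["optimal"])       # bet on heads
print(result["non_negative"])  # contains 'bet on heads' and 'do not bet'
```

In this toy example, the set of choices with non-negative expected gain is strictly larger than the set with positive expected gain, because not betting has zero gain.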
Finally, we instantiate and design the templates in a way that minimizes the possibility of model-overfitting due to (potential) superficial pattern matching.
One way in which we do so is by flipping the position of 'win' and 'lose' in the templates shown in Table 4 to avoid a fixed outcome always being the optimal (or even positive expected gain) choice. Formally, given m and n high- and low-value test items, respectively, the total number of Coin-modality test questions would be 2 × 2 × m × n = 4mn, and similarly for the other modalities. While [...]
[...] If the optimal choice is not in the set, or choices besides the optimal choice are in the set, the model's prediction is deemed as incorrect.
For reasons similar to those stated earlier, the accuracy measured using the 'Strict' ground-truth can differ from the accuracy measured using the standard method. Similar to the RQ1 benchmark, because the three ground-truths successively take a broader view of what predictions count as correct, the expected performance of a system that randomly selects from the eight sets correspondingly increases. Specifically, the expected random performance using the Strict, Positive Gain and Non-Negative Gain ground-truths is 12.5%, 25% and 25%, respectively. Expected random performance using the standard method is still 33%.
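The expected random performance under the Strict ground-truth can be verified by enumerating the power set of the three answer choices. The sketch below assumes a system that selects uniformly at random from the eight possible sets; the choice labels are hypothetical, and only the Strict rule (the prediction must contain the optimal choice and nothing else) is encoded.

```python
from itertools import combinations
from typing import FrozenSet, List


def power_set(choices: List[str]) -> List[FrozenSet[str]]:
    """Enumerate every subset of the answer choices, including the empty set."""
    return [frozenset(c)
            for r in range(len(choices) + 1)
            for c in combinations(choices, r)]


choices = ["outcome A", "outcome B", "do not bet"]  # hypothetical labels
optimal = "outcome A"                               # assumed optimal choice

subsets = power_set(choices)
# Strict: only the set containing exactly the optimal choice counts as correct.
strict_correct = [s for s in subsets if s == frozenset({optimal})]

print(len(subsets))                        # 8
print(len(strict_correct) / len(subsets))  # 0.125
print(1 / len(choices))                    # standard method baseline, about 0.33
```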

RQ1: Preference Elicitation
We use the value-questions benchmark described in Section 4.3.2 for investigating RQ1. The full benchmark comprises four different templates, each of which is instantiated using the train, development and test items described in Section 4.3.1. We report results for the default, and task-specific fine-tuned models, using both the standard method and the threshold method introduced in Section 4.2. We emphasize that, while there is a single default model per LRM, there is a task-specific fine-tuned model per LRM and per template. Hence, the full RQ1 evaluation involves a total of four default LRMs, and 16 task-specific fine-tuned LRMs.
We use the accuracy metric for reporting performance. A single accuracy estimate is reported for the standard method for each experimental setting (i.e., choice of LRM and template). In contrast, three accuracy estimates are reported for the threshold method, with each estimate corresponding to each of three ground-truths (Normal, Weak Normal and Weak). In a slight abuse of terminology, we use the name of the ground-truth itself to refer to the corresponding accuracy of the model being evaluated using that ground-truth.
We also report on the statistical significance of each result by using the one-sided z-test to evaluate whether the result is better than the expected random performance, i.e., the expected accuracy of a system that selects randomly from among the answer choices. For the standard method, the expected random performance is 33% for all experimental settings in RQ1, since only one out of three possible choices can be selected by any system being evaluated using the standard method (and exactly one choice is correct). For the threshold method, we similarly computed the fraction of correct answers for each of the three ground-truths, which would equal the expected random performance. We found the expected random performance for the Normal, Weak Normal and Weak ground-truth to be 12.5%, 25%, and 62.5%, respectively. In the results, we use a maximum Type-I error rate α = 0.05 to confirm significance (in other words, the one-sided P value, using the z-test, must not exceed 0.05 to be significant); however, the complete set of P values for all tests is also reproduced in the Appendix.
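A minimal sketch of such a significance test is given below. It assumes a one-sample z-test for a proportion, with the standard error computed under the null hypothesis; this specific form, the function name, and the example counts (a hypothetical test set of 100 questions) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(correct: int, total: int, p0: float) -> float:
    """One-sided P value for H1: true accuracy > p0 (expected random performance)."""
    p_hat = correct / total
    standard_error = sqrt(p0 * (1 - p0) / total)  # computed under the null
    z = (p_hat - p0) / standard_error
    return 1 - NormalDist().cdf(z)


alpha = 0.05
# Hypothetical results on 100 questions, compared against the 33% baseline.
print(one_sided_z_test(68, 100, 1 / 3) <= alpha)  # True: significant
print(one_sided_z_test(35, 100, 1 / 3) <= alpha)  # False: not significant
```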

RQ2: Thinking in Bets Without Task-Specific Fine-Tuning
We use the bet-questions benchmark described in Section 4.3.3 for investigating RQ2. The four default models used for RQ1 are also used for RQ2.
Furthermore, since the (task-specific) fine-tuned models used for RQ1 behave similarly across the four templates (as the results for RQ1 will show), we only report results for the (RQ1) fine-tuned LRM that was fine-tuned on the Choice Valuable template. In total, this yields four fine-tuned models that are used for investigating RQ2.
Note that, because of the nature of this research question, the three templates (each corresponding to a different modality, i.e., Coin, Dice and Card; see Table 4) contained in the benchmark only need to be instantiated for the development and test set items in Table 2. The former is only necessary for determining the optimal threshold for each evaluated model (four default and four fine-tuned) when using the threshold method, and is not needed when using the standard method.
The remainder of the testing procedure is similar to RQ1. When using the standard method, we report accuracy for each of the eight models described above, for each of the three modalities. Similarly, when using the threshold method, we report the three accuracy estimates (for each of the eight models) corresponding to the Strict, Positive Gain, and Non-Negative Gain ground-truth described earlier.
In addition to using the ordinary accuracy (which measures predictions against a ground-truth), this research question also seeks to assess whether [...]
Because the four default LRMs are already investigated in RQ2, we focus on the task-specific fine-tuned LRMs in RQ3. Specifically, we fine-tuned all four pre-trained LRMs on each of the three modalities, yielding a total of 4 × 3 = 12 task-specific fine-tuned LRMs.
One of the main goals in RQ3 is to evaluate whether a task-specific fine-tuned LRM that is fine-tuned using one modality (e.g., Card) is able to generalize reasonably well to the other two modalities (e.g., Coin and Dice). Hence, each of the 12 (task-specific fine-tuned) LRMs is evaluated separately on each of the Coin, Card and Dice data sets instantiated using the test items. This enables us to contrast an LRM's results when it is fine-tuned and tested on the same modality, versus a different modality.
We report results for both the standard and threshold method. Similar to the previous RQs, the development set is used for determining when to stop fine-tuning the model, and for selecting appropriate thresholds. A single accuracy estimate is reported when using the standard method for each experimental setting. Three accuracy estimates are reported for the threshold method, each corresponding to a ground-truth (Strict, Positive Gain and Non-Negative Gain).
Following RQ2, we use the name of the ground-truth itself to refer to the corresponding accuracy of the LRM when evaluated using that ground-truth.
Footnote 3: Where task-specific fine-tuning is involved, the model is fine-tuned using the same template as for the evaluation.
Statistical significance results are reported using the same methodology as for RQ2, i.e., by comparing each result, using the one-sided z-test, to the corresponding expected random performance. The expected random performance for both the standard method and the threshold method (per ground-truth) is identical to that reported earlier in RQ2, since the test sets used in RQ2 are identical to those used in RQ3.

RQ1: Preference Elicitation
Recall that the first research question (RQ1) involves evaluating whether, or which of, the different LRMs are able to successfully distinguish between a 'high-value' and 'low-value' item that they have not encountered during fine-tuning. As described in Section 4.4.1, four different templates are instantiated as 'value questions' for the purposes of investigating RQ1 to ensure that the results are robust to different choices of question and answer format with (effectively) the same content. We report the results in Table 5 for both the default and fine-tuned models. The results illustrate that, in the general case (i.e., for 12 out of 16 cases), the default models cannot distinguish between high-value and low-value items in a way that is (statistically) any better than random selection.
However, there are some interesting exceptions. For instance, the performance achieved by the BERT default model (68%, 56% and 52%) is significantly better than random for the Boolean Expensive, the Choice Expensive, and the Choice Valuable template, respectively. These results underscore the methodological decision to use four templates for robustly investigating RQ1, which must be (generally) borne in mind when evaluating LRMs because of their sensitivity to the particular form of input.
However, after they have been fine-tuned, the LRMs outperform their default counterparts and, in all cases, achieve over 90% accuracy with statistical significance. In other words, these models can distinguish between a 'high-value' and 'low-value' item after they have been fine-tuned on a data set that follows the same template, but is instantiated using different item sets (Table 2). This suggests that LRMs may already have the ability to distinguish between differently valued items, but need to be fine-tuned using appropriate prompts in order for us to access this ability (i.e., using similarly structured prompts, but with potentially different pairs of items). Additionally, if fine-tuning is permitted, the model also loses its sensitivity to the actual template. As shown in the table, performance is largely similar across the four templates for all the LRMs.
[...] Table 3: Boolean Expensive, Boolean Valuable, Choice Expensive, and Choice Valuable. Where fine-tuning is involved, the model is always fine-tuned using the same template as used during testing.
Figure 1 reports the accuracy using the threshold method and its associated ground-truths: Normal, Weak Normal, and Weak. Recall that the expected random performance for each of these ground-truths is 12.5%, 25%, and 62.5%, respectively. Focusing on the default LRMs' results, we observe that they are unable to distinguish between high- and low-value items better than random, regardless of which template and ground-truth is used, with few exceptions. In contrast, the fine-tuned LRMs exhibit performance that is well above random, and in some cases, near-perfect. Consistent with the previous result, this result suggests that LRMs may already have the ability to distinguish between high-value and low-value items, and that they only need to be fine-tuned using the appropriate prompt to access this ability.
Interestingly, we also find fine-tuned LRMs' performance to be identical across the three ground-truths, although the absolute performance depends on the specific LRM, with the fine-tuned BigBird consistently achieving the highest performance.

RQ2: Thinking in Bets Without Task-Specific Fine-Tuning
RQ2 was designed to test whether LRMs have (at least approximately rational) decision-making ability when bet questions are used as prompts. We investigate whether LRMs that have not been fine-tuned on bet questions, but are fine-tuned on value questions (also used for fine-tuning in RQ1), have such abilities. Note that the LRMs we used for investigating RQ2 are a subset of the LRMs used in RQ1. The default LRMs remain the same, whereas, for the fine-tuned LRMs, we only use the models fine-tuned on the 'Choice Valuable' template, owing to the homogeneous performance of each fine-tuned model across the four templates. The key difference between RQ1 and RQ2 lies in the prompt that is input to the models during testing. While value questions (instantiated using test set items) were used for investigating RQ1, bet questions are used for RQ2.
[...] Table 3. § indicates that the result is statistically better than random performance (33%) with 95% confidence. The complete set of P values is reproduced in the Appendix. Bold text indicates the best result, if statistically significant, for the given column and metric (e.g., ACC Default).
As discussed in Section 4.3.3, three bet modalities are instantiated for investigating the research question more robustly. We report both the default and fine-tuned LRMs' performance in Table 6 using two metrics: ordinary accuracy (ACC) and belief conditioned accuracy (BCA).
Considering first the default LRMs' performance using the ordinary accuracy metric, we find that the LRMs are not able to 'correctly' answer bet questions.
This result is qualitatively consistent with the results obtained for RQ1. However, on occasion, a better-than-random result is still obtained, such as for the default DeBERTa model on the Coin questions. In general, default models' accuracy (whether ordinary or belief-conditioned) does not exceed 50%.
Fine-tuned accuracy can be higher for some models and modalities, but a clear trend is not distinguishable, and results are not always significant. For example, the BERT fine-tuned model achieves a significant 52% performance on the Dice modality (using both the ACC and BCA metrics) but the DeBERTa fine-tuned model achieves lower performance than even its default counterpart on the Coin modality. Hence, unlike for RQ1, fine-tuning does not yield definitive performance improvements on the task. Even when such improvements are observed, they are relatively lower than the near-perfect results obtained for RQ1 following fine-tuning.
It is also interesting to note that the BCA performance, on average, does not (significantly and consistently) exceed ACC performance for any given model.
Recall that the BCA is computed by constructing a 'ground truth' that is based on the model's own beliefs. In the Appendix, we also report performance when evaluating the models using the threshold method, but the conclusions were found to be largely consistent with Table 6. Namely, we found that default or (value-questions) fine-tuned LRMs are unable to correctly choose, on average, the option that maximizes the expected gain, or that even achieves positive expected gain.
One potential reason that a given model is unable to answer bet questions, despite being able to distinguish between high- and low-value items, may be its inability to understand the prompt at a syntactic level, since it has not been fine-tuned on bet questions. In RQ3, we investigate this hypothesis in more detail by evaluating models that have been fine-tuned on bet questions.

RQ3: Thinking in Bets After Task-Specific Fine-Tuning
In contrast with the previous experiment, in this experiment, we evaluate the LRMs after fine-tuning them on bet questions. There are two experimental goals: first, does fine-tuning on bet questions, instantiated with one set of high- and low-value items ('train' set), improve performance on bet questions instantiated with a different set of high- and low-value items ('test' set), similar to what was observed for RQ1? Second, how is performance affected when fine-tuning is conducted using one modality (e.g., Card) but testing uses a different modality (e.g., Dice)?
Figure 2: The accuracy, with standard error bars (when non-zero), of fine-tuned LRMs on three instantiated bet-modality datasets (Card, Coin, Dice). Exact values are reproduced in the Appendix. The 'train' and 'test' sets are populated using the (respectively named) item sets in Table 2. The train set is always used for fine-tuning, and in some experimental conditions (blue bars) is also used for testing. The modality used during fine-tuning is indicated along the vertical axis, while the modality used during testing is indicated along the horizontal axis. All results are statistically better than random performance (33%) with 95% confidence. The complete set of P values is reproduced in the Appendix.
[...] fine-tuned using the three different modalities on the 'train' set items, and evaluated using the three modalities on both the 'train' and 'test' set items.
Our reason for also evaluating each fine-tuned model on the train set is to assess the impact of modality on performance while controlling for the item-sets.
We find that all results are statistically significant compared to random performance. Note that, since the test-set BCA was found to be exactly the same as the corresponding test-set (ordinary) accuracy, BCA results are not shown in the figure.
The diagonal plots in Figure 2 show that all fine-tuned LRMs achieve near-perfect performance regardless of whether the 'train' or 'test' set is used for evaluation. This provides some support for our earlier claim that there may be a strong dependency on the prompt and its format, and that the choice of item-set matters much less. In other words, when the bet modality is known in advance and can be used for fine-tuning, the performance of the model is expected to be high (at least for the equi-probable bet modalities considered here).
Additionally, when we consider the figure as a whole and compare the 'train' and 'test' performance (blue versus orange) in each experimental setting, we find that there is no strong dependence on the choice of the item-set used during evaluation. Surprisingly, the model does not gain a noticeable advantage from 're-observing' the 'train' item set during evaluation. This may be because we explicitly designed the benchmark to avoid potential overfitting: the same pair of items is used in several bet questions, but with different 'optimal' outcomes.
For example, in one bet question, a high-value item may be on the losing side of the bet, and a low-value item on the winning side (and vice versa, in another bet question). Therefore, the fine-tuned LRMs are unable to 'memorize' their way to the correct answer, as would ordinarily be expected when the train-set is re-used during testing.
Turning to the two off-diagonal or cross-modal entries in each row of the figure, we find that all four LRMs show about 25% decrease in performance, compared to their respective diagonal entries. This provides further evidence of the dependence of performance on modality, regardless of item-set used. However, despite this relative decline in performance, the fine-tuned LRMs still achieve an average accuracy of around 70% in most experimental settings, which is well above random performance and qualitatively within reach of state-of-the-art performance on many QA benchmarks [46], [74], [72]. The performance is especially striking compared to the results in the previous section, where most LRMs' performance could not be statistically distinguished from random performance, and the best performance was only 53%.
Additionally, there is no one LRM that was found to consistently outperform the others when comparing models in each cross-modal experiment (or off-diagonal plot). For example, while DeBERTa exhibits the best performance in some cross-modal settings (e.g., when fine-tuning on Card and testing on [...]
[...] Table 2, with the train set items always used for fine-tuning. The fine-tuning and evaluation modalities are indicated along the vertical and horizontal axis, respectively. The three ground-truths were introduced in Section 4.3.3.
Similar to the previous result, the highest performance observed (across LRMs) in the diagonal entries is near-perfect, regardless of the ground-truth used. However, some models do much worse than others in the same experimental setting. For instance, we observe performance (using the Strict ground-truth) as low as 40% for DeBERTa, when it is fine-tuned and evaluated on the Dice modality.
In the cross-modal (off-diagonal) setting, we observe an average decline of more than 40% when evaluating the LRMs using the Strict ground-truth. Compared to Figure 2, this result suggests that the threshold method proves to be a more challenging evaluation paradigm for the LRMs than the standard method, at least when the optimal answer is expected (captured by the Strict ground-truth). As expected, the performance rebounds when the model is only expected to choose outcomes that have positive expected gain or non-negative expected gain. In line with the previous results, we find again that performance of the LRMs is strongly dependent on the modality used during fine-tuning, and that their ability to generalize to other modalities is limited. This claim is also indirectly supported by the considerable variance observed across models and settings, even when using the less conservative (Positive Gain and Non-Negative Gain) ground-truth.
Focusing on the Positive Gain results in the cross-modal setting, we observe several near-zero entries (e.g., when BERT was fine-tuned on the Card modality and tested on the Coin and Dice modalities, as well as when BigBird was fine-tuned on the Coin modality and tested on the Card modality), because of the LRMs' inability to identify any outcome with positive expected gain, even when such outcomes are present among the candidate answer choices.
Interestingly, when evaluated using the Non-Negative Gain ground-truth, the LRMs do show some evidence of generalization. For instance, even in the cross-modal setting, we observe that their performance decreases (on the Non-Negative Gain ground-truth) by a lower margin than on the other two ground-truths. This result is consistent with the one in RQ2, suggesting that they are better (and also generalize better) at identifying non-negative outcomes rather than strictly positive and optimal outcomes.

Summary of Results
This section summarizes the key results from the previous section:
1. In investigating RQ1 (Section 4.4.1), we found that the LRMs can distinguish between low- and high-value items (with accuracy above 90%), if they are fine-tuned on a train set with the same template as the test set, even if the latter is instantiated with items not seen before during fine-tuning. This result also suggests that our choice of high- and low-value items is not arbitrary, and that the model largely agrees with our distinction between these two item-sets. Without the template-specific fine-tuning, however, the default LRMs' performance was found to be statistically indistinguishable from random performance.
2. In investigating RQ2 (Section 4.4.2), the results show that the default LRMs cannot make 'rational' bets (with few exceptions) any better than random guessing. Fine-tuning on the value questions was not found to make any noticeable difference in performance. Even when evaluating LRMs' performance using Belief Conditioned Accuracy (BCA), we still found their performance to be indistinguishable from random performance. In other words, even when conditioned on its own belief about whether one item is (relatively) higher-valued than another item, the model is unable to make a rational bet involving the two items.
3. In investigating RQ3 (Section 4.4.3), we found that the LRMs can make rational decisions when they are fine-tuned on bet questions. If the evaluation modality (say, Coin) is the same as the fine-tuning modality, the performance is typically above 95%, regardless of the item-set used (i.e., train or test). If the evaluation modality is different from the fine-tuning modality, the LRMs' performance is noticeably lower, but still exceeds 70% accuracy on average. This suggests that the models are able to achieve (limited) generalization beyond the fine-tuning modality.
4. Comparing the three ground-truths (Strict, Positive Gain, and Non-Negative Gain), we found that the LRMs achieve higher performance, and generalize better, on the easier problem of choosing outcomes with non-negative expected gain rather than outcomes that lead to strictly positive expected gain. Similarly, the models perform even worse (on average) in selecting outcomes that are optimal (Strict) than those that only lead to positive expected gain, but may not necessarily be optimal.

Discussion
Based on the results of RQ1, a natural question arises as to why near-perfect performance was observed on the value questions after fine-tuning the LRMs.
Although the LRMs might be 'learning' to prefer high-value items over low-value items due to the fine-tuning, we consider this possibility to be unlikely because the data set used for fine-tuning is relatively small, and because the test items are significantly different from the training items (and were independently selected). Rather, the likely reason is that the LRMs are sensitive to format, and that fine-tuning the model was akin to teaching it the format and the semantics of the preference being elicited. This allowed it to generalize to different pairs of (unseen) items, and to learn the correct preference function.
While it is certainly possible that the model has learned the same preference function that we used to construct the item sets (i.e., by determining, in a commonsense fashion, whether an item was high-value or low-value), this claim is notoriously difficult to prove due to the black box nature of the LRM.
However, the empirical evidence strongly suggests that, after controlling for format, there is agreement between the LRMs' preference function and ours.
Indeed, the near-perfect accuracy on the test set shows that, on average, the LRMs' assignment of high- and low-value items agrees with ours. More importantly, because the LRMs were independently able to replicate our assignment of test items to high- and low-value buckets, errors in the LRMs' performance, including in RQ2, cannot be explained by (hypothetical) arbitrariness in our assignment of high- and low-value items. This is further confirmed by the BCA results when evaluating RQ2. Theoretically, if such arbitrariness had existed (or for some other analogous reason), the BCA could have been very different from the ordinary accuracy. One fundamental reason why such a clear alignment was observed between BCA and ordinary accuracy (which, as explained earlier, is due to the agreement between the LRMs' implicit preference function and ours) may be that our items are rather extreme in terms of their assigned value. We suspect that BCA will start empirically diverging from standard accuracy when the value difference between items is (arguably) more ambiguous. For example, would an LRM consider a silver ring to be more valuable than a platinum ring (as opposed to, say, a plastic pen, which is a far easier distinction to make)? Conducting similar experiments while controlling for the extremity between the putative values of high- and low-value items is an interesting agenda to consider for future research.
Considering RQ3, when the fine-tuned LRMs were tested on a modality different from the one used during fine-tuning ('cross-modal setting'), they were more likely to correctly pick outcomes with non-negative expected gain than outcomes with positive expected gain. Although this might be because the former problem is seemingly easier than the latter, there could also be a methodological explanation. Recall that in Section 4.3.3, we constructed the bet questions in two different ways. One type of question included among its choices an outcome in which the bet-maker wins a high-value item, and another outcome in which the bet-maker loses a low-value item. Another type of question had as choices an outcome in which the bet-maker loses a high-value item, and an outcome in which the bet-maker wins a low-value item. While we reproduce the mathematical expressions of the expected gain of each outcome for each of the two kinds of bet questions in the Appendix, intuitively, the outcome that maximizes the expected gain of the second type is 'do not bet' (which has an expected gain of 0).
In fact, our results show that there is no one modality that leads to clearly better generalization (defined as using different item-sets and different modalities during testing, compared to fine-tuning). For example, the LRMs were found to achieve the best performance, on average, on the Card modality when they were fine-tuned on the Coin modality. In contrast, they achieved the best aggregate performance on the Dice modality when they were fine-tuned on the Card modality.
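The relationship between the three ground-truths and the 'do not bet' choice can be made concrete with a small sketch. It assumes that each answer choice already has an expected gain attached to it; the function name `ground_truths` and all numeric gains below are hypothetical, chosen only to be consistent with the statement that 'do not bet' (expected gain 0) is optimal for the second type of question. The actual expressions are those given in the Appendix, not the numbers used here.

```python
from typing import Dict, Set

DO_NOT_BET = "Do not bet"


def ground_truths(gains: Dict[str, float]) -> Dict[str, Set[str]]:
    """Return the set of acceptable choices under each ground-truth.

    `gains` maps every answer choice to its expected gain.
    """
    best = max(gains.values())
    return {
        # Strict: only the choice(s) with maximal expected gain.
        "Strict": {c for c, g in gains.items() if g == best},
        # Positive Gain: any choice with strictly positive expected gain.
        "Positive Gain": {c for c, g in gains.items() if g > 0},
        # Non-Negative Gain: any choice with expected gain of at least 0.
        "Non-Negative Gain": {c for c, g in gains.items() if g >= 0},
    }


# Hypothetical expected gains (illustrative numbers, not from the paper).
# First type: one outcome wins a high-value item, another loses a low-value item.
first_type = {"Head": 50.0, "Tail": -0.5, DO_NOT_BET: 0.0}
# Second type: every betting choice is assumed to have negative expected gain,
# so 'do not bet' is the optimal choice.
second_type = {"Head": -50.0, "Tail": -0.5, DO_NOT_BET: 0.0}

for name, gains in [("first type", first_type), ("second type", second_type)]:
    print(name, ground_truths(gains))
```

Under these assumed numbers, the second type of question has no choice with strictly positive expected gain, while 'do not bet' still counts as correct under both the Strict and the Non-Negative Gain ground-truths. This is one way in which the construction of the questions, and not only the difficulty of the problem, could influence the gap between the Positive Gain and Non-Negative Gain results.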
Finally, although three modalities were used in our experiments, there are (potentially) an infinite number of modalities that could be devised for probing these models' decision making. This raises the (open) question of whether there is a 'general' fine-tuning procedure that would result in consistently high average performance across any reasonable modality, rather than only on the one used during fine-tuning.

Conclusion and Future Work
Modern Language Representation Models (LRMs), based on transformer neural networks, have rapidly exceeded the previous state-of-the-art on a range of natural language understanding tasks, including question answering, text summarization, and information extraction [91,92,93]. In this article, we addressed the question of whether such LRMs can be adapted for (approximately) rational decision-making and preference elicitation. In the cognitive science literature, such decision-making is often evaluated using bets. Given the near human-like performance of LRMs on language-based problems, we formulated a set of research questions to specifically test whether: (i) LRMs have a distinct preference for high-value items over low-value items, especially when the items were not seen during training, and after stratifying by the format of the questions, (ii) LRMs can make, or be taught to make, (approximately rational) bets in a generalizable manner, including when an LRM has been fine-tuned on one 'modality' of bet, but is evaluated on another modality.
We constructed a set of novel benchmarks to empirically test these hypotheses using four established transformer-based LRMs. Our first set of results shows that, while LRMs can distinguish between unseen high- and low-value items, they only do so after stratifying by the format of the questions through fine-tuning. This is despite the question being expressed in relatively simple language, and the items being of an everyday nature. The second set of results is similar: LRMs can only make bets (whether posed using the same, or different, modality as the training set) once they have been fine-tuned on similar bet questions. We find, furthermore, that changing the modality of the bet typically leads to a noticeable drop in performance, although performance remains much higher than random. Thus, while the models do seem to be generalizing, their ability to do so is limited, at best. For a subset of the bet questions, we also find some evidence of overfitting.
There are many promising avenues for future research. Our experiments have only probed the surface of these LRMs' decision-making abilities, since our benchmarks test decision-making when the number of outcomes is limited and equi-probable, and the putative value difference between the pair of items (high- and low-value) is extreme. It remains to be seen whether newer models (such as T5 [94]) would generalize more effectively in the cross-modal setting, and to more complex decision-making. Considering that there are infinitely many decision-making modalities (in theory), the larger question remains as to the general methodology or approach required for the LRMs to achieve human-level decision-making performance on any reasonable modality. This question is motivated by the observation that, in the real world, decisions are not framed precisely or explicitly, and the modalities used (if any) are unknown a priori.
To be applicable in such situations, therefore, LRMs need to be able to make decisions as a fundamental capability, as opposed to some form of brute-force fine-tuning on ever larger corpora.
Finally, another direction that could be explored is the use of generative models, such as the Generative Pre-trained Transformer 3 (GPT-3) model [3], for decision making. Such models have yielded promising results on zero-shot learning problems, and have even been shown to exhibit human-like creativity.
They may be more amenable to decision making in an open-ended setting when no option is given (or can be computed in advance), but an evaluation to that effect has not been conducted yet and is a promising avenue for future research.
Do Not Bet: X − X = 0

... where the model selects both 'Head' and 'Tail' as its (multi-label) prediction, in which case X is assumed to be equally split and wagered on both heads (X/2) and tails (X/2).

Table 13: P values corresponding to the results in Table 6, when comparing to random performance of 33%.

Threshold method results for RQ2
Table 14 reports the LRMs' results on the three bet modalities using the threshold method detailed in Section 4.2, using the three different ground truths discussed in Section 4.4.2.

Table 14: ... Table 4, using the threshold method for each of three different ground-truths (Strict, Positive Gain, Non-Negative Gain) discussed in Section 4.3.3. Where fine-tuning is involved, the LRMs are fine-tuned on (Choice Valuable) value questions introduced in Table 3. § indicates that the result is statistically better than random performance with 95% confidence.

The threshold results confirm that the LRMs cannot correctly choose the option that maximizes the expected gain (Strict), or even one that achieves positive expected gain (Positive Gain). The best performance, among these two ground-truths, is only 25%, achieved by the fine-tuned BigBird using the Strict ground-truth.

However, we do observe some improvements when evaluating the LRMs using the Non-Negative Gain ground-truth; e.g., in this setting, fine-tuned BigBird is able to achieve 52%. This improvement in performance suggests that selecting options that have non-negative expected gain may be an easier problem for the LRMs than selecting options with strictly positive, and optimal, gains.
Moreover, we observe some general improvements when we compare the results of the default models to those of the fine-tuned models. This suggests that, using the threshold method, the fine-tuned LRMs can perform slightly better than their default counterparts, but are still not able to significantly outperform random selection. Interestingly, we observe that LRMs consistently achieve better performance on the Card modality than on the other two modalities, which suggests that the Card modality might be an 'easier' modality (at least for the LRMs) than the other two. In general, the conclusion from the threshold-based results is in alignment with the previous conclusion that neither the default nor the (value questions) fine-tuned LRMs perform better than random in making rational bets.
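As a rough illustration of what 'better than random with 95% confidence' can mean in practice, the sketch below computes a one-sided exact binomial P value for an observed accuracy against a chance level. The text does not state which statistical test was used, so the choice of test, the function name `binomial_p_value`, and the question count of 100 are all assumptions made for illustration only.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float) -> float:
    """One-sided exact binomial P value.

    Probability of answering at least `correct` out of `total` questions
    correctly if every answer were an independent guess that is right
    with probability `chance`.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical example: 100 questions and a chance level of 1/3
# (the 'random performance of 33%' baseline).
total_questions = 100  # assumed sample size, not taken from the paper
for accuracy in (0.25, 0.52):
    n_correct = round(accuracy * total_questions)
    p = binomial_p_value(n_correct, total_questions, 1 / 3)
    verdict = "better than random" if p < 0.05 else "not distinguishable from random"
    print(f"accuracy={accuracy:.2f}  p={p:.4g}  ->  {verdict} at the 95% level")
```

With a baseline of 25% instead of 33%, the same function can be called with `chance=0.25`. The resulting P value depends strongly on the number of questions, so these illustrative numbers should not be read as a reproduction of the reported tables.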
The P values corresponding to the results in Table 14 are tabulated in Table 15.

Table 16: P values corresponding to the results in Figure 2, when comparing to random performance of 33%.

Table 18: P values corresponding to the 'Positive Gain' results in Figure 3, when comparing to random performance of 25%.

Table 19: P values corresponding to the 'Non-Negative Gain' results in Figure 3, when comparing to random performance of 25%.