A comparison of chain-of-thought reasoning strategies across datasets and models

Emergent chain-of-thought (CoT) reasoning capabilities promise to improve the performance and explainability of large language models (LLMs). However, uncertainties remain about how reasoning strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl, and Cohere command-xlarge). We test them on six question-answering datasets that require real-world knowledge application and logical verbal reasoning, including datasets from scientific and medical domains. Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets. GPT-4 benefits the most from current state-of-the-art reasoning strategies and performs best when applying a prompt previously generated through automated prompt engineering.


INTRODUCTION
Emergent chain-of-thought (CoT) reasoning capabilities in large language models (LLMs) promise to improve both predictive performance and explainability of models when applied to complex tasks (Wei et al., 2021). While good performance can be achieved by few-shot in-context prompting with exemplars suitable to a specific task, zero-shot prompting setups do not require a task-dependent selection of exemplars (Kojima et al., 2022). The recent success of models optimized for dialog, such as GPT-3.5, further raises the expectation that models reach robust performance with ad-hoc reasoning strategies and are less influenced by minor prompt variations. This study empirically investigates how previously discovered zero-shot CoT prompting styles generalize to new model generations and datasets, and how they compare to newly developed reasoning strategies. We conduct our evaluations on six question-answering datasets of varying levels of complexity, including datasets from scientific and medical domains (Table 1). Portions of this text were previously published as part of a preprint (https://arxiv.org/abs/2305.02897).

Table 1 Overview of the six question-answering datasets.

CommonsenseQA: General-domain crowd-sourced questions with high semantic complexity that require the use of prior knowledge.
StrategyQA: General-domain crowd-sourced questions that require implicit reasoning and multi-step answer strategies. Yes/No answers.
WorldTree v2: Elementary science questions at the 3rd- to 5th-grade level, combining domain-specific and world knowledge.
OpenBookQA: Scientific and broad common-knowledge questions that require multi-step reasoning and rich text comprehension.
MedQA: Questions from medical board exams. We used only examples from the US (USMLE subset).
MedMCQA: Real-world medical entrance exam questions.

Datasets
For our study, we used the ThoughtSource framework (Ott et al., 2023), which provides a comprehensive meta-dataset and software library designed for streamlined generation, evaluation, and annotation of chain-of-thought (CoT) reasoning. We focused on real-world datasets that combine knowledge application with logical, verbal reasoning. We covered a sizable range of topics and complexity levels by selecting subsamples of six question-answering datasets spanning common-sense (Talmor et al., 2019; Geva et al., 2021), scientific (Xie et al., 2020; Mihaylov et al., 2018), and medical domains (Jin et al., 2021; Pal, Umapathi & Sankarasubbu, 2022) (Table 1). These datasets were multiple-choice, consisting of two to five answer options with a single correct response. We used a template to structure the input handed to the model. Each of our chain-of-thought prompts is placed either before the question, which we call an ''instruction'', or after the question, which we call a ''cot-trigger''. Table 2 shows the exact template. An extra new line was added after the instruction or before the cot-trigger to separate the chain-of-thought prompt from the question.

Prompts
We assembled a set of ten zero-shot reasoning strategies (Table 3), consisting of one baseline, two pre-existing designs, and seven novel designs:

1. Direct prompting: no specific trigger, serving as a baseline for comparison.
2. Kojima: a well-established CoT prompt, ''Let's think step by step.'' (Kojima et al., 2022).
3. Zhou: an enhanced version created through automated prompt engineering, ''Let's work this out in a step by step way to be sure we have the right answer.'' (Zhou et al., 2023b).
4. Seven original reasoning strategies we designed, inspired by various public resources (OpenAI, 2023a; Schulhoff, 2022) and refined through iterative adaptation based on analyzing outputs. One of these strategies employed a self-critique approach, requiring the model to provide an initial answer, critique it, and then propose a revised response (Madaan et al., 2023; Saunders et al., 2022). Another mimicked a conversation: ''Create a dialog between a professor and a student. The student asks sub-questions to the question. The professor works them out in a step by step way and makes sure that the student understood how they got to the right answer.''
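For illustration, the strategies quoted above can be collected in a simple lookup table (the key names are our own shorthand; only the prompts quoted verbatim in the text are reproduced here, not all seven original designs):

```python
# Zero-shot reasoning strategies quoted in the text, keyed by shorthand name.
PROMPTS = {
    "direct": None,  # baseline: no trigger appended
    "kojima": "Let's think step by step.",
    "zhou": ("Let's work this out in a step by step way "
             "to be sure we have the right answer."),
    "professor-dialog": (
        "Create a dialog between a professor and a student. "
        "The student asks sub-questions to the question. "
        "The professor works them out in a step by step way and makes sure "
        "that the student understood how they got to the right answer."),
}
```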
Between February and April 2023, we conducted 11,880 experiments, with the model temperature set at 0 for maximal determinism. We limited the output to 512 tokens to allow for thorough reasoning while preventing the occasional issue of infinite sequence repetition observed in smaller models. We used the LangChain framework (Chase, 2022) to access the various APIs. Usage costs were $190 for OpenAI's models through their API and $30 for Flan-T5-xxl through a Hugging Face Inference Endpoint; Cohere's model was accessed for free via their API.

Evaluation
We selected Krippendorff's alpha as our evaluation metric (Krippendorff, 2011). It allows for combining results from sub-datasets with different numbers of answer choices by correcting for their corresponding base probability rates. Krippendorff's alpha measures inter-rater reliability on a scale from zero (random chance) to one (complete agreement) and was used to compare model predictions to gold-standard answers (Castro, 2017). We performed a power analysis using the formula below to determine an appropriate sample size.
N ≥ z_p² · (1 + α_min)(3 − α_min) / (4(1 − α_min) · P_c(1 − P_c))

where:
1. P_c is the probability of value c,
2. α_min is the smallest α for coding to be accepted as reliable,
3. p is the level of significance,
4. z_p is the standardized z-statistic at p.
We performed a power analysis with a significance level set at 0.05, a minimum Krippendorff's alpha value of 0.8, and a base correct probability of 0.2, considering the maximum of five answer options in our sub-datasets. The analysis yielded a required sample size of 164 items, which we increased to 198 items, divided into six sub-datasets of 33 each. We used bootstrapping (r = 1,000) to compute means and confidence intervals for the generated results. To guarantee accurate Krippendorff scores, which depend on the number of options, we bootstrapped each sub-dataset individually when needed and calculated confidence intervals by pooling standard deviations.
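To make the evaluation procedure concrete, the following sketch implements two-rater nominal Krippendorff's alpha (model predictions against gold answers, no missing data) together with a simple percentile bootstrap for confidence intervals. This is a minimal illustration under our own simplifying assumptions, not the exact implementation used in the study:

```python
import random
from collections import Counter

def krippendorff_alpha(preds, golds):
    """Nominal Krippendorff's alpha for two raters without missing data."""
    values = list(preds) + list(golds)
    n_total = len(values)                 # 2 * number of units
    counts = Counter(values)              # n_c: frequency of each value
    # Observed disagreement: each disagreeing unit contributes two
    # ordered coincidences of unequal values.
    d_o = sum(2 for p, g in zip(preds, golds) if p != g)
    # Expected disagreement: chance pairings of unequal values.
    d_e = sum(counts[c] * counts[k]
              for c in counts for k in counts if c != k)
    if d_e == 0:
        return 1.0                        # degenerate: only one value observed
    return 1.0 - (n_total - 1) * d_o / d_e

def bootstrap_ci(preds, golds, r=1000, seed=0):
    """Bootstrap mean and 95% percentile CI of alpha by resampling items."""
    rng = random.Random(seed)
    items = list(zip(preds, golds))
    scores = []
    for _ in range(r):
        sample = [rng.choice(items) for _ in items]
        p, g = zip(*sample)
        scores.append(krippendorff_alpha(p, g))
    scores.sort()
    mean = sum(scores) / r
    return mean, (scores[int(0.025 * r)], scores[int(0.975 * r)])
```

Perfect agreement yields α = 1.0, while chance-level agreement yields values near zero, which is what makes the metric comparable across sub-datasets with different numbers of answer options.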

RESULTS
All scores within this paper are displayed with 95% confidence intervals (CI). Although the performance of many prompts averaged over all datasets is notably similar, applying reasoning strategies outperforms direct prompting. A closer examination of the results obtained from the latest model, GPT-4, highlights the advantage of employing specific prompts (Table 4). It shows the retained performance of the automatically discovered prompt by Zhou et al. (2023b), which also achieves a notable result in the score averaged over models. Interestingly, the self-critique prompt yielded relatively low scores. It also generated multiple answers in various observed instances, which were excluded from the scoring process. The 512-token output limit was reached in only 80 of 11,880 experiments and had no significant impact on our results. This occurred primarily due to runaway sequence repetitions in smaller models or a prompt designed to mimic conversations.
Better models find WorldTree v2 and CommonsenseQA increasingly easy, while StrategyQA suffers from ambiguous items. This highlights the need to develop more refined general-knowledge datasets or employ domain-specific datasets. The two medical datasets were the most difficult to solve (Table 5).
GPT-4 and GPT-3.5-turbo performed best (Table 6). FLAN-T5 shows surprisingly good performance for its size, but its results are probably affected by data contamination: it was instruction-finetuned on the sub-datasets CommonsenseQA and StrategyQA. That effect shows clearly in its score on CommonsenseQA, where FLAN-T5 reaches a score similar to GPT-4 (Table 7). It remains an open question why the data contamination did not equally affect the score of FLAN-T5 on StrategyQA. Table 7 also shows a large performance difference on the specialized medical datasets, where only the top models GPT-4 and GPT-3.5-turbo displayed decent performance. The Krippendorff's alpha scores clearly reveal that FLAN-T5 performs only marginally better than chance on the MedQA dataset (GPT-4 vs. FLAN-T5: .55 vs. .02) and convey the actual performance gap to GPT-4 much better than the accuracy scores do (Table 8, GPT-4 vs. FLAN-T5: .65 vs. .22).
Comparing the scores of direct prompting with all of the prompts for externalized reasoning (Table 9) shows that the models Command-XL and GPT-4 profit the most from externalized reasoning, whereas for FLAN-T5, direct prompting still ranks among its optimal methods. The positioning of the externalized reasoning at the end of the input (''Zhou'') vs. at the beginning of the input (''Zhou-instruction'') made little difference for most of the models, except for Davinci-002, where putting the externalized reasoning at the end showed better performance.
Different prompts work well on specific datasets (Table 10). Comparing the two overall best prompts, ''Zhou'' and ''Rephrase'', shows that ''Zhou'' excels at OpenBookQA but not at StrategyQA, whereas ''Rephrase'' excels at StrategyQA but not at OpenBookQA. This could be because rephrasing helps untangle the ambiguous formulation of the questions in StrategyQA. The comparison of prompts on the MedQA dataset is made difficult by the low average scores achieved by several models, which makes the effects of specific prompts harder to detect.
The results reported as accuracy values can be found in the appendix.

CONCLUSION
Our findings suggest that using reasoning strategies significantly improves performance beyond what is achieved through direct prompting alone. Interestingly, this improvement does not strongly correlate with model size, as both GPT-4, a very large model, and Command XL, a smaller model, benefit the most from employing reasoning strategies. The state-of-the-art prompts developed by Zhou and Kojima demonstrate robust performance across a wide range of scenarios (Zhou et al., 2023b; Kojima et al., 2022). Comparing different prompt strategies presents challenges due to ceiling effects in larger models, which easily handle simpler datasets; conversely, smaller models struggle with the more challenging medical datasets irrespective of the prompt strategy. As expected, GPT-4, the most powerful of the models tested, consistently outperforms the others across all datasets. We found that Krippendorff's alpha is a practical and intuitively understandable metric for evaluating performance on multiple-choice datasets, proving especially useful when merging scores from datasets with varying numbers of answer choices.
Limitations. The presented work has several limitations. Our study aimed to test various combinations of prompts, datasets, and models under budgetary constraints. To achieve this, we opted to subsample datasets based on a statistical power analysis. This limits the direct comparison of our results to evaluations on full benchmark test sets. Upon inspecting results for some academic benchmark datasets generated through crowdsourcing, we found that the quality of a sizable subset of examples was not optimal. One typical pattern was that questions and answer choices did not allow for clearly picking a single best answer; instead, multiple options were reasonable. More advanced models tend to correctly point out such problems in their reasoning response and refrain from selecting a single answer choice. We did not use methods such as self-consistency (Wang et al., 2022) that maximize final accuracy at the expense of practical interpretability, i.e., we targeted situations in which users expect a single, high-quality, and easily interpretable reasoning chain rather than a collection of noisy reasoning chains. Results achieved when using prompts in conjunction with ensemble methods might therefore differ.
Our study included state-of-the-art closed-source models, which constantly change, making replication and comparisons over time difficult. We partially address this concern by making all data generated by the models at the time of our experiment openly available. The lack of documentation of closed models also raises concerns about the contamination of training data with benchmark datasets. According to the GPT-4 technical report, the influence of data leakage during pre-training is relatively minor (OpenAI, 2023b). However, this assertion might not hold across the board. Fine-tuning on parts of the benchmark data would, of course, have a much bigger effect (Zhou et al., 2023a). Additionally, the continuous retraining of models like GPT-4 on collected usage data poses another leakage threat, as benchmark dataset examples could be reintroduced to the model in altered forms, circumventing basic string-matching defenses (Balloccu et al., 2024). While data contamination issues do not severely impact our comparison of different prompts, we caution against strongly interpreting results across different models. We noted that Flan-T5 (Longpre et al., 2023), instruction-finetuned on the subsets of CommonsenseQA and StrategyQA, outperformed GPT-3.5-turbo on CommonsenseQA.
The dataset-specific performance comparison between text-davinci-003 and text-davinci-002 presents an interesting finding. Text-davinci-003 only demonstrates superior performance over text-davinci-002 on two of six datasets, with both models scoring equally on the other four. Remarkably, the datasets where text-davinci-003 leads do not pair up by domain: it shows an advantage on the medical dataset MedQA but not on MedMCQA, and on the scientific dataset OpenBookQA but not on WorldTree. This raises questions, as one would anticipate that the next model generation would demonstrate improved performance across datasets on the same topic.
Our objective was to evaluate the efficacy of state-of-the-art prompts across diverse models and question-answering datasets to determine their performance consistency under various conditions. To achieve this, we developed new prompts by integrating insights from recent research or adapting high-performance prompts to our needs. We acknowledge that prompts which are semantically similar but differ syntactically can lead to varied outcomes, as observed in medical datasets (Liévin, Hother & Winther, 2022). Future research could explore the comparison of semantically similar prompts, such as through paraphrasing, to further this understanding.
Related work. Several related studies evaluated zero-shot prompting performance. As a notable example, Liévin, Hother & Winther (2022) performed a comparable zero-shot CoT evaluation focused on medical datasets. Earlier work evaluating multiple models and datasets zero-shot includes commonsense data (Zhou et al., 2020) and an assessment of the performance of T0 on multiple-choice tasks (Orlanski, 2022). HELM (Liang et al., 2022) covers a wide range of model comparisons. Zero-shot reasoning can also be enhanced by generating precise reasoning steps that build upon each other (Ling et al., 2024) or by including logic dependencies for iterative verification and revision (Zhao et al., 2023). Our study adds to current knowledge by focusing on finding simple and versatile chain-of-thought prompting approaches that work across a spectrum of models. Our included question-answering datasets go beyond simple mathematical questions that could also be solved via an external tool; instead, they test a combination of logical, verbal reasoning with real-world knowledge.
Future work. The current study can be extended by evaluating prompts and datasets with additional models, particularly the multitude of openly available LLMs like LLaMa, the Pythia suite, dialog-tuned models like Alpaca (Touvron et al., 2023; Biderman et al., 2023; Taori et al., 2023), StableLM (Stability AI, 2023), and OpenAssistant (LAION, 2023). Finally, user evaluations of the quality and explanatory utility of reasoning chains generated by different prompts and models need to be conducted.

Table 2
Comparison of prompt locations. Each prompt is used either at the end of the input (Chain-of-Thought prompt as trigger) or at the beginning (Chain-of-Thought prompt as instruction), with its type detailed in Table 3. Example from CommonsenseQA.

Table 3
Prompt descriptions. Prompt names with templates in brackets (see Table 2 for details) and their corresponding text.

Table 4
Performance of prompts. Krippendorff's alpha (α) performance of prompts averaged over datasets. Averages taken solely for GPT-4 and over all six models, best results in bold. N total = 11,880.

Table 5
Performance on datasets. Krippendorff's alpha (α) performance on datasets averaged over models and prompts, best results in bold. n per dataset = 1,980; N total = 11,880.

Table 6
Performance of models. Krippendorff's alpha (α) performance of models averaged over datasets and prompts, best results in bold. N total = 11,880.

Table 7
Performance of models per dataset. Krippendorff's alpha (α) performance of models per dataset averaged over prompts. Average over 330 items per model/dataset pair, best results in bold. N total = 11,880.

Table 8
Accuracy of models per dataset. Accuracy of models per dataset averaged over prompts. Average over 330 items per model/dataset pair, best results in bold. N total = 11,880.

Table 9 Performance of prompts per model.
Krippendorff's alpha (α) performance of prompts per model averaged over datasets. Average over 198 items per prompt/model pair, best results in bold. N total = 11,880.

Table 10 Performance of prompts per dataset.
Krippendorff's alpha (α) performance of prompts per dataset averaged over models. Average over 198 items per prompt/dataset pair, best results in bold. N total = 11,880.