Automatic Chain of Thought Prompting in Large Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Alexander J. Smola

Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps in prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt such as "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations, each composed of a question and a reasoning chain that leads to an answer.
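The two paradigms above can be sketched as prompt-construction helpers. This is a minimal illustration of the prompt formats only; the function names and demonstration text are made up for this sketch, not taken from the papers.

```python
# Sketch of the two CoT prompting paradigms: a zero-shot reasoning trigger
# versus few-shot (question, reasoning chain, answer) demonstrations.

def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append a generic reasoning trigger to the question."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot_prompt(demos: list[tuple[str, str, str]], question: str) -> str:
    """Few-shot CoT: prepend demonstrations, each with a reasoning chain."""
    parts = []
    for q, chain, answer in demos:
        parts.append(f"Q: {q}\nA: {chain} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")  # model continues from here
    return "\n\n".join(parts)

demos = [("If I have 3 apples and buy 2 more, how many do I have?",
          "Starting with 3 apples and adding 2 gives 3 + 2 = 5.", "5")]
prompt = few_shot_cot_prompt(demos, "A pen costs $2. How much do 4 pens cost?")
```

Either prompt would then be sent to an LLM; the difference is purely in whether reasoning is elicited by a trigger phrase or demonstrated by example.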

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

It is shown that CoT reasoning is possible even with invalid demonstrations—prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference.

Large Language Models are reasoners with Self-Verification

This work proposes self-verification, a method that uses the conclusion of the CoT as a condition to build a new sample, asks the LLM to re-predict the original conditions that have been masked, and computes an explainable verification score based on the accuracy of those re-predictions.
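The verification loop can be sketched as follows. The `ask_model` stub is a hypothetical placeholder for a real LLM call, and the scoring scheme is a simplified reading of the idea described above, not the paper's exact procedure.

```python
# Sketch of self-verification: rewrite the problem with the candidate
# conclusion asserted and one original condition masked, then score the
# candidate by how often the model recovers the masked condition.

def ask_model(prompt: str) -> str:
    # Placeholder for an LLM call; here it "recovers" the masked condition
    # only when the asserted conclusion is consistent with it.
    return "3" if "X pens" in prompt and "costs $6" in prompt else "?"

def verification_score(question_template: str, masked_value: str,
                       conclusion: str, n_trials: int = 3) -> float:
    """Fraction of trials in which the model re-predicts the masked condition."""
    prompt = question_template.format(conclusion=conclusion)
    hits = sum(ask_model(prompt) == masked_value for _ in range(n_trials))
    return hits / n_trials

# Original condition "3 pens" is masked as X; the conclusion fills in the total.
template = "I bought X pens at $2 each and the total {conclusion}. What is X?"
score = verification_score(template, masked_value="3", conclusion="costs $6")
```

A correct conclusion ("costs $6") lets the model recover X = 3, giving a high score; an inconsistent one would not.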

Towards Reasoning in Large Language Models: A Survey

This survey provides a comprehensive overview of the current state of knowledge on reasoning in large language models, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, and suggestions for future directions.

Self-Prompting Large Language Models for Open-Domain QA

This paper shows that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus and proposes a Self-Prompting framework for LLMs to perform ODQA, eliminating the need for training data and an external knowledge corpus.

A Survey for In-context Learning

This survey summarizes the progress, challenges, and future work in ICL, presents a formal definition of ICL, clarifies its correlation to related studies, and provides potential directions for further research.

SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning

SPT, a semi-parametric prompt tuning method for multitask prompted learning, is proposed; its novel component is a memory bank from which memory prompts are retrieved based on discrete prompts.

Reasoning with Language Model Prompting: A Survey

This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting and introduces research works with comparisons and summaries and provides systematic resources to help beginners.

A Survey of Deep Learning for Mathematical Reasoning

This survey paper reviews the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade, evaluates existing benchmarks and methods, and discusses future research directions.

Multimodal Chain-of-Thought Reasoning in Language Models

This work proposes Multimodal-CoT, a model under 1 billion parameters that incorporates vision features in a decoupled training framework; it outperforms the previous state-of-the-art LLM by 16% on the ScienceQA benchmark and even surpasses human performance.

Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning

This work exploits large language models (LLMs) as decomposers for effective table-based reasoning and proposes a “parsing-execution-filling” strategy to alleviate the hallucination dilemma of the chain of thought by decoupling logic and numerical computation in each step.

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.

Large Language Models are Zero-Shot Reasoners

Experimental results demonstrate that the Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.

STaR: Bootstrapping Reasoning With Reasoning

STaR is a technique that iteratively leverages a small number of rationale examples and a large dataset without rationales to bootstrap the ability to perform successively more complex reasoning, letting a model improve itself by learning from its own generated reasoning.

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Experiments on symbolic manipulation, compositional generalization and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin.
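The control flow of least-to-most prompting can be sketched as below: first ask for a decomposition into easier subquestions, then answer them in order, feeding each subquestion and its answer back into the context for the next one. The `decompose` and `solve` stubs are hypothetical stand-ins for LLM calls.

```python
# Sketch of least-to-most prompting: decompose, then solve subproblems
# sequentially, accumulating earlier answers into the prompt context.

def decompose(question: str) -> list[str]:
    # Placeholder for an LLM decomposition call.
    return ["How much do 4 pens cost at $2 each?",
            "How much change from $10 after buying 4 pens at $2 each?"]

def solve(context: str, subquestion: str) -> str:
    # Placeholder for an LLM answering call; here a tiny lookup.
    answers = {"How much do 4 pens cost at $2 each?": "$8",
               "How much change from $10 after buying 4 pens at $2 each?": "$2"}
    return answers[subquestion]

def least_to_most(question: str) -> str:
    context = f"Q: {question}"
    answer = ""
    for sub in decompose(question):
        answer = solve(context, sub)
        context += f"\nSubquestion: {sub}\nAnswer: {answer}"
    return answer  # the answer to the final, hardest subproblem

final = least_to_most("I pay $10 for 4 pens at $2 each. What is my change?")
```

The key design point is that later subquestions see the answers to earlier ones, which is what lets the method generalize to problems harder than the in-context examples.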

Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

It is found that models can learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively “good” prompts, and instruction-tuned models often produce good predictions with irrelevant and misleading prompts even at zero shots.

Rationale-Augmented Ensembles in Language Models

It is demonstrated that rationale-augmented ensembles achieve more accurate results than existing prompting approaches, including standard prompting without rationales and rationale-based chain-of-thought prompting, while simultaneously improving the interpretability of model predictions through the associated rationales.

Self-Consistency Improves Chain of Thought Reasoning in Language Models

A simple ensemble strategy, self-consistency, that robustly improves accuracy across a variety of language models and model scales without the need for additional training or auxiliary models is explored.
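Self-consistency reduces to a majority vote over the final answers of several sampled reasoning paths. A minimal sketch, with made-up sample chains standing in for diverse LLM outputs:

```python
# Self-consistency in miniature: sample several reasoning chains, keep only
# the final answers, and return the most common one.
from collections import Counter

def extract_answer(chain: str) -> str:
    """Take the text after the last 'answer is' marker as the final answer."""
    return chain.rsplit("answer is", 1)[-1].strip(" .")

def self_consistency(chains: list[str]) -> str:
    votes = Counter(extract_answer(c) for c in chains)
    return votes.most_common(1)[0][0]

samples = [
    "3 + 2 = 5, so the answer is 5.",
    "Adding 3 and 2 gives 5; the answer is 5.",
    "3 * 2 = 6, so the answer is 6.",  # a faulty reasoning path is outvoted
]
majority = self_consistency(samples)
```

No extra training or auxiliary model is involved; the only requirement is the ability to draw multiple samples and parse out a final answer.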

On the Advance of Making Language Models Better Reasoners

This paper conducts extensive experiments using the latest language model code-davinci-002 and demonstrates that DiVeRSe achieves new state-of-the-art performance on six of eight reasoning benchmarks, outperforming the PaLM model with 540B parameters.

Training language models to follow instructions with human feedback

The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, with improvements in truthfulness and reductions in toxic output generation, while incurring minimal performance regressions on public NLP datasets.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

This paper shows that ground truth demonstrations are in fact not required and that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.