Self-Consistency Improves Chain of Thought Reasoning in Language Models

@article{Wang2022SelfConsistencyIC,
  title={Self-Consistency Improves Chain of Thought Reasoning in Language Models},
  author={Xuezhi Wang and Jason Wei and Dale Schuurmans and Quoc Le and Ed Chi and Denny Zhou},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.11171}
}
We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models. The idea is to sample a diverse set of reasoning paths from a language model via chain-of-thought prompting and then return the most consistent final answer in the set. We evaluate self-consistency on a range of arithmetic and commonsense reasoning benchmarks, and find that it robustly improves accuracy across a variety of language models and model scales without the…
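As a concrete illustration of the strategy described in the abstract, the sketch below samples several chain-of-thought completions and majority-votes over their final answers. The `sample_chain_of_thought` function and the answer-extraction regex are hypothetical stand-ins for whatever model API and answer format are actually used; this is a minimal sketch of the idea, not the paper's implementation.

```python
import re
from collections import Counter

def self_consistency(question, sample_chain_of_thought, num_paths=10):
    """Sample diverse reasoning paths and return the most consistent final answer.

    `sample_chain_of_thought(question)` is a hypothetical stand-in for a call to
    a language model with a chain-of-thought prompt and temperature sampling;
    it is assumed to return the full generated reasoning text.
    """
    answers = []
    for _ in range(num_paths):
        reasoning = sample_chain_of_thought(question)
        # Assume each path ends with a phrase like "the answer is 42" and
        # extract the final number; real answer parsing is task-specific.
        match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", reasoning)
        if match:
            answers.append(match.group(1))
    if not answers:
        return None
    # Marginalize over reasoning paths by taking the most frequent final answer.
    return Counter(answers).most_common(1)[0][0]
```

For example, if the ten sampled paths yield the extracted answers ['18', '18', '26', '18', ...], the most frequent answer '18' is returned, even though no single path is trusted on its own.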

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.

Large Language Models Can Self-Improve

This work uses a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and conducts ablation studies showing that training on reasoning is critical for self-improvement.

Complexity-Based Prompting for Multi-Step Reasoning

This work proposes complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning that achieves substantially better performance on math word reasoning tasks over strong baselines and demonstrates the robustness of the methods under format perturbation and distribution shift.

On the Advance of Making Language Models Better Reasoners

This paper conducts extensive experiments using the latest language model code-davinci-002 and demonstrates that DIVERSE can achieve new state-of-the-art performance on six out of eight reasoning benchmarks, outperforming the PaLM model with 540B parameters.

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct overcomes prevalent issues of hallucination and error propagation in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generating human-like task-solving trajectories that are more interpretable than baselines without reasoning traces.
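The interleaved reason-and-act pattern this summary refers to can be sketched roughly as below. Here `llm` and `wikipedia_search` are hypothetical stubs for the language model and the simple Wikipedia API, and the `Search[...]`/`Finish[...]` action syntax is only one plausible convention, not the paper's exact format.

```python
def react(question, llm, wikipedia_search, max_steps=6):
    """Rough sketch of a ReAct-style loop: think, act, observe, repeat.

    `llm(prompt)` and `wikipedia_search(query)` are hypothetical stubs for a
    language-model call and a Wikipedia lookup; they are not a real API.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model produces a thought followed by an action such as
        # Search[query] or Finish[answer].
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        if "Search[" in step:
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            observation = wikipedia_search(query)
            transcript += f"Observation: {observation}\n"
    return None
```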

Solving Math Word Problem via Cooperative Reasoning induced Language Models

A cooperative reasoning-induced PLM for solving MWPs is developed, resulting in a human-like reasoning architecture with System 1 as the generator and System 2 as the verifier, which yields decent improvements over state-of-the-art methods, with up to a 9.8% increase over the best baselines.

Large Language Models are Zero-Shot Reasoners

Experimental results demonstrate that the Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetic, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.
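The single prompt template referred to here is the trigger phrase "Let's think step by step." The two-stage sketch below shows how a reasoning pass and an answer-extraction pass can be chained; `llm` is a hypothetical model-call stub, and the exact prompt wording is an illustrative assumption rather than a verbatim reproduction of the paper's templates.

```python
def zero_shot_cot(question, llm):
    """Two-stage zero-shot chain-of-thought prompting (sketch).

    `llm(prompt)` is a hypothetical stub for a text-completion call.
    """
    # Stage 1: elicit step-by-step reasoning with a single trigger phrase.
    reasoning = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: feed the reasoning back and ask for the final answer.
    answer = llm(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()
```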

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Experiments on symbolic manipulation, compositional generalization and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin.

ThinkSum: Probabilistic reasoning over sets using large language models

It is argued that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, it is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs.

Solving math word problems with process- and outcome-based feedback

It is found that pure outcome-based supervision produces similar final-answer error rates with less label supervision, but that producing correct reasoning steps requires process-based supervision or supervision from learned reward models that emulate process-based feedback.
...

References

Showing 1-10 of 75 references

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.

Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning

This work seeks a lightweight, training-free means of improving existing System 1-like sequence models by adding System 2-inspired logical reasoning and shows that this approach can increase the coherence and accuracy of neurally-based generations.

Large Language Models are Zero-Shot Reasoners

Experimental results demonstrate that the Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetic, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.

Injecting Numerical Reasoning Skills into Language Models

This work shows that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs, by generating large amounts of data, and training in a multi-task setup.

The Unreliability of Explanations in Few-Shot In-Context Learning

A framework for calibrating model predictions based on the reliability of explanations is presented and it is shown that explanations judged as good by humans—those that are logically consistent with the input and the prediction—usually indicate more accurate predictions.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Evaluation of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, finds that model performance and calibration both improve with scale, but are poor in absolute terms.

Measuring and Improving Consistency in Pretrained Language Models

The creation of PARAREL, a high-quality resource of cloze-style query English paraphrases, and analysis of the representational spaces of PLMs suggest that they have a poor structure and are currently not suitable for representing knowledge in a robust way.

Diversifying Content Generation for Commonsense Reasoning with Mixture of Knowledge Graph Experts

This paper proposes MoKGE, a novel method that diversifies the generative reasoning by a mixture of expert (MoE) strategy on commonsense knowledge graphs (KG) to encourage various generation outputs.

Training Verifiers to Solve Math Word Problems

It is demonstrated that verification significantly improves performance on GSM8K, and there is strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
...