Self-Consistency Improves Chain of Thought Reasoning in Language Models

@article{Wang2022SelfConsistencyIC,
  title={Self-Consistency Improves Chain of Thought Reasoning in Language Models},
  author={Xuezhi Wang and Jason Wei and Dale Schuurmans and Quoc Le and Ed Chi and Denny Zhou},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.11171}
}
We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models. The idea is to sample a diverse set of reasoning paths from a language model via chain of thought prompting, and then return the most consistent final answer in the set. We evaluate self-consistency on a range of arithmetic and commonsense reasoning benchmarks, and find that it robustly improves accuracy across a variety of language models and model scales without the…
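
As a rough sketch of the procedure described above, the following Python samples several chain-of-thought completions and majority-votes over their final answers; generate and extract_answer are hypothetical helpers standing in for a sampling call to the language model and an answer parser, not an API from the paper.

from collections import Counter

def self_consistency(prompt, generate, extract_answer, n_samples=10, temperature=0.7):
    # Sample independent reasoning paths at non-zero temperature,
    # then return the most frequent (most consistent) final answer.
    answers = []
    for _ in range(n_samples):
        reasoning_path = generate(prompt, temperature=temperature)  # one chain-of-thought sample
        answers.append(extract_answer(reasoning_path))              # parse its final answer
    return Counter(answers).most_common(1)[0][0]

In this sketch, marginalizing over reasoning paths reduces to a simple plurality vote over final answers, which is why no additional training or auxiliary model is required.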

Chain of Thought Prompting Elicits Reasoning in Large Language Models

TLDR
Experiments show that chain-of-thought prompting enables sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.

On the Advance of Making Language Models Better Reasoners

TLDR
This paper conducts extensive experiments using the latest language model code-davinci-002 and demonstrates that DiVeRSe can achieve new state-of-the-art performance on six out of eight reasoning benchmarks, outperforming the PaLM model with 540B parameters.

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

TLDR
Experiments on symbolic manipulation, compositional generalization and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin.
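
For orientation, a minimal sketch of the decompose-then-solve loop that least-to-most prompting is built around, assuming hypothetical generate and parse_subquestions helpers (the actual paper drives both stages with few-shot exemplar prompts, omitted here):

def least_to_most(question, generate, parse_subquestions):
    # Stage 1: ask the model to break the problem into simpler subquestions.
    decomposition = generate(f"Decompose into simpler subquestions: {question}")
    context, answer = question, ""
    # Stage 2: solve the subquestions in order, feeding each answer back into the context.
    for sub_q in parse_subquestions(decomposition):
        answer = generate(f"{context}\nQ: {sub_q}\nA:")
        context = f"{context}\nQ: {sub_q}\nA: {answer}"
    return answer  # the final (hardest) subquestion answers the original problem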

Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change)

TLDR
This work proposes an extensible assessment framework to test the abilities of LLMs on a central aspect of human intelligence, namely reasoning about actions and change, and provides multiple test cases that are more involved than any of the previously established reasoning benchmarks.

Large Language Models are Zero-Shot Reasoners

TLDR
Experimental results demonstrate that Zero-shot-CoT, using the same single prompt template, substantially outperforms standard zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.
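
As an illustration of the single-template idea, here is a simplified two-stage sketch (the trigger phrase is the one used by Zero-shot-CoT; the answer-extraction wording and the generate helper are assumptions for this example):

def zero_shot_cot(question, generate):
    # Stage 1: append one fixed trigger phrase to elicit a reasoning chain.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: prompt again to extract the final answer from the generated reasoning.
    return generate(f"{reasoning_prompt}{reasoning}\nTherefore, the answer is")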

Language models show human-like content effects on reasoning

TLDR
This work hypothesized that language models would show human-like content effects on abstract reasoning problems, and explored this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task.

Solving Quantitative Reasoning Problems with Language Models

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning.

Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations

TLDR
This work develops Maieutic Prompting, which infers a correct answer to a question even from the noisy and inconsistent generations of an LM, and improves robustness in inference while providing interpretable rationales.

Inferring Implicit Relations with Language Models

TLDR
This work investigates why current models struggle with implicit reasoning question answering (QA) tasks, by decoupling inference of reasoning steps from their execution, and suggests that the bottleneck for answering implicit reasoning questions is in the ability of language models to retrieve and reason over information rather than to plan an accurate reasoning strategy.

Rationale-Augmented Ensembles in Language Models

TLDR
It is demonstrated that rationale-augmented ensembles achieve more accurate results than existing prompting approaches, including standard prompting without rationales and rationale-based chain-of-thought prompting, while the associated rationales also improve the interpretability of model predictions.

References

SHOWING 1-10 OF 61 REFERENCES

Chain of Thought Prompting Elicits Reasoning in Large Language Models

TLDR
Experiments show that chain-of-thought prompting enables sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.

Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning

TLDR
This work seeks a lightweight, training-free means of improving existing System 1-like sequence models by adding System 2-inspired logical reasoning and shows that this approach can increase the coherence and accuracy of neurally-based generations.

Injecting Numerical Reasoning Skills into Language Models

TLDR
This work shows that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs, by generating large amounts of data, and training in a multi-task setup.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

TLDR
This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TLDR
Evaluation of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, finds that model performance and calibration both improve with scale, but are poor in absolute terms.

Measuring and Improving Consistency in Pretrained Language Models

TLDR
The creation of PARAREL, a high-quality resource of cloze-style query English paraphrases, and analysis of the representational spaces of PLMs suggest that they have a poor structure and are currently not suitable for representing knowledge in a robust way.

Diversifying Content Generation for Commonsense Reasoning with Mixture of Knowledge Graph Experts

TLDR
This paper proposes MoKGE, a novel method that diversifies the generative reasoning by a mixture of expert (MoE) strategy on commonsense knowledge graphs (KG) to encourage various generation outputs.

Training Verifiers to Solve Math Word Problems

TLDR
It is demonstrated that verification significantly improves performance on GSM8K, and there is strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.

Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

TLDR
It is found that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily customize foundation AI models to many diverse downstream applications.

LaMDA: Language Models for Dialog Applications

TLDR
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
...