R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents

Daniel D. Johnson, Daniel Tarlow, Christian J. Walder

Large language models show impressive results at predicting structured text such as code, but they also commonly introduce errors and hallucinations in their output. When used to assist software developers, these models may make mistakes that users must go back and fix, or worse, introduce subtle bugs that users miss entirely. We propose Randomized Utility-driven Synthesis of Uncertain REgions (R-U-SURE), an approach for building uncertainty-aware suggestions based on a decision-theoretic model…



Training language models to follow instructions with human feedback

The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, with improvements in truthfulness, reductions in toxic output generation, and minimal performance regressions on public NLP datasets.

Perfection Not Required? Human-AI Partnerships in Code Translation

This study highlights how UI features such as confidence highlighting and alternate translations help software engineers work with and better understand generative NMT models.

Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis

A wide range of popular options for each consideration is compared on three prevalent NLP classification tasks and under domain shift, forming a holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline.

Fine-Tuning Language Models from Human Preferences

This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.

Large Language Models Can Self-Improve

This work uses a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and conducts ablation studies showing that training on reasoning is critical for self-improvement.
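The self-consistency step mentioned above can be sketched as a majority vote over final answers extracted from independently sampled chains of thought. This is a minimal illustration, not the paper's implementation; the function name is hypothetical and answer extraction from rationales is assumed to have already happened.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over final answers from sampled reasoning paths.

    sampled_answers: list of answer strings, one per sampled
    chain-of-thought. Returns the most frequent (highest-confidence)
    answer.
    """
    return Counter(sampled_answers).most_common(1)[0][0]

# e.g. three sampled rationales agreeing on "18" beats one outlier:
self_consistency(["18", "18", "26"])
```

In the self-improvement setting, the majority answer (and the rationales that led to it) would then be kept as pseudo-labeled training data.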

Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation

This work shows on three language pairs that MBR decoding can improve upon beam search with moderate computation, and that strategies such as beam search and nucleus sampling can be used to construct hypothesis spaces efficiently.
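Minimum Bayes risk (MBR) decoding selects the hypothesis with the highest expected utility against samples drawn from the model, rather than the single most probable sequence. The sketch below is illustrative only: the function names are hypothetical, and a toy unigram-overlap utility stands in for the BLEU-style utilities used in practice.

```python
from collections import Counter

def unigram_overlap(hyp, ref):
    # Toy utility: fraction of hypothesis tokens also present in the
    # reference (a crude stand-in for BLEU or similar metrics).
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / max(len(hyp.split()), 1)

def mbr_decode(hypotheses, samples, utility=unigram_overlap):
    # Score each candidate by its average utility against model
    # samples (pseudo-references), and return the best one.
    return max(
        hypotheses,
        key=lambda h: sum(utility(h, s) for s in samples) / len(samples),
    )
```

Here the hypothesis space and the pseudo-reference samples may come from the same model, e.g. via beam search or nucleus sampling as the summary above notes.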

Learning to Complete Code with Sketches

GRAMMFORMER, a Transformer-based model that guides code generation by the programming language grammar, is developed and compared to a variety of more standard sequence models, measuring success as generating completions that match long outputs with as few holes as possible.

Plex: Towards Reliability using Pretrained Large Model Extensions

The reliability of models is explored, where a reliable model is defined as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty, robust generalization, and adaptation.

Investigating Explainability of Generative AI for Code through Scenario-based Design

This work explores explainability needs for GenAI for code and demonstrates how human-centered approaches can drive the technical development of XAI in novel domains.

DiverseNet: When One Right Answer is not Enough

This work introduces a simple method for training a neural network, which enables diverse structured predictions to be made for each test-time query, and compares favorably to methods that seek diversity through an ensemble of networks.