RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
TLDR
It is found that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. An empirical assessment of several controllable generation methods finds that while data- or compute-intensive methods are more effective at steering away from toxicity than simpler solutions, no current method is failsafe against neural toxic degeneration.