Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

@article{Hessel2022DoAL,
  title={Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest},
  author={Jack Hessel and Ana Marasovi{\'c} and Jena D. Hwang and Lillian Lee and Jeff Da and Rowan Zellers and Robert Mankoff and Yejin Choi},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.06293}
}
We challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, we develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. We investigate vision-and-language…

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

This work introduces WinoGAViL: an online game of vision-and-language associations used as a dynamic evaluation benchmark, and indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more.

FLUTE: Figurative Language Understanding through Textual Explanations

FLUTE, a dataset of 9,000 figurative NLI instances with explanations, spanning four categories: Sarcasm, Simile, Metaphor, and Idioms, is released, and it is shown how utilizing GPT-3 in conjunction with human annotators can aid in scaling up the creation of datasets even for such complex linguistic phenomena as figurative language.

Crowd Score: A Method for the Evaluation of Jokes using Large Language Model AI Voters as Judges

The Crowd Score is presented, a novel method to assess the funniness of jokes using large language models (LLMs) as AI judges. It shows that few-shot prompting leads to better results than zero-shot prompting for the voting question, and that aggressive and self-defeating voters tend to find more of the jokes funny in a set of aggressive/self-defeating jokes.

References

Neural Joking Machine : Humorous image captioning

A system is constructed that generates image captions intended to draw a "laugh", building on image captioning as proposed in the computer vision field, and the Funny Score, which flexibly assigns weights according to an evaluation database, is proposed.

Reframing Human-AI Collaboration for Generating Free-Text Explanations

This work creates a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop and demonstrates that acceptability is partially correlated with various fine-grained attributes of explanations.

FLUTE: Figurative Language Understanding and Textual Explanations

FLUTE is released, a dataset of 8,000 figurative NLI instances with explanations, spanning three categories: Sarcasm, Simile, and Metaphor, and it is shown how utilizing GPT-3 in conjunction with human experts can aid in scaling up the creation of datasets even for such complex linguistic phenomena as figurative language.

Punny Captions: Witty Wordplay in Image Descriptions

In a Turing test style evaluation, people find the image descriptions generated by the model to be slightly wittier than human-written witty descriptions when the human is subject to similar constraints as the model regarding word usage and style.

We are Humor Beings: Understanding and Predicting Visual Humor

This work analyzes the humor manifested in abstract scenes and designs computational models for it, and models two tasks that are believed to demonstrate an understanding of some aspects of visual humor.

Humor Knowledge Enriched Transformer for Understanding Multimodal Humor

This paper proposes the Humor Knowledge enriched Transformer, which can capture the gist of a multimodal humorous expression by integrating the preceding context and external knowledge, and incorporates humor-centric external knowledge into the model by capturing the ambiguity and sentiment present in the language.

Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!

This work introduces a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task, and recommends that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.

ColBERT: Using BERT Sentence Embedding for Humor Detection

A novel approach for detecting humor in short texts using BERT sentence embedding that sends embedding outputs as input to a two-layered neural network that predicts the target value.

Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest

It is shown that negative sentiment, human-centeredness, and lexical centrality most strongly match the funniest captions, followed by positive sentiment.

Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms

A novel multimodal self-attention-based model is developed that outperforms currently prevalent models on the task of adding laughter tracks to situational comedies.
...