MERLOT: Multimodal Neural Script Knowledge Models
This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
This work reports the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references.
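The metric itself is a rescaled cosine similarity between CLIP embeddings of the image and the candidate caption. A minimal sketch, assuming the embeddings have already been extracted from CLIP (the extraction step is omitted and the variable names are illustrative):

```python
import numpy as np

def clipscore(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cos(image, caption), 0).

    Both arguments are assumed to be CLIP embeddings for one image and one
    candidate caption; running the CLIP encoders is not shown here.
    """
    cos = np.dot(image_emb, caption_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)
    )
    # Rescale and clip at zero, per the paper's formulation.
    return w * max(float(cos), 0.0)
```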
Something’s Brewing! Early Prediction of Controversy-causing Posts from Discussion Features
Using data from several different communities on reddit.com, this work predicts the ultimate controversiality of posts, leveraging features drawn from both the textual content and the tree structure of the early comments that initiate the discussion.
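As an illustration of the general recipe (not the paper's exact feature set), one can concatenate text features from the post with simple structural features of its early comment tree and fit a standard binary classifier:

```python
# Illustrative sketch only: the specific features and classifier below are
# assumptions, not the paper's implementation.
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def tree_features(comment_tree):
    """Toy structural features for an early discussion tree, given as a
    list of (comment_id, parent_id) edges in top-down order."""
    depths = {}
    for cid, pid in comment_tree:
        depths[cid] = depths.get(pid, 0) + 1  # root's parent defaults to depth 0
    n = len(comment_tree)
    max_depth = max(depths.values(), default=0)
    return [n, max_depth, n / (max_depth + 1)]  # size, depth, branching proxy

def fit_controversy_classifier(posts, trees, labels):
    vec = TfidfVectorizer(max_features=5000)
    X_text = vec.fit_transform(posts)                       # textual content
    X_tree = sp.csr_matrix(np.array([tree_features(t) for t in trees]))
    X = sp.hstack([X_text, X_tree])                         # combine both views
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return vec, clf
```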
A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions
It is found that unstated background information is better explained by visual features, whereas fine-grained distinctions are disambiguated more easily via ASR tokens.
Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets
This work gives an algorithm for automatically computing the visual concreteness of words and topics within multimodal datasets, and shows that concreteness predicts the capacity of machine learning algorithms to learn textual/visual relationships.
Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents
It is found that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.
Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!
This work introduces a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether cross-modal interactions improve performance for a given model on a given task, and recommends that researchers in multimodal machine learning report not only the performance of unimodal baselines but also the EMAP of their best-performing model.
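The projection replaces the model's output on each paired example with its best additive (text-only plus image-only) approximation, estimated empirically over all cross pairs. A minimal sketch for scalar outputs (multi-class logits work the same way, per class):

```python
import numpy as np

def emap(pairwise_logits: np.ndarray) -> np.ndarray:
    """Empirical multimodally-additive function projection (EMAP).

    pairwise_logits[i, j] holds the model's output for text input i paired
    with visual input j; computing this N x N grid requires running the model
    on all cross pairs, with the original paired examples on the diagonal.
    Returns projected outputs for the diagonal examples, with all
    cross-modal interactions removed.
    """
    text_means = pairwise_logits.mean(axis=1)   # per-text average over images
    image_means = pairwise_logits.mean(axis=0)  # per-image average over texts
    grand_mean = pairwise_logits.mean()
    # Projected output for example i: f^(i) = t_mean[i] + v_mean[i] - grand_mean
    return text_means + image_means - grand_mean
```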
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
It is demonstrated that careful prompt engineering and a separately trained critic model can selectively distill high-quality causal commonsense from GPT-3, a general language model, yielding a neural commonsense model that surpasses the teacher's commonsense capabilities despite being 100x smaller.
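A hedged sketch of the generate-then-filter loop described above; `generate_with_lm` and `critic_score` are hypothetical stand-ins for prompting the teacher LM and scoring with the trained critic, not real API calls:

```python
def distill_corpus(prompts, generate_with_lm, critic_score, threshold=0.5):
    """Sample commonsense statements from the large teacher LM, keep only
    those the critic judges acceptable, and return the filtered corpus
    used to train the much smaller student model."""
    kept = []
    for prompt in prompts:
        # Over-generate candidates from the teacher for each prompt.
        for candidate in generate_with_lm(prompt, num_samples=10):
            # The critic filters out low-quality generations.
            if critic_score(candidate) >= threshold:
                kept.append(candidate)
    return kept
```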
Image Representations and New Domains in Neural Image Captioning
By varying image representation quality produced by a convolutional neural network, it is found that a state-of-the-art neural captioning algorithm is able to produce quality captions even when provided with surprisingly poor image representations.
Reframing Human-AI Collaboration for Generating Free-Text Explanations
- Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark O. Riedl, Yejin Choi
- Computer Science, NAACL
- 16 December 2021
This work creates a pipeline that combines GPT-3 with a supervised filter incorporating binary acceptability judgments from humans in the loop, and demonstrates that acceptability is partially correlated with various fine-grained attributes of explanations.