A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

Angela S. Lin, Sudha Rao, Asli Çelikyilmaz, Elnaz Nouri, Chris Brockett, Debadeepta Dey, Bill Dolan
Many high-level procedural tasks can be decomposed into sequences of instructions that vary in their order and choice of tools. In the cooking domain, the web offers many partially-overlapping text and video recipes (i.e., procedures) that describe how to make the same dish (i.e., a high-level task). Aligning instructions for the same dish across different sources can yield descriptive visual explanations that are far richer semantically than conventional textual instructions, providing…
Aligning Actions Across Recipe Graphs
A novel and fully-parsed English recipe corpus, ARA (Aligned Recipe Actions), is presented, which annotates correspondences between individual actions across similar recipes with the goal of capturing information that is left implicit but is needed for accurate recipe understanding.
Substance over Style: Document-Level Targeted Content Transfer
A novel model is proposed, based on the generative pre-trained language model GPT-2 and trained on a large number of roughly-aligned recipe pairs, that outperforms existing methods by generating coherent and diverse rewrites that obey the constraint while remaining close to the original document.
Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions
A hierarchical model generalizes instructional knowledge from large-scale text corpora and transfers that knowledge to video; it recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language.
Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
It has become increasingly common for people to share cooking recipes on the Internet. Along with the increase in the number of shared recipes, there have been corresponding increases in…
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
A comprehensive survey of the emerging area of multimodal co-learning, which has not yet been explored in its entirety, presenting a comprehensive taxonomy of multimodal co-learning based on the challenges addressed and the associated implementations.


Weakly-Supervised Alignment of Video with Text
This paper proposes a method for aligning the two modalities of video and text, i.e., automatically providing a time (frame) stamp for every sentence. The problem is formulated as an integer quadratic program whose continuous convex relaxation is solved using an efficient conditional gradient algorithm.
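The conditional gradient (Frank-Wolfe) method named in the entry above can be illustrated on a toy quadratic over the probability simplex. This is a generic sketch of the algorithm only, not the paper's alignment formulation; the matrix `Q`, the step-size schedule, and the iteration count are illustrative assumptions:

```python
import numpy as np

def frank_wolfe_simplex(Q, steps=2000):
    """Minimize x^T Q x over the probability simplex with the
    conditional gradient (Frank-Wolfe) method."""
    n = Q.shape[0]
    x = np.full(n, 1.0 / n)              # start at the simplex center
    for k in range(steps):
        grad = (Q + Q.T) @ x             # gradient of the quadratic
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0         # linear oracle returns a simplex vertex
        gamma = 2.0 / (k + 2)            # classic diminishing step size
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

# Toy objective: minimize x1^2 + 10*x2^2 subject to x1 + x2 = 1, x >= 0.
Q = np.diag([1.0, 10.0])
x = frank_wolfe_simplex(Q)
```

Because each update is a convex combination of the current point and a vertex, the iterate never leaves the simplex, which is why the method suits relaxations of combinatorial alignment problems.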
COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis
A simple yet effective method is proposed to capture the dependencies among different steps; it can be easily plugged into conventional proposal-based action detection methods for localizing important steps in instructional videos.
Unsupervised Learning from Narrated Instruction Videos
A new unsupervised learning approach takes advantage of the complementary nature of the input video and the associated narration to solve two clustering problems, one in text and one in video; it can automatically discover the main steps needed to achieve the task and locate those steps in the input videos.
Towards Automatic Learning of Procedures From Web Instructional Videos
A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.
Cross-Task Weakly Supervised Learning From Instructional Videos
The experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that the component model can parse previously unseen tasks by virtue of its compositionality.
Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
It is demonstrated that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic.
YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension
This work introduces “YouMakeup”, a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain, and proposes two groups of tasks, generation tasks and visual question answering, from different aspects.
Unsupervised Alignment of Actions in Video with Text Descriptions
A two-step process is proposed that first creates a high-level action feature codebook with temporally consistent motions, and then applies an unsupervised alignment algorithm over the action codewords and verbs in the language to identify individual activities.
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
To align movies and books, a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book are proposed.
Unsupervised Alignment of Natural Language Instructions with Video Segments
An unsupervised learning algorithm is proposed for automatically inferring the mappings between English nouns and corresponding video objects, using two generative models that are closely related to the HMM and IBM Model 1 word alignment models used in statistical machine translation.
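Several of the entries above build on IBM Model 1-style word alignment, so a minimal sketch of Model 1 expectation-maximization may make the connection concrete. The toy instruction/video-object pairs, variable names, and iteration count below are illustrative assumptions, not the cited paper's data or implementation:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate translation probabilities t(target | source) with
    IBM Model 1 expectation-maximization over parallel pairs."""
    # Uniform initialization over the target vocabulary.
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(tgt, src)
        total = defaultdict(float)   # normalizer per source word
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: distribute tw's mass over candidate source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float,
                        {(tw, sw): c / total[sw] for (tw, sw), c in count.items()})
    return t

# Toy "instruction word / video object" pairs in the spirit of the alignment task.
pairs = [
    (["cut", "tomato"], ["knife", "tomato_obj"]),
    (["cut", "onion"], ["knife", "onion_obj"]),
    (["wash", "tomato"], ["sink", "tomato_obj"]),
]
t = ibm_model1(pairs)
```

After a few EM iterations, "cut" concentrates its probability on "knife" while "tomato" concentrates on "tomato_obj", since those pairings co-occur consistently across the parallel data.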