Corpus ID: 233219638

Visual Goal-Step Inference using wikiHow

Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue: the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images…
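The multiple-choice setup described in the abstract can be sketched as follows: embed the textual goal and the four candidate images in a shared space, then pick the candidate most similar to the goal. This is a minimal illustration only; the encoders below are hypothetical stand-ins (random vectors), not the paper's actual models.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_step(goal_vec, candidate_vecs):
    """VGSI-style selection: return the index of the candidate image
    embedding most similar to the goal embedding."""
    scores = [cosine(goal_vec, c) for c in candidate_vecs]
    return int(np.argmax(scores))

# Toy demonstration with synthetic embeddings (assumption: a real system
# would produce these with trained text and image encoders).
rng = np.random.default_rng(0)
goal = rng.normal(size=16)
candidates = [rng.normal(size=16) for _ in range(4)]
# Make candidate 2 deliberately aligned with the goal; the others are noise.
candidates[2] = goal + 0.1 * rng.normal(size=16)
print(choose_step(goal, candidates))  # prints 2
```

The design choice here mirrors standard visual-semantic embedding evaluation: ranking a small candidate set by similarity to a query, rather than classifying over a fixed label space.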
Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval
This work proposes a novel system that induces schemata from web videos and generalizes them to capture unseen tasks with the goal of improving video retrieval performance, and demonstrates that the schemata induced by the system are better than those generated by other models.
Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals
This work benchmarks models’ capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations, and proposes sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images.
Learning Household Task Knowledge from WikiHow Descriptions
A model is proposed to learn embeddings for tasks, as well as for the individual steps needed to solve them, based on WikiHow articles, such that the embeddings are predictive of both step relevance and step ordering.
Learning Procedures from Text: Codifying How-to Procedures in Deep Neural Networks
This paper proposes an end-to-end neural network architecture that selectively learns important procedure-specific relationships and outperforms existing entity-relationship extraction algorithms.
Recent Trends in Natural Language Understanding for Procedural Knowledge
This paper provides an overview of work on procedural knowledge understanding, and on information extraction, acquisition, and representation for procedures, to promote discussion and a better understanding of procedural knowledge applications and future challenges.
Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge
A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image is presented.
Joint Slot Filling and Intent Detection via Capsule Neural Networks
A capsule-based neural network model is proposed which accomplishes slot filling and intent detection via a dynamic routing-by-agreement schema, and a re-routing schema is proposed to further synergize the slot filling performance using the inferred intent representation.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Connecting the Dots: Event Graph Schema Induction with Path Language Modeling
This work proposes a new Event Graph Schema, where two event types are connected through multiple paths involving entities that fill important roles in a coherent story, and introduces Path Language Model, an auto-regressive language model trained on event-event paths, to select salient and coherent paths to probabilistically construct these graph schemas.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data and introduces a structured max-margin objective that allows this model to explicitly associate fragments across modalities.
DeViSE: A Deep Visual-Semantic Embedding Model
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.