Experiments show the self-attention model greatly outperforms others, creating a strong baseline for future research in the general task of text infilling, where the input text can have an arbitrary number of portions to be filled, each of which may require an arbitrary unknown number of tokens.
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
This paper first enriches the navigation data by transferring the style of instructions generated by the Google Maps API, then pre-trains the navigator with the augmented external outdoor navigation dataset, significantly outperforming the baseline models on the outdoor VLN task.
Texar: A Modularized, Versatile, and Extensible Toolbox for Text Generation
This work introduces Texar, an open-source toolkit aiming to support the broad set of text generation tasks by abstracting the common patterns underlying the diverse tasks and methodologies, creating a library of highly reusable modules and functionalities, and enabling arbitrary model architectures and various algorithmic paradigms.
Diagnosing Vision-and-Language Navigation: What Really Matters
This work conducts a series of diagnostic experiments to unveil agents’ focus during navigation and shows that indoor navigation agents refer to both object and direction tokens when making decisions, and that Transformer-based agents acquire a better cross-modal understanding of objects and display stronger numerical reasoning ability than non-Transformer-based agents.
Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations
- Wanrong Zhu, Xin Wang, P. Narayana, Kazoo Sone, Sugato Basu, William Yang Wang
- Computer Science, EMNLP
- 7 October 2020
A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings. To do this, it is critical to ensure that…
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Experiments demonstrate that adding imagination with the proposed ImaginE displays great potential in introducing multi-modal information into NLG evaluation, and improves existing automatic metrics’ correlations with human similarity judgments in many circumstances.
Neuro-Symbolic Procedural Planning with Commonsense Prompting
Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it…
End-to-end Dense Video Captioning as Sequence Generation
- Wanrong Zhu, Bo Pang, Ashish V. Thapliyal, William Yang Wang, Radu Soricut
- Computer Science, ArXiv
- 18 April 2022
This work shows how to model the two subtasks of dense video captioning jointly as one sequence generation task, and simultaneously predict the events and the corresponding descriptions.
Imagination-Augmented Natural Language Understanding
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models to solve natural language understanding tasks from a novel learning perspective—imagination-augmented cross-modal understanding.
Neuro-Symbolic Causal Language Planning with Commonsense Prompting
A Neuro-Symbolic Causal Language Planner (CLAP) is proposed that elicits procedural knowledge from the LLMs with commonsense-infused prompting to solve the language planning problem in a zero-shot manner.