Imagine This! Scripts to Compositions to Videos

@inproceedings{Gupta2018ImagineTS,
  title={Imagine This! Scripts to Compositions to Videos},
  author={Tanmay Gupta and Dustin Schwenk and Ali Farhadi and Derek Hoiem and Aniruddha Kembhavi},
  booktitle={ECCV},
  year={2018}
}
Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. [...]
Key Method: Our contributions include sequential training of components of CRAFT while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate CRAFT on semantic fidelity to caption, composition consistency, and visual quality. CRAFT outperforms direct pixel…
StoryGAN: A Sequential Conditional GAN for Story Visualization
TLDR: A new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework is proposed, which outperformed state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.
Improving Generation and Evaluation of Visual Stories via Semantic Consistency
TLDR: A number of improvements to prior modeling approaches are presented, including the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, a copy-transform mechanism for sequentially-consistent story visualization, and MART-based transformers to model complex interactions between frames.
Holistic static and animated 3D scene generation from diverse text descriptions
TLDR: This work proposes a framework for holistic static and animated 3D scene generation from diverse text descriptions that leverages rich contextual encoding, which allows it to process a larger range of possible natural language descriptions.
Text2Scene: Generating Compositional Scenes From Textual Descriptions
TLDR: Text2Scene is a model that generates various forms of compositional scene representations from natural language descriptions; it is not only competitive with state-of-the-art GAN-based methods on automatic metrics and superior under human judgments, but also has the advantage of producing interpretable results.
Proposal: Learning Local Representations of Images and Text
Images and text inherently exhibit hierarchical structures, e.g. scenes built from objects, sentences built from words. In many computer vision and natural language processing tasks, learning…
Towards story-based classification of movie scenes
TLDR: The ability of machine learning approaches to detect the narrative structure in movies, which could lead to the development of story-related video analytics tools, such as automatic video summarization and recommendation systems, is demonstrated.
Video Object Grounding Using Semantic Roles in Language Description
TLDR: This work investigates the role of object relations in VOG, proposes a novel framework VOGNet to encode multi-modal object relations via self-attention with relative position encoding, and proposes novel contrasting sampling methods to generate more challenging grounding input samples.
PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset
TLDR: A new story visualization model named PororoGAN is proposed, which jointly considers story-to-image-sequence, sentence-to-image, and word-to-image-patch alignment and outperforms the state-of-the-art models.
CAGAN: Text-To-Image Generation with Combined Attention GANs
TLDR: It is demonstrated that judging a model by a single evaluation metric can be misleading: an additional model with local self-attention scores a higher IS, outperforming the state of the art on the CUB dataset, yet generates unrealistic images through feature repetition.
Sound2Sight: Generating Visual Dynamics from Sound and Context
TLDR: This paper presents Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames, and proposes a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip.

References

Showing 1-10 of 39 references
Attentive Semantic Video Generation Using Captions
TLDR: This paper proposes a network architecture to perform variable-length semantic video generation using captions and shows that the network's ability to learn a latent representation allows it to generate videos in an unsupervised manner and perform other tasks such as action recognition.
Generating Videos with Scene Dynamics
TLDR: A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate better than simple baselines.
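The foreground/background untangling mentioned above is typically realized as a two-stream generator whose outputs are blended with a learned mask. Below is a minimal PyTorch sketch of that composition step, written as an illustration of the idea rather than the paper's code (the function name and tensor shapes are assumptions):

import torch

def compose_video(foreground, mask, background):
    # foreground: (batch, 3, T, H, W) video from a spatio-temporal (3D) generator stream
    # mask:       (batch, 1, T, H, W) values in [0, 1], selecting the moving pixels per frame
    # background: (batch, 3, H, W) single image from a static 2D stream
    T = foreground.size(2)
    background = background.unsqueeze(2).expand(-1, -1, T, -1, -1)  # replicate over time
    return mask * foreground + (1 - mask) * background              # per-pixel blend of the two streams

# Example shapes: a batch of two 16-frame 64x64 clips
fg = torch.rand(2, 3, 16, 64, 64)
m = torch.rand(2, 1, 16, 64, 64)
bg = torch.rand(2, 3, 64, 64)
video = compose_video(fg, m, bg)   # (2, 3, 16, 64, 64)

Only the foreground stream has to model motion; the background stays static across the clip, which is what the TLDR means by untangling the two.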
Generating Images from Captions with Attention
TLDR: It is demonstrated that the proposed model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.
Video Generation From Text
TLDR: Experimental results show that the proposed framework generates plausible and diverse videos while accurately reflecting the input text information, and significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR: This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
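The vector-space arithmetic referred to above is a nearest-neighbour query of the form image - word + word carried out in the joint embedding space. A small NumPy sketch of such a query (image_bank and word_vecs are hypothetical names, and all embeddings are assumed L2-normalized):

import numpy as np

def analogy_query(query_img, minus_word, plus_word, image_bank, word_vecs, k=5):
    # query_img:  embedding of the source image in the joint space
    # image_bank: (N, dim) embeddings of candidate images
    # word_vecs:  dict mapping words to their embeddings in the same space
    q = query_img - word_vecs[minus_word] + word_vecs[plus_word]
    q = q / (np.linalg.norm(q) + 1e-8)        # renormalize the query vector
    sims = image_bank @ q                     # cosine similarity to every candidate image
    return np.argsort(-sims)[:k]              # indices of the k nearest images

A query such as the embedding of a blue-car image minus "blue" plus "red" then retrieves images of red cars, which is the kind of regularity the TLDR points to.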
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
TLDR: This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains linking mentions of the same entities in images, as well as 276k manually annotated bounding boxes corresponding to each entity, essential for continued progress in automatic image description and grounded language understanding.
VSE++: Improved Visual-Semantic Embeddings
TLDR: This paper introduces a very simple change to the loss function used in the original formulation by Kiros et al. (2014), which leads to drastic improvements in retrieval performance, and shows that similar improvements also apply to the Order-embeddings by Vendrov et al.
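The "very simple change" is usually described as replacing the sum over negatives in the triplet ranking loss with the single hardest negative in the mini-batch. A PyTorch sketch of such a max-of-hinges loss, written as an illustration rather than the authors' released code (the function name and margin value are assumptions):

import torch

def max_hinge_loss(img_emb, cap_emb, margin=0.2):
    # img_emb, cap_emb: L2-normalized (batch, dim) tensors; row i of each is a matching image-caption pair
    scores = img_emb @ cap_emb.t()                         # cosine similarity matrix (batch, batch)
    pos = scores.diag().view(-1, 1)                        # similarities of the true pairs
    cost_cap = (margin + scores - pos).clamp(min=0)        # image as anchor, every caption as candidate
    cost_img = (margin + scores - pos.t()).clamp(min=0)    # caption as anchor, every image as candidate
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(eye, 0)                # ignore the positives on the diagonal
    cost_img = cost_img.masked_fill(eye, 0)
    # the key change: keep only the hardest negative per anchor instead of summing over all negatives
    return cost_cap.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

Concentrating the penalty on the hardest negatives in each mini-batch is what the paper credits for the large retrieval improvements.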
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR: This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
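The zero-shot predictions described above follow from how the model classifies: image features are mapped into a word-embedding space and the predicted labels are the nearest label embeddings, which need not have appeared in the training images. A NumPy sketch of that prediction step (names and shapes are illustrative, not the paper's code):

import numpy as np

def devise_predict(image_feat, projection, label_embeddings, label_names, k=5):
    # image_feat:       visual feature vector from a pretrained CNN
    # projection:       (embed_dim, feat_dim) learned linear map into the word-embedding space
    # label_embeddings: (num_labels, embed_dim) label word vectors, assumed L2-normalized;
    #                   may include labels never seen as training images
    v = projection @ image_feat
    v = v / (np.linalg.norm(v) + 1e-8)
    sims = label_embeddings @ v               # similarity to every label's word vector
    top = np.argsort(-sims)[:k]
    return [label_names[i] for i in top]      # k most likely labels, possibly unseen ones

Because scoring a label only needs its word vector, the model can rank tens of thousands of labels it never saw as images, which is the zero-shot capability the TLDR highlights.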
Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
TLDR: This work proposes a novel hierarchical approach for text-to-image synthesis by inferring semantic layout and shows that the model can substantially improve the image quality, interpretability of output, and semantic alignment to input text over existing approaches.
Learning Robust Visual-Semantic Embeddings
TLDR: An end-to-end learning framework that extracts more robust multi-modal representations across domains is presented, and a novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data.