
Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

    Shweta Mahajan, Iryna Gurevych, Stefan Roth
Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. The information… 
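The core idea above — modeling shared and domain-specific information separately — can be illustrated with a toy latent-space partition. This is a minimal sketch, not the paper's architecture; the dimensions and the simple concatenation layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent layout: the first d_shared dimensions carry
# information common to both domains; the remainder is domain-specific.
d_shared, d_img, d_txt = 4, 3, 3

def split_latent(z, d_shared):
    """Split a concatenated latent [z_shared | z_private]."""
    return z[:d_shared], z[d_shared:]

# Sample one joint latent per domain; both reuse the same shared block,
# which is what makes a cross-domain mapping possible.
z_shared = rng.normal(size=d_shared)
z_image = np.concatenate([z_shared, rng.normal(size=d_img)])
z_text = np.concatenate([z_shared, rng.normal(size=d_txt)])

s_img, p_img = split_latent(z_image, d_shared)
s_txt, p_txt = split_latent(z_text, d_shared)
assert np.allclose(s_img, s_txt)  # shared information aligned across domains
```

The private blocks `p_img` and `p_txt` remain free to model each domain's own generative process.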

Diverse Image Captioning with Context-Object Split Latent Spaces

This work introduces a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset, and extends this to images with novel objects and without paired captions in the training data.

Cross-Domain Latent Modulation for Variational Transfer Learning

A cross-domain latent modulation mechanism within a variational autoencoder (VAE) framework to enable improved transfer learning, showing competitive performance in unsupervised domain adaptation and image-to-image translation.

Variational Transfer Learning using Cross-Domain Latent Modulation

This work proposes to introduce a novel cross-domain latent modulation mechanism to a variational autoencoder framework so as to achieve effective transfer learning and demonstrates competitive performance.


This work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations, requiring no explicit supervision.

Gradual Domain Adaptation via Normalizing Flows

This work generates pseudo intermediate domains with normalizing flows and uses them for gradual domain adaptation, addressing the common lack of real intermediate-domain data and improving classification performance.
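Generating pseudo intermediate domains can be sketched as interpolation in a flow's latent space followed by inversion back to data space. This is a toy illustration, not the paper's method; the invertible map here is a hand-written affine function standing in for a learned normalizing flow.

```python
import numpy as np

# Toy invertible "flow" (affine map) for illustration only.
scale, shift = 2.0, 1.0
fwd = lambda x: (x - shift) / scale      # data -> latent
inv = lambda z: z * scale + shift        # latent -> data

def pseudo_intermediate(x_src, x_tgt, alphas):
    """Interpolate source/target latents and invert to get pseudo domains."""
    z_src, z_tgt = fwd(x_src), fwd(x_tgt)
    return [inv((1 - a) * z_src + a * z_tgt) for a in alphas]

x_src, x_tgt = np.array([0.0, 1.0]), np.array([4.0, 5.0])
domains = pseudo_intermediate(x_src, x_tgt, [0.0, 0.5, 1.0])
assert np.allclose(domains[0], x_src) and np.allclose(domains[-1], x_tgt)
```

With a learned flow, the intermediate points trace a path through data space along which a classifier can be adapted gradually.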

Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations

This work introduces supervised and unsupervised density-based alignment approaches, named Real-NVP and GAN-Real-NVP and driven by normalizing flows. Both dissect the alignment of multilingual subspaces into density matching and density modeling, and are complemented with validation criteria to guide the training process.

Learning Distinct and Representative Modes for Image Captioning

The innovative idea is to explore the rich modes in the training caption corpus to learn a set of mode embeddings, and further use them to control the mode of the generated captions for existing image captioning models, leading to better performance for both diversity and quality on the MSCOCO dataset.

Diverse Image Captioning with Grounded Style

This work analyzes the limitations of current stylized captioning datasets and proposes COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations, generating accurate captions with diverse styles that are grounded in the image.


This paper shows that a mapping to an RKHS enables deploying mature ideas from the kernel-methods literature in flow-based generative models, and empirically shows that this simple idea yields competitive results on popular datasets such as CelebA and promising results on a public 3D brain-imaging dataset where sample sizes are much smaller.

SceneTrilogy: On Scene Sketches and its Relationship with Text and Photo

This work extends multi-modal scene understanding, for the first time, to include free-hand scene sketches, resulting in a trilogy of scene data modalities (sketch, text, and photo).

M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

A unified model, M3D-GAN, that can translate across a wide range of modalities and domains, and introduces a universal attention module that is jointly trained with the whole network and learns to encode a large range of domain information into a highly structured latent space.

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

This work proposes Seq-CVAE, which learns a latent space for every word and encourages this temporal latent space to capture the 'intention' of how to complete the sentence by mimicking a representation that summarizes the future.

Towards Diverse and Natural Image Descriptions via a Conditional GAN

A new framework based on Conditional Generative Adversarial Networks (CGAN) is proposed, which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Two models are proposed that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects).
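The additive Gaussian encoding space described above can be sketched as a prior whose mean is a weighted combination of $K$ per-content-type component means. This is a hedged illustration under simple assumptions (unit covariance, hand-picked weights), not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# One component mean per content type (e.g., object category).
K, d = 5, 8
mu = rng.normal(size=(K, d))

def additive_prior_mean(weights, mu):
    """Combine component means with normalized per-image weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ mu

# An image containing content types 0 and 3, equally weighted:
mean = additive_prior_mean([1, 0, 0, 1, 0], mu)
sample = mean + rng.normal(size=d)    # draw z ~ N(mean, I)
```

Images with multiple kinds of content thus get a prior sitting between the corresponding component means, which is what lets the decoder produce descriptions covering several content types at once.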

Deep Visual-Semantic Alignments for Generating Image Descriptions

    A. Karpathy, L. Fei-Fei
    Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

Semantics Disentangling for Text-To-Image Generation

A novel photo-realistic text-to-image generation model that implicitly disentangles semantics to fulfill both high-level semantic consistency and low-level semantic diversity, together with a visual-semantic embedding strategy based on semantic-conditioned batch normalization to find diverse low-level semantics.

Learning Two-Branch Neural Networks for Image-Text Matching Tasks

This paper investigates two-branch neural networks for learning image-text similarity, addressing both image-sentence matching and region-phrase matching, and proposes two network structures that produce different output representations.

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

This work proposes to incorporate generative processes into the cross-modal feature embedding, through which it is able to learn not only the global abstract features but also the local grounded features of image-text pairs.

Latent Normalizing Flows for Discrete Sequences

A VAE-based generative model is proposed that jointly learns a normalizing-flow-based distribution in the latent space and a stochastic mapping to an observed discrete space; in this setting, it is found crucial for the flow-based distribution to be highly multimodal.

NICE: Non-linear Independent Components Estimation

We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model.
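NICE's central building block is the additive coupling layer: split the input, keep one half, and shift the other by a function of the kept half. A minimal sketch, assuming a fixed stand-in function `m` in place of a learned network:

```python
import numpy as np

def m(x1):
    """Stand-in for a learned neural network; any function of x1 works."""
    return np.tanh(x1) * 2.0

def coupling_forward(x):
    # Keep x1, shift x2 by m(x1). The Jacobian is lower-triangular with
    # unit diagonal, so the log-determinant is exactly zero.
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + m(x1)])

def coupling_inverse(y):
    # Inversion is exact: subtract the same shift, no need to invert m.
    y1, y2 = np.split(y, 2)
    return np.concatenate([y1, y2 - m(y1)])

x = np.array([0.5, -1.0, 2.0, 0.1])
y = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # exact invertibility
```

Because each layer only transforms half of the dimensions, practical flows alternate which half is kept across stacked layers.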