Corpus ID: 213969759

Mutual Information Gradient Estimation for Representation Learning

Liangjiang Wen, Yiji Zhou, Lirong He, Mingyuan Zhou, Zenglin Xu

Mutual information (MI) plays an important role in representation learning. However, MI is intractable in continuous and high-dimensional settings. Recent advances establish tractable and scalable MI estimators to discover useful representations. However, most existing methods cannot provide accurate, low-variance estimates of MI when the MI is large. We argue that estimating gradients of MI is more appealing for representation learning than directly…

Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization
This work revisits the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization, and results in a novel, simple, and powerful contrastive MI estimator, named FLO.
Combating the Instability of Mutual Information-based Losses via Regularization
This work identifies the symptoms behind MI-based losses' instability and mitigates both issues by adding a novel regularization term to the existing losses, and theoretically and experimentally demonstrates that added regularization stabilizes training.
Nonparametric Score Estimators
This work proposes score estimators based on iterative regularization that enjoy computational benefits from curl-free kernels and fast convergence and provides a unifying view of these estimators under the framework of regularized nonparametric regression.
AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation
This paper proposes the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy.
Self-Supervision Can Be a Good Few-Shot Learner
This work proposes an effective unsupervised FSL method, learning representations with self-supervision, following the InfoMax principle, which achieves comparable performance on widely used FSL benchmarks without any labels of the base classes.
Barycentric-alignment and invertibility for domain generalization
A new upper bound is derived for Domain Generalization (DG) problem, where the hypotheses are composed of a common representation mapping followed by a labeling function, by imposing mild assumptions on the loss function and an invertibility requirement on the representation map when restricted to the low-dimensional data manifold.
Neural Approximate Sufficient Statistics for Implicit Models
We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models where the evaluation of likelihood function is intractable but sampling /
Barycenteric distribution alignment and manifold-restricted invertibility for domain generalization
A new representation learning cost for DG is motivated that additively balances three competing objectives: 1) minimizing classification error across seen domains via cross entropy, 2) enforcing domain-invariance in the representation space via the Wasserstein-2 barycenter cost, and 3) promoting non-degenerate, nearly-invertible representation via one of two mechanisms.


On Mutual Information Maximization for Representation Learning
This paper argues, and provides empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.
Learning deep representations by mutual information estimation and maximization
It is shown that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks and is an important step towards flexible formulations of representation learning objectives for specific end-goals.
The Role of the Information Bottleneck in Representation Learning
This work derives an upper bound to the so-called generalization gap corresponding to the cross-entropy loss and shows that when this bound times a suitable multiplier and the empirical risk are minimized jointly, the problem is equivalent to optimizing the Information Bottleneck objective with respect to the empirical data-distribution.
Approximating Mutual Information by Maximum Likelihood Density Ratio Estimation
This paper proposes a new method of approximating mutual information based on maximum likelihood estimation of a density ratio function, called Maximum Likelihood Mutual Information (MLMI), which has several attractive properties, e.g., density estimation is not involved, it is a single-shot procedure, the global optimal solution can be efficiently computed, and cross-validation is available for model selection.
Learning Discrete Representations via Information Maximizing Self-Augmented Training
In IMSAT, data augmentation is used to impose invariance on discrete representations: the predicted representations of augmented data points are encouraged to be close to those of the original data points, in an end-to-end fashion, while maximizing the information-theoretic dependency between data and their predicted discrete representations.
On Variational Bounds of Mutual Information
This work introduces a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance and demonstrates the effectiveness of these new bounds for estimation and representation learning.
Representation Learning with Contrastive Predictive Coding
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Relevant sparse codes with variational information bottleneck
This work proposes an approximate variational scheme for maximizing a lower bound on the IB objective, analogous to variational EM, and derives an IB algorithm to recover features that are both relevant and sparse.
A Spectral Approach to Gradient Estimation for Implicit Distributions
This work proposes a gradient estimator for implicit distributions based on Stein's identity and a spectral decomposition of kernel operators, where the eigenfunctions are approximated by the Nyström method, which allows for a simple and principled out-of-sample extension.
Fixing a Broken ELBO
This framework derives variational lower and upper bounds on the mutual information between the input and the latent variable, and uses these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy.