Corpus ID: 235727350

Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization

Qing Guo, Junya Chen, Dong Wang, Yuewei Yang, Xinwei Deng, Lawrence Carin, Fan Li, Chenyang Tao
Successful applications of InfoNCE (Information Noise-Contrastive Estimation) and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds through the lens of unnormalized statistical…
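For reference, the dependence on batch size noted in the abstract can be seen in a minimal sketch of the standard InfoNCE estimate. This is not the estimator proposed in the paper. The function names `infonce_estimate` and `diagonal_scores`, and the toy critic scores, are assumptions made for illustration only.

```python
import math


def infonce_estimate(scores):
    """Standard InfoNCE estimate from a K x K matrix of critic scores.

    scores[i][j] = f(x_i, y_j); the diagonal holds the positive pairs.
    Computes mean_i [ f(x_i, y_i) - log sum_j exp f(x_i, y_j) ] + log K.
    """
    k = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over the K candidates for x_i.
        row_max = max(row)
        lse = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += row[i] - lse
    return total / k + math.log(k)


def diagonal_scores(k, gap):
    """Hypothetical toy critic: positives score `gap`, negatives score 0."""
    return [[gap if i == j else 0.0 for j in range(k)] for i in range(k)]


if __name__ == "__main__":
    for k in (8, 64, 512):
        estimate = infonce_estimate(diagonal_scores(k, gap=50.0))
        print(f"K={k:4d}  estimate={estimate:.4f}  log K={math.log(k):.4f}")
```

Each per-sample term `row[i] - lse` is at most zero, so the estimate can never exceed log K however well the critic separates positives from negatives. Reporting a larger MI value therefore requires a larger batch, which is the cost the abstract refers to.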


Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE
This work reveals mathematically why contrastive learners fail in the small-batch-size regime, and presents a novel simple, non-trivial contrastive objective named FlatNCE, which fixes this issue.
Quantifying the Task-Specific Information in Text-Based Classifications
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues
Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer
This work posits a meta-distributional scenario, where the causal generating mechanism for label-conditional features is invariant across different labels, which enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if their feature distributions show apparent disparities.


Demystifying fixed k-nearest neighbor information estimators
It is demonstrated that the KSG estimator is consistent and an upper bound on the rate of convergence of the ℓ2 error as a function of number of samples is identified, and it is argued that the performance benefits of the KSG estimator stem from a curious “correlation boosting” effect.
Understanding the Limitations of Variational Mutual Information Estimators
A new estimator is developed that focuses on variance reduction and theoretically shows that, under some conditions, estimators such as MINE exhibit variance that could grow exponentially with the true amount of underlying MI.
On Variational Bounds of Mutual Information
This work introduces a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance and demonstrates the effectiveness of these new bounds for estimation and representation learning.
Mutual Information Gradient Estimation for Representation Learning
This work argues that estimating gradients of MI is more appealing for representation learning than directly estimating MI due to the difficulty of estimating MI and proposes the Mutual Information Gradient Estimator (MIGE), which exhibits a tight and smooth gradient estimation of MI in the high-dimensional and large-MI setting.
FERMI: Fair Empirical Risk Minimization via Exponential Rényi Mutual Information
It is proved that FERMI converges for demographic parity, equalized odds, and equal opportunity notions of fairness in stochastic optimization, and achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency
It is shown that the ranking-based variant of NCE gives consistent parameter estimates under weaker assumptions than the classification-based method, which is closely related to negative sampling methods, now widely used in NLP.
On Fenchel Mini-Max Learning
A novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), is presented, that accommodates all four desiderata in a flexible and scalable manner and overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models.
On Mutual Information Maximization for Representation Learning
This paper argues, and provides empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.
Auto-Encoding Variational Bayes
A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.