• Corpus ID: 88520815

Dealing with a large number of classes -- Likelihood, Discrimination or Ranking?

  title={Dealing with a large number of classes -- Likelihood, Discrimination or Ranking?},
  author={David Barber and Aleksandar Botev},
  journal={arXiv: Machine Learning},
We consider training probabilistic classifiers in the case of a large number of classes. The number of classes is assumed too large to perform exact normalisation over all classes. To account for this we consider a simple approach that directly approximates the likelihood. We show that this simple approach works well on toy problems and is competitive with recently introduced alternative non-likelihood based approximations. Furthermore, we relate this approach to a simple ranking objective… 

Figures from this paper

Listwise Learning to Rank by Exploring Unique Ratings
This paper proposes new listwise learning-to-rank models that mitigate the shortcomings of existing ones and proposes a novel and efficient way of refining prediction scores by combining an adapted Vanilla Recurrent Neural Network (RNN) model with pooling given selected documents at previous steps.
Variable length word encodings for neural translation models
BPE is better than character-level decomposition as it is more e cient, especially for the most frequent words, and it gives higher BLEU scores than Hu↵man coding and is able to decompose new words in the language.
S-Rocket: Selective Random Convolution Kernels for Time Series Classification
The kernels selection process is modeled as an optimization problem and a population-based approach is proposed for selecting the most important kernels and shows that on average it can achieve a similar performance to the original models by pruning more than 60% of kernels.
A Framework for Pruning Deep Neural Networks Using Energy-Based Models
  • H. Salehinejad, S. Valaee
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This paper proposes a framework for pruning DNNs based on a population-based global optimization method that can use any pruning objective function and proposes a simple but efficient objective function based on the concept of energy-based models.
EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks
Inspired by the dropout concept, EDropout is proposed as an energy-based framework for pruning neural networks in classification tasks and can prune neural networks without manually modifying the network architecture code.


On the Convergence of Monte Carlo Maximum Likelihood Calculations
SUMMARY Monte Carlo maximum likelihood for normalized families of distributions can be used for an extremely broad class of models. Given any family { he: 0 E 0 } of non-negative integrable
Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise and it is shown that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models.
Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model
The idea is to use an adaptive n-gram model to track the conditional distributions produced by the neural network, and it is shown that a very significant speedup can be obtained on standard problems.
Distributed Representations of Words and Phrases and their Compositionality
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
A fast and simple algorithm for training neural probabilistic language models
This work proposes a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions and demonstrates the scalability of the proposed approach by training several neural language models on a 47M-word corpus with a 80K-word vocabulary.
Learning word embeddings efficiently with noise-contrastive estimation
This work proposes a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation, and achieves results comparable to the best ones reported, using four times less data and more than an order of magnitude less computing time.
Hierarchical Probabilistic Neural Network Language Model
A hierarchical decomposition of the conditional probabilities that yields a speed-up of about 200 both during training and recognition, constrained by the prior knowledge extracted from the WordNet semantic hierarchy is introduced.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
Unbiased Monte Carlo Estimation of the Reciprocal of an Integral
Abstract A method to provide an unbiased Monte Carlo estimate of the reciprocal of an integral is described. In Monte Carlo transport calculations, one often uses a single sample as an estimate of an
On Using Very Large Target Vocabulary for Neural Machine Translation
It is shown that decoding can be efficiently done even with the model having a very large target vocabulary by selecting only a small subset of the whole target vocabulary.