Dealing with a large number of classes -- Likelihood, Discrimination or Ranking?
@article{Barber2016DealingWA,
  title   = {Dealing with a large number of classes -- Likelihood, Discrimination or Ranking?},
  author  = {David Barber and Aleksandar Botev},
  journal = {arXiv: Machine Learning},
  year    = {2016}
}
We consider training probabilistic classifiers in the case of a large number of classes. The number of classes is assumed too large to perform exact normalisation over all classes. To account for this we consider a simple approach that directly approximates the likelihood. We show that this simple approach works well on toy problems and is competitive with recently introduced alternative non-likelihood based approximations. Furthermore, we relate this approach to a simple ranking objective…
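Since the abstract centres on approximating an intractable softmax normalisation, a minimal NumPy sketch of one such approximation may make the setting concrete: the normaliser is estimated from the target class plus a uniformly sampled subset of negatives, with the negatives reweighted accordingly. This is a generic sampled-softmax-style construction under my own assumptions, not necessarily the estimator used in the paper, and all names below are illustrative.

```python
import numpy as np

def approx_softmax_nll(logits, target, num_sampled, rng):
    """Approximate -log p(target) when there are too many classes to
    normalise over exactly (sampled-softmax-style sketch, assumed form)."""
    num_classes = logits.shape[0]
    # Draw negative classes uniformly at random, excluding the target.
    candidates = np.delete(np.arange(num_classes), target)
    negatives = rng.choice(candidates, size=num_sampled, replace=False)
    # Each sampled negative stands in for (num_classes - 1) / num_sampled
    # terms of the exact sum, so the reweighted subset sum is an unbiased
    # estimate of the true normaliser Z = sum_c exp(logits[c]).
    weight = (num_classes - 1) / num_sampled
    scores = np.concatenate(([logits[target]], logits[negatives]))
    weights = np.concatenate(([1.0], np.full(num_sampled, weight)))
    m = scores.max()
    log_Z_hat = m + np.log(np.sum(weights * np.exp(scores - m)))
    # Caveat: an unbiased estimate of Z plugged into the log still gives a
    # biased estimate of log Z, a known issue with plug-in approximations.
    return -(logits[target] - log_Z_hat)

rng = np.random.default_rng(0)
logits = rng.normal(size=100_000)   # unnormalised scores for 100k classes
print(approx_softmax_nll(logits, target=42, num_sampled=64, rng=rng))
```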
5 Citations
Listwise Learning to Rank by Exploring Unique Ratings
- Computer Science, WSDM
- 2020
This paper proposes new listwise learning-to-rank models that mitigate the shortcomings of existing ones, along with a novel and efficient way of refining prediction scores by combining an adapted vanilla Recurrent Neural Network (RNN) model with pooling over the documents selected at previous steps.
Variable length word encodings for neural translation models
- Computer Science
- 2016
BPE is better than character-level decomposition as it is more efficient, especially for the most frequent words, and it gives higher BLEU scores than Huffman coding and is able to decompose new words in the language.
S-Rocket: Selective Random Convolution Kernels for Time Series Classification
- Computer Science, ArXiv
- 2022
The kernel selection process is modeled as an optimization problem and a population-based approach is proposed for selecting the most important kernels; on average, this achieves performance similar to the original models while pruning more than 60% of the kernels.
A Framework for Pruning Deep Neural Networks Using Energy-Based Models
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This paper proposes a framework for pruning DNNs based on a population-based global optimization method that can use any pruning objective function, together with a simple but efficient objective function based on the concept of energy-based models.
EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks
- Computer Science, IEEE Transactions on Neural Networks and Learning Systems
- 2021
Inspired by the dropout concept, EDropout is proposed as an energy-based framework for pruning neural networks in classification tasks and can prune neural networks without manually modifying the network architecture code.
References
SHOWING 1-10 OF 13 REFERENCES
On the Convergence of Monte Carlo Maximum Likelihood Calculations
- Mathematics
- 1994
SUMMARY Monte Carlo maximum likelihood for normalized families of distributions can be used for an extremely broad class of models. Given any family $\{ h_\theta : \theta \in \Theta \}$ of non-negative integrable…
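For readers unfamiliar with the Monte Carlo maximum likelihood calculations referenced here, the standard presentation (notation mine) approximates the intractable ratio of normalisers by importance sampling from the model at a fixed parameter $\psi$:

$$
\log\frac{Z(\theta)}{Z(\psi)}
= \log \mathbb{E}_{X\sim p_\psi}\!\left[\frac{h_\theta(X)}{h_\psi(X)}\right]
\approx \log \frac{1}{n}\sum_{i=1}^{n}\frac{h_\theta(X_i)}{h_\psi(X_i)},
\qquad X_i \sim p_\psi = h_\psi / Z(\psi).
$$

The Monte Carlo MLE then maximises $\log h_\theta(x) - \log\frac{1}{n}\sum_{i} h_\theta(X_i)/h_\psi(X_i)$ over $\theta$, and the cited paper studies when this converges to the exact maximum likelihood estimate as $n \to \infty$.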
Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
- Computer Science, Mathematics, J. Mach. Learn. Res.
- 2012
The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise and it is shown that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models.
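To make the discrimination idea concrete, here is a minimal NumPy sketch of the noise-contrastive objective in its usual form; the callables and the toy Gaussian example are placeholders of my own, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(log_model, log_noise, data, noise):
    """Noise-contrastive estimation loss (usual formulation, sketched).

    log_model : unnormalised model log-density; its normalising constant is
                absorbed into the parameters and learned alongside them
    log_noise : log-density of a known, tractable noise distribution
    data      : observed samples, shape (T,)
    noise     : noise samples, shape (nu * T,)
    """
    nu = len(noise) / len(data)                   # noise-to-data ratio
    # Log-odds that a point came from the model rather than the noise.
    g_data = log_model(data) - log_noise(data)
    g_noise = log_model(noise) - log_noise(noise)
    # Posterior probability of the "data" label under mixture weights 1 : nu.
    h_data = sigmoid(g_data - np.log(nu))
    h_noise = sigmoid(g_noise - np.log(nu))
    # Logistic-regression loss: observed samples labelled 1, noise labelled 0.
    return -(np.sum(np.log(h_data)) + np.sum(np.log(1.0 - h_noise))) / len(data)

# Toy usage: unnormalised standard-normal model vs. a wider Gaussian noise.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)
noise = rng.normal(0.0, 2.0, size=5000)
log_model = lambda x: -0.5 * x**2                         # unnormalised N(0, 1)
log_noise = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))
print(nce_loss(log_model, log_noise, data, noise))
```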
Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model
- Computer Science, IEEE Transactions on Neural Networks
- 2008
The idea is to use an adaptive n-gram model to track the conditional distributions produced by the neural network, and it is shown that a very significant speedup can be obtained on standard problems.
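The importance-sampling trick summarised here can be stated compactly (standard presentation, notation mine). For a model with score $s_\theta(w, h)$ for word $w$ in context $h$, the log-likelihood gradient is

$$
\nabla_\theta \log P_\theta(w \mid h)
= \nabla_\theta s_\theta(w, h)
- \sum_{w'} P_\theta(w' \mid h)\, \nabla_\theta s_\theta(w', h),
$$

and the expensive expectation over the full vocabulary is estimated by drawing $w'_1,\dots,w'_m$ from a cheap proposal $q(\cdot \mid h)$ (the adaptive n-gram model) and using self-normalised importance weights $v_i \propto \exp\!\bigl(s_\theta(w'_i, h)\bigr)/q(w'_i \mid h)$ with $\sum_i v_i = 1$, giving the estimate $\sum_i v_i\, \nabla_\theta s_\theta(w'_i, h)$.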
Distributed Representations of Words and Phrases and their Compositionality
- Computer Science, NIPS
- 2013
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
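The negative-sampling objective mentioned in this summary is simple enough to sketch directly; below is a minimal NumPy version for a single (centre, context) pair, with illustrative names and random vectors standing in for learned embeddings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for one (centre, context) pair.

    center_vec    : (d,)   input vector of the centre word
    context_vec   : (d,)   output vector of the observed context word
    negative_vecs : (k, d) output vectors of k "noise" words, typically drawn
                    from the unigram distribution raised to the 3/4 power
    """
    pos = np.log(sigmoid(context_vec @ center_vec))             # pull the true pair together
    neg = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))  # push the noise words away
    return -(pos + neg)

# Toy usage with random vectors in place of learned embeddings.
rng = np.random.default_rng(0)
d, k = 50, 5
print(negative_sampling_loss(rng.normal(size=d),
                             rng.normal(size=d),
                             rng.normal(size=(k, d))))
```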
A fast and simple algorithm for training neural probabilistic language models
- Computer Science, ICML
- 2012
This work proposes a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions, and demonstrates the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary.
Learning word embeddings efficiently with noise-contrastive estimation
- Computer Science, NIPS
- 2013
This work proposes a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation, and achieves results comparable to the best ones reported, using four times less data and more than an order of magnitude less computing time.
Hierarchical Probabilistic Neural Network Language Model
- Computer Science, AISTATS
- 2005
A hierarchical decomposition of the conditional probabilities, constrained by prior knowledge extracted from the WordNet semantic hierarchy, is introduced, yielding a speed-up of about 200 during both training and recognition.
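A minimal sketch of how a tree-structured softmax turns one V-way normalisation into a product of binary decisions along a root-to-leaf path, so the cost per word is O(log V) rather than O(V). The paper builds its tree from WordNet; the arbitrary toy tree and names below are my own illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(hidden, path_nodes, path_signs, node_vecs):
    """P(word | context) under a binary hierarchical softmax.

    hidden     : (d,)  hidden/context representation
    path_nodes : internal-node indices on the root-to-word path
    path_signs : +1 / -1 for the branch taken at each node
    node_vecs  : (num_internal_nodes, d) one parameter vector per node
    """
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        # Each internal node is a binary classifier; the word's probability
        # is the product of branch probabilities along its path, and the
        # branch probabilities at every node sum to one, so the leaf
        # probabilities form a valid distribution without a V-way sum.
        p *= sigmoid(sign * (node_vecs[node] @ hidden))
    return p

# Toy usage: a balanced tree over 8 words has 7 internal nodes, depth 3.
rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(7, 16))
hidden = rng.normal(size=16)
print(hierarchical_softmax_prob(hidden, path_nodes=[0, 1, 4],
                                path_signs=[+1, -1, +1],
                                node_vecs=node_vecs))
```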
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
- Computer Science, AISTATS
- 2010
A new estimation principle is presented to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity, which leads to a consistent (convergent) estimator of the parameters.
Unbiased Monte Carlo Estimation of the Reciprocal of an Integral
- Mathematics
- 2007
Abstract A method to provide an unbiased Monte Carlo estimate of the reciprocal of an integral is described. In Monte Carlo transport calculations, one often uses a single sample as an estimate of an…
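One classical construction for such an unbiased reciprocal estimate (sketched generically; the specific scheme of the cited paper may differ) expands the reciprocal in a geometric series around a constant $c$ with $0 < Z/c < 2$:

$$
\frac{1}{Z} \;=\; \frac{1}{c}\sum_{k=0}^{\infty}\Bigl(1 - \frac{Z}{c}\Bigr)^{k},
$$

where the $k$-th term is estimated without bias by the product $\prod_{j=1}^{k}\bigl(1 - \hat Z_j/c\bigr)$ of independent unbiased estimates $\hat Z_j$ of $Z$, and the infinite sum is truncated with a randomised (Russian-roulette) stopping rule so that unbiasedness is preserved. In the context of the main paper, $1/Z$ plays the role of the reciprocal of the softmax normaliser, which becomes intractable when the number of classes is large.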
On Using Very Large Target Vocabulary for Neural Machine Translation
- Computer Science, ACL
- 2015
It is shown that decoding can be efficiently done even with the model having a very large target vocabulary by selecting only a small subset of the whole target vocabulary.