Auto-Sizing Neural Networks: With Applications to n-gram Language Models

@inproceedings{Murray2015AutoSizingNN,
  title={Auto-Sizing Neural Networks: With Applications to n-gram Language Models},
  author={Kenton Murray and David Chiang},
  booktitle={EMNLP},
  year={2015}
}
Neural networks have been shown to improve performance across a range of natural-language tasks. However, designing and training them can be complicated. Frequently, researchers resort to repeated experimentation to pick optimal settings. In this paper, we address the issue of choosing the correct number of units in hidden layers. We introduce a method for automatically adjusting network size by pruning out hidden units through ℓ∞,1 and ℓ2,1 regularization. We apply this method to language…
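The pruning idea in the abstract can be sketched concretely. Both regularizers are group norms over the weights of each hidden unit: the ℓ2,1 norm sums the ℓ2 norms of the per-unit weight groups, while the ℓ∞,1 norm sums each unit's largest absolute weight, so the penalty can push entire units toward zero during training. The numpy sketch below is illustrative only, assuming rows of W correspond to hidden units; the function names (l2_1, l_inf_1, prune_units) and the threshold eps are hypothetical, not taken from the paper.

import numpy as np

def l2_1(W):
    # l2,1 norm: sum over hidden units (rows) of the l2 norm of each unit's weights
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

def l_inf_1(W):
    # l-infinity,1 norm: sum over hidden units of each unit's largest absolute weight
    return np.sum(np.max(np.abs(W), axis=1))

def prune_units(W, eps=1e-3):
    # Drop hidden units whose entire weight row has been driven to (near) zero
    # by the group penalty; returns the smaller matrix and the surviving indices.
    keep = np.max(np.abs(W), axis=1) > eps
    return W[keep], np.nonzero(keep)[0]

# Toy usage: a layer with 5 hidden units and 4 inputs, two rows already near zero.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))
W[[1, 3]] *= 1e-6
W_small, kept = prune_units(W)
print("kept units:", kept, "pruned layer shape:", W_small.shape)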


Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation
TLDR
It is shown that auto-sizing, which uses regularization to delete neurons from a network over the course of training, can improve BLEU scores by up to 3.9 points while removing one-third of the parameters from the model.
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
This work explores LayerDrop, a form of structured dropout that has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
Exploiting the Redundancy in Neural Machine Translation
TLDR
The efficacy of weight pruning as a compression and regularization technique is demonstrated, and the distribution of redundancy in the NMT architecture and the interaction of pruning with other forms of regularization, such as dropout, are investigated.
Compression of Neural Machine Translation Models via Pruning
TLDR
It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.
Compacting Neural Network Classifiers via Dropout Training
TLDR
A systematic comparison of dropout compaction and competing methods on several real-world speech recognition tasks found that dropout compaction achieved comparable accuracy with fewer than 50% of the hidden units, translating to a 2.5x speedup in run-time.
Not All Attention Is All You Need
TLDR
This paper proposes a novel dropout method named AttendOut to make self-attention-empowered PrLMs capable of more robust task-specific tuning, and demonstrates that state-of-the-art models with elaborate training design may achieve much stronger results.
Parameter-Frugal Neural Machine Translation
TLDR
In-training matrix factorization is found to be especially powerful on embedding layers, providing a simple and effective method to curtail the number of parameters with minimal impact on model performance, and, at times, an increase in performance.
Refining the Structure of Neural Networks Using Matrix Conditioning
TLDR
This work proposes a practical method that employs matrix conditioning to automatically design the structure of the layers of a feed-forward network, by first adjusting the proportion of neurons among the layers of a network and then scaling the size of the network up or down.
Sequence-Level Knowledge Distillation
TLDR
It is demonstrated that standard knowledge distillation applied to word-level prediction can be effective for NMT, and two novel sequence-level versions of knowledge distillation are introduced that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search.
Automatic node selection for Deep Neural Networks using Group Lasso regularization
TLDR
The experimental results demonstrate that DNN training with the gLasso regularizer embedded successfully selected the hidden-layer nodes that are necessary and sufficient for achieving high classification power (a minimal sketch of this kind of group penalty appears after this list).
...
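Group penalties like the gLasso regularizer in the entry above (and the ℓ2,1 regularizer used in this paper) are typically optimized with a proximal gradient step, which can set whole groups exactly to zero rather than merely shrinking them. The numpy sketch below shows the standard block soft-thresholding proximal operator for an ℓ2,1 penalty applied row-wise; the name prox_l2_1 and the toy training step are illustrative assumptions, not code from either paper.

import numpy as np

def prox_l2_1(W, lam, lr):
    # Block soft-thresholding: the proximal operator of lr * lam * ||W||_{2,1},
    # applied row-wise; it shrinks every row and sets small rows exactly to zero.
    norms = np.sqrt(np.sum(W ** 2, axis=1, keepdims=True))
    scale = np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return W * scale

# One hypothetical training step: a plain gradient step on the data loss,
# followed by the proximal step that handles the nondifferentiable group penalty.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))
grad = rng.standard_normal((5, 4))  # stand-in for dLoss/dW
lr, lam = 0.1, 0.5
W = prox_l2_1(W - lr * grad, lam, lr)
print("row norms after the proximal step:", np.sqrt((W ** 2).sum(axis=1)))

The ℓ∞,1 penalty admits an analogous row-wise proximal step, which involves a projection onto an ℓ1 ball and likewise produces exact zeros.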

References

SHOWING 1-10 OF 23 REFERENCES
From Feedforward to Recurrent LSTM Neural Networks for Language Modeling
TLDR
This paper compares count models to feedforward, recurrent, and long short-term memory (LSTM) neural network variants on two large-vocabulary speech recognition tasks, and analyzes the potential improvements that can be obtained when applying advanced algorithms to the rescoring of word lattices on large-scale setups.
OxLM: A Neural Language Modelling Framework for Machine Translation
TLDR
This paper presents an open source implementation of a neural language model for machine translation designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation.
A fast and simple algorithm for training neural probabilistic language models
TLDR
This work proposes a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions, and demonstrates the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary.
Decoding with Large-Scale Neural Language Models Improves Translation
TLDR
This work develops a new model that combines the neural probabilistic language model of Bengio et al., rectified linear units, and noise-contrastive estimation, and incorporates it into a machine translation system both by reranking k-best lists and by direct integration into the decoder.
Dropout: a simple way to prevent neural networks from overfitting
TLDR
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Fast and Robust Neural Network Joint Models for Statistical Machine Translation
TLDR
A novel formulation for a neural network joint model (NNJM) is presented, which augments the NNLM with a source context window; the model is purely lexicalized and can be integrated into any MT decoder.
A Neural Probabilistic Language Model
TLDR
This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.
Can artificial neural networks learn language models?
TLDR
This paper investigates an alternative way to build language models, namely using artificial neural networks to learn the language model, and shows that the neural network can learn a language model that performs even better than standard statistical methods.
RNNLM - Recurrent Neural Network Language Modeling Toolkit
We present a freely available open-source toolkit for training recurrent neural network based language models. It can be easily used to improve existing speech recognition and machine translation systems.
Simplifying Neural Networks by Soft Weight-Sharing
TLDR
A more complicated penalty term is proposed in which the distribution of weight values is modeled as a mixture of multiple Gaussians, which allows the parameters of the mixture model to adapt at the same time as the network learns.
...