• Corpus ID: 3897289

Word2Bits - Quantized Word Vectors

Maximilian Lam
Word vectors require significant amounts of memory and storage, which poses problems for resource-limited devices such as mobile phones and GPUs. […] We furthermore show that training with the quantization function acts as a regularizer. We train word vectors on English Wikipedia (2017) and evaluate them on standard word similarity and analogy tasks and on question answering (SQuAD). Our quantized word vectors not only take 8-16x less space than full precision (32 bit) word vectors but also outperform them on…
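As a rough illustration of what a quantization function and compact storage can look like, here is a minimal sketch of a 1-bit sign quantizer with bit packing. The function names, the scale value of 1/3, and the packing layout are assumptions made for this example; they are not taken from the abstract above and should not be read as the author's implementation.

```python
def quantize_1bit(vec, scale=1.0 / 3.0):
    """Map each component to +scale or -scale by its sign (scale is an assumed value)."""
    return [scale if x >= 0 else -scale for x in vec]


def pack_signs(vec):
    """Keep only one sign bit per component, eight components per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x >= 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_signs(packed, dim, scale=1.0 / 3.0):
    """Rebuild the quantized vector from its packed sign bits."""
    return [scale if (packed[i // 8] >> (i % 8)) & 1 else -scale
            for i in range(dim)]


if __name__ == "__main__":
    vec = [0.5, -1.2, 0.0, -0.1, 2.0, -3.0, 0.7, 0.9, -0.4]
    packed = pack_signs(vec)
    # The round trip recovers exactly the quantized values, not the originals.
    print(unpack_signs(packed, len(vec)) == quantize_1bit(vec))
    print(len(packed), "bytes for", len(vec), "components")
```

The byte counts printed by this toy layout describe only the sketch itself; the space savings quoted in the abstract are the paper's own measurements.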

Quantized Transformer

It is shown that the Transformer can be quantized to 8 bits with little loss of performance, which would allow large Transformer models to be deployed on servers and run efficiently on mobile devices, making offline translation possible.

Effective Dimensionality Reduction for Word Embeddings

This work presents a novel technique that efficiently combines PCA based dimensionality reduction with a recently proposed post-processing algorithm, to construct effective word embeddings of lower dimensions.

Hamming Sentence Embeddings for Information Retrieval

This work investigates the compression of sentence embeddings using a neural encoder-decoder architecture, which is trained by minimizing reconstruction error, and uses latent representations in Hamming space produced by the encoder for similarity calculations.
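To make "similarity calculations in Hamming space" concrete, the following minimal sketch compares binary codes stored as Python integers. It does not reproduce the encoder-decoder model from the summary above; the function names and the normalized similarity score are assumptions chosen for illustration.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def hamming_similarity(a, b, n_bits):
    """Fraction of matching bits between two n_bits-long codes (assumed score)."""
    return 1.0 - hamming_distance(a, b) / n_bits


if __name__ == "__main__":
    query = 0b1011
    candidates = [0b0010, 0b1010, 0b1011]
    # Rank candidate codes from most to least similar to the query.
    ranked = sorted(candidates, key=lambda c: hamming_distance(query, c))
    print([bin(c) for c in ranked])
```

Because the distance reduces to an XOR followed by a bit count, comparisons of this kind are cheap relative to floating-point dot products.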

Tensorized Embedding Layers for Efficient Model Compression

This work introduces a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance.

Quantized Transformer

It is found that the Transformer has struggled to adapt to the quantized scheme, and has so far failed to meaningfully learn on the IWSLT 2014 Vietnamese-English translation task.

Text classification with word embedding regularization and soft similarity measure

This work investigates the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the Word Mover's Distance on text classification.

Tensorized Embedding Layers

A novel way of parameterizing embedding layers based on the Tensor Train decomposition is introduced, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance.

A Framework for Enhancing Word Embeddings with Task-Specific Information

A pipeline framework with two main modules is proposed to address an issue with word embeddings generated by Word2Vec, showing that incorporating task-specific semantic information can improve embedding performance on the example task of entity categorization.

A calculation cost reduction method for a log-likelihood maximization in word2vec

  • S. Nakamura, M. Kimura
  • Computer Science
  • 2019 25th International Conference on Automation and Computing (ICAC), 2019
The purpose of this study was to speed up the training of the Continuous Bag-of-Words (CBOW) model, one of the word2vec models, by reducing the calculation cost of the likelihood function.

transformers.zip: Compressing Transformers with Pruning and Quantization

This work is the first to apply quantization methods to the Transformer architecture and the first to compare quantization and pruning on the Transformer architecture, and finds that the proposed quantization method is both significantly faster and gives equal or better performance at the same compression level.



Compressing Word Embeddings via Deep Compositional Code Learning

This work proposes to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick, and achieves a compression rate of 98% in a sentiment analysis task and 94% ~ 99% in machine translation tasks without performance loss.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

GloVe: Global Vectors for Word Representation

A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Learned in Translation: Contextualized Word Vectors

A deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation is used to contextualize word vectors; adding the resulting context vectors improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks.

Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance

It is demonstrated that most intrinsic evaluations are poor predictors of downstream performance, and this issue can be traced in part to a failure to distinguish specific similarity from relatedness in intrinsic evaluation datasets.

Better Word Representations with Recursive Neural Networks for Morphology

This paper combines recursive neural networks, where each morpheme is a basic unit, with neural language models to consider contextual information in learning morphologically-aware word representations, and proposes a novel model capable of building representations for morphologically complex words from their morphemes.

FastText.zip: Compressing text classification models

This work proposes a method built upon product quantization to store the word embeddings, yielding a text classifier derived from the fastText approach that at test time requires only a fraction of the memory of the original, without noticeably sacrificing classification accuracy.

Improving Distributional Similarity with Lessons Learned from Word Embeddings

It is revealed that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves, and these modifications can be transferred to traditional distributional models, yielding similar gains.

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes negative sampling, a simple alternative to the hierarchical softmax.

Training deep neural networks with low precision multiplications

It is found that very low precision is sufficient not just for running trained networks but also for training them, and that it is possible to train Maxout networks with 10-bit multiplications.