# Efficient Training on Very Large Corpora via Gramian Estimation

@article{Krichene2019EfficientTO, title={Efficient Training on Very Large Corpora via Gramian Estimation}, author={Walid Krichene and Nicolas Mayoraz and Steffen Rendle and Li Zhang and Xinyang Yi and Lichan Hong and Ed H. Chi and John R. Anderson}, journal={ArXiv}, year={2019}, volume={abs/1807.07187} }

We study the problem of learning similarity functions over very large corpora using neural network embedding models. These models are typically trained using SGD with sampling of random observed and unobserved pairs, with a number of samples that grows quadratically with the corpus size, making it expensive to scale to very large corpora. We propose new efficient methods to train these models without having to sample unobserved pairs. Inspired by matrix factorization, our approach relies on…
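The abstract is cut off above, so the following is only a minimal sketch of the general Gramian identity familiar from matrix factorization, not the paper's exact algorithm. It assumes two hypothetical embedding matrices `U` (left items) and `V` (right items) and shows that the sum of squared dot products over all pairs can be computed from the two d-by-d Gramians, without enumerating the pairs. The function names and the quadratic penalty are illustrative assumptions.

```python
import numpy as np


def all_pairs_penalty_naive(U, V):
    """Sum of squared scores over every (i, j) pair: O(n * m * d) time."""
    scores = U @ V.T  # shape (n, m): one score per pair
    return float(np.sum(scores ** 2))


def all_pairs_penalty_gramian(U, V):
    """Same quantity via Gramians: O((n + m) * d^2) time, no pair enumeration."""
    G_u = U.T @ U  # (d, d) Gramian of the left embeddings
    G_v = V.T @ V  # (d, d) Gramian of the right embeddings
    # sum_ij <u_i, v_j>^2 = trace(G_u @ G_v) = elementwise inner product of G_u and G_v
    return float(np.sum(G_u * G_v))


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 8))
V = rng.normal(size=(300, 8))

naive = all_pairs_penalty_naive(U, V)
gram = all_pairs_penalty_gramian(U, V)
print(np.isclose(naive, gram))
```

The Gramian form touches each embedding once, so its cost grows linearly with the corpus size, whereas the naive form grows with the number of pairs.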

## 34 Citations

Sampling-bias-corrected neural modeling for large corpus item recommendations

- Computer Science
- RecSys
- 2019

A novel algorithm for estimating item frequency from streaming data that works without requiring a fixed item vocabulary, produces unbiased estimates, and adapts to changes in the item distribution.

An Efficient Newton Method for Extreme Similarity Learning with Nonlinear Embeddings

- Computer Science
- ArXiv
- 2020

This work applies the Newton method to the problem of learning similarity from all possible pairs using nonlinear embedding models (e.g., neural networks), and proposes an efficient algorithm which successfully eliminates the cost.

Pre-training Tasks for Embedding-based Large-scale Retrieval

- Computer Science, Mathematics
- ICLR
- 2020

It is shown that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks, and with adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers.

Self-supervised Learning for Deep Models in Recommendations

- Computer Science, Mathematics
- ArXiv
- 2020

The results demonstrate that the proposed framework outperforms learning with the supervision task only and other state-of-the-art regularization techniques in the context of retrieval.

Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces

- Computer Science
- NeurIPS
- 2019

GLaS, a new regularizer for embedding-based neural network approaches, is proposed as a natural generalization of the graph Laplacian and spread-out regularizers; empirically, it addresses the drawbacks of each regularizer alone when applied to the extreme classification setup.

ALX: Large Scale Matrix Factorization on TPUs

- Computer Science
- ArXiv
- 2021

We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design allows for efficient use of the TPU architecture and scales…

On Sampling Top-K Recommendation Evaluation

- Computer Science, Mathematics
- KDD
- 2020

This work thoroughly investigates the relationship between the sampling and global top-K Hit-Ratio (HR), originally proposed by Koren [2] and extensively used by others, and demonstrates both theoretically and experimentally that the sampling top-K Hit-Ratio provides an accurate approximation of its global counterpart.

Improving Relevance Prediction with Transfer Learning in Large-scale Retrieval Systems

- 2019

Machine learned large-scale retrieval systems require a large amount of training data representing query-item relevance. However, collecting users’ explicit feedback is costly. In this paper, we…

Scalable representation learning and retrieval for display advertising

- Computer Science
- ArXiv
- 2021

This work shows that combining large-scale matrix factorization with lightweight embedding fine-tuning unlocks state-of-the-art performance at scale, and proposes an efficient model (LED, for Lightweight Encoder-Decoder) reaching a new trade-off between complexity, scale and performance.

Personalized Ranking with Importance Sampling

- Computer Science
- WWW
- 2020

A new ranking loss based on importance sampling is proposed so that more informative negative samples are weighted more heavily. The loss function is verified to make better use of negative samples and to require fewer of them when they are more informative.

## References

Showing 1-10 of 28 references

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

- Computer Science
- NIPS
- 2015

This work develops an original algorithmic approach which, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in O(d^2) per example instead of O(Dd), remarkably without ever computing the D-dimensional output.

Swivel: Improving Embeddings by Noticing What's Missing

- Computer Science
- ArXiv
- 2016

We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization…

Strategies for Training Large Vocabulary Neural Language Models

- Computer Science
- ACL
- 2016

A systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical softmax, target sampling, noise contrastive estimation and self normalization. The work also extends self normalization to be a proper estimator of likelihood and introduces an efficient variant of softmax.

Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

- Computer Science, Medicine
- IEEE Transactions on Neural Networks
- 2008

The idea is to use an adaptive n-gram model to track the conditional distributions produced by the neural network, and it is shown that a very significant speedup can be obtained on standard problems.

GloVe: Global Vectors for Word Representation

- Computer Science
- EMNLP
- 2014

A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Efficient Estimation of Word Representations in Vector Space

- Computer Science
- ICLR
- 2013

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Neural Word Embedding as Implicit Matrix Factorization

- Computer Science, Mathematics
- NIPS
- 2014

It is shown that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks, and it is conjectured that this stems from the weighted nature of SGNS's factorization.

Large Scale Online Learning of Image Similarity through Ranking

- Computer Science
- IbPRIA
- 2009

OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost, which suggests that query-independent similarity could be accurately learned even for large-scale datasets that could not be handled before.

node2vec: Scalable Feature Learning for Networks

- Computer Science, Mathematics
- KDD
- 2016

In node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks, a flexible notion of a node's network neighborhood is defined and a biased random walk procedure is designed, which efficiently explores diverse neighborhoods.

A Generic Coordinate Descent Framework for Learning from Implicit Feedback

- Computer Science
- WWW
- 2017

It is shown that k-separability is a sufficient property to allow efficient optimization of implicit recommender problems with CD, and a new framework for deriving efficient CD algorithms for complex recommender models is provided.