Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

Edgar Altszyler, Mariano Sigman, Diego Fernández Slezak


  • Computer Science
  • 2018
A methodology and techniques for learning semantic representations of low-resource languages, using word-vector representations for the Buryat language, are introduced, and a simple word-embedding evaluation scheme is proposed that can be easily adapted to any language.

Word Embedding in Small Corpora: A Case Study in Quran

The capability of word2vec for learning semantic representations of words in a small corpus is investigated, and it is demonstrated that the best performance for skip-gram occurs with 30 iterations when the dimension is set to 7.

Transferred Embeddings for Igbo Similarity, Analogy, and Diacritic Restoration Tasks

The results indicate that the projected models not only outperform the trained ones on the semantic tasks of analogy, word similarity, and odd-word identification, but also achieve enhanced performance on diacritic restoration with learned diacritic embeddings.

Toward meaningful notions of similarity in NLP embedding models

A method is proposed for determining which similarity values and thresholds are actually meaningful for a given embedding model, and how these thresholds, when taken into account during evaluation, change the models' evaluation scores on similarity test sets.

On the Various Semantics of Similarity in Word Embedding Models

This paper examines when exactly similarity values in word embedding models are meaningful, and proposes a method for determining which similarity values are actually meaningful for a given embedding model.

Considerations about learning Word2Vec

It is shown that the learning rate prevents the exact mapping of the co-occurrence matrix, that Word2Vec is unable to learn syntactic relationships, and that it does not suffer from the problem of overfitting.

Word2Vec vs LSA pour la détection des erreurs orthographiques produisant un dérèglement sémantique en arabe (Word2Vec vs LSA for detecting semantic errors in Arabic language)

Two word-embedding-based methods are described and compared for detecting spelling errors, more precisely those that generate lexically correct words but cause a semantic disturbance in the sentence.

VEC a semantic search engine for tagged artworks based on word embeddings

An algorithm for generating query vectors from ARTigo’s search queries is presented, and a user evaluation of the recall improvements resulting from each of the applied word embedding approaches is reported.

A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models

A variety of text representation methods and model designs that have blossomed in the context of NLP, including SOTA LMs, are described; these can transform large volumes of text into effective vector representations capturing the same semantic information.



Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD

This article investigates the use of three further factors that have been used to provide improved performance elsewhere, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD); it also introduces an additional semantic task and explores the advantages of using a much larger corpus.

Improving Distributional Similarity with Lessons Learned from Word Embeddings

It is revealed that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves, and these modifications can be transferred to traditional distributional models, yielding similar gains.

Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

An extensive evaluation of context-predicting models against classic, count-vector-based distributional semantic approaches, on a wide range of lexical semantics tasks and across many parameter settings, shows that the buzz around these models is fully justified.

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Scale-Invariant Transition Probabilities in Free Word Association Trajectories

It is shown that memory loss and cycling probabilities of free word association trajectories can be simultaneously accounted for by a model in which transitions are determined by a scale-invariant probability distribution.

Neural Word Embedding as Implicit Matrix Factorization

It is shown that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks; SGNS's remaining advantage over SVD on analogy questions is conjectured to stem from the weighted nature of SGNS's factorization.

Software Framework for Topic Modelling with Large Corpora

This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document in a memory-independent fashion. It implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.

Semantic Compositionality through Recursive Matrix-Vector Spaces

A recursive neural network model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length and can learn the meaning of operators in propositional logic and natural language is introduced.

Extracting semantic representations from word co-occurrence statistics: A computational study

This article presents a systematic exploration of the principal computational possibilities for formulating and validating representations of word meanings from word co-occurrence statistics and finds that, once the best procedures are identified, a very simple approach is surprisingly successful and robust over a range of psychologically relevant evaluation measures.