Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

@article{Altszyler2016ComparativeSO,
  title={Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database},
  author={E. Altszyler and M. Sigman and D. Slezak},
  journal={ArXiv},
  year={2016},
  volume={abs/1610.01520}
}
Word embeddings have been extensively studied in large text datasets. However, only a few studies analyze semantic representations of small corpora, which are particularly relevant in single-person text production studies. In the present paper, we compare Skip-gram and LSA capabilities in this scenario, and we test both techniques to extract relevant semantic patterns in single-series dream reports. LSA showed better performance than Skip-gram on two semantic tests with small training corpora. As a …
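A minimal sketch of how the two models being compared could be trained side by side with gensim (the topic-modelling framework cited in the references below). The toy corpus, the hyperparameters, and the query word are illustrative assumptions, not the paper's actual data or settings.

```python
# Hedged sketch: Skip-gram vs. LSA on a small corpus with gensim.
# The toy "dream report" sentences and all hyperparameters are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LsiModel, TfidfModel, Word2Vec

corpus = [
    ["i", "dreamed", "i", "was", "flying", "over", "the", "sea"],
    ["a", "dog", "chased", "me", "through", "an", "empty", "house"],
    ["i", "was", "flying", "again", "above", "a", "dark", "forest"],
]

# Skip-gram (sg=1): a predictive model trained with stochastic updates.
skipgram = Word2Vec(sentences=corpus, sg=1, vector_size=10, window=5,
                    min_count=1, epochs=50, seed=1)

# LSA: a count-based model, i.e. truncated SVD of a (TF-IDF weighted)
# word-document matrix.
dictionary = Dictionary(corpus)
bow = [dictionary.doc2bow(doc) for doc in corpus]
lsa = LsiModel(TfidfModel(bow)[bow], id2word=dictionary, num_topics=2)

# Word vectors: skipgram.wv[word] for Skip-gram, rows of lsa.projection.u
# for LSA; both can then be scored with cosine similarity on semantic tests.
print(skipgram.wv.most_similar("flying", topn=3))
```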

Citations

Learning Word Embeddings for Low Resource Languages: The Case of Buryat
2018
Word-vector representations have been extensively studied for rich resource languages with large text datasets. However, only a few studies analyze semantic representations of low resource languages, …
Word Embedding in Small Corpora: A Case Study in Quran
The capability of word2vec to learn semantic representations of words in a small corpus is investigated, and it is demonstrated that the best performance for skip-gram occurs with 30 iterations when the dimension is set to 7.
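For illustration only, a sketch of where the two settings reported above (roughly 30 training iterations and dimension 7) would plug into a gensim Skip-gram call; the placeholder token list and the remaining parameters are assumptions, not that study's actual preprocessing.

```python
# Hedged sketch: small-corpus Skip-gram with the reported settings.
from gensim.models import Word2Vec

sentences = [["placeholder", "tokens", "for", "the", "small", "corpus"]]

model = Word2Vec(
    sentences=sentences,
    sg=1,            # Skip-gram architecture
    vector_size=7,   # embedding dimension reported as best above
    epochs=30,       # number of training iterations reported as best above
    window=5, min_count=1, seed=1,
)
```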
Comparative study of word embedding methods in topic segmentation
Word embedding methods are investigated in the field of topic segmentation for both Arabic and English, and it is found that the performance of LSA, Word2Vec, and GloVe depends on the language used.
Transferred Embeddings for Igbo Similarity, Analogy, and Diacritic Restoration Tasks
The results indicate that the projected models not only outperform the trained ones on the semantic tasks of analogy, word similarity, and odd-word identification, but also achieve enhanced performance on diacritic restoration with learned diacritic embeddings.
Toward meaningful notions of similarity in NLP embedding models
A method is proposed for stating which similarity values and thresholds actually are meaningful for a given embedding model, and it is shown how these thresholds, when taken into account during evaluation, change the evaluation scores of the models on similarity test sets.
A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models
A variety of text representation methods and model designs that have blossomed in the context of NLP, including state-of-the-art language models, are described; these can transform large volumes of text into effective vector representations capturing the same semantic information.
On the Various Semantics of Similarity in Word Embedding Models
This paper examines when exactly similarity values in word embedding models are meaningful, and proposes a method stating which similarity values actually are meaningful for a given embedding model.
Considerations about learning Word2Vec
Despite the large diffusion and use of embeddings generated through Word2Vec, there are still many open questions about the reasons for its results and about its real capabilities. In particular, …
Word2Vec vs LSA pour la détection des erreurs orthographiques produisant un dérèglement sémantique en arabe (Word2Vec vs LSA for detecting semantic errors in Arabic language)
Word2Vec vs LSA for detecting semantic errors in the Arabic language. Arabic words are lexically close to each other, and the probability of obtaining a correct word by making a typographical error is greater …
VEC a semantic search engine for tagged artworks based on word embeddings
This bachelor thesis reports on the implementation of a semantic search engine based on word embeddings. The dataset comes from the ARTigo ecosystem, which comprises at the same time a tagging …

References

Showing 1-10 of 48 references
Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD
This article investigates the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere; it also introduces an additional semantic task and explores the advantages of using a much larger corpus.
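A rough sketch of the kind of pipeline discussed above, assuming a stop-list, a crude suffix-stripping stemmer, word-word co-occurrence counts, and truncated SVD; none of the choices below reproduce the article's exact procedure.

```python
# Hedged sketch: stop-list, toy stemming, co-occurrence counts, then SVD.
from itertools import combinations
import numpy as np

docs = ["the cats were sleeping on the warm mats",
        "a cat sleeps on a mat near the window"]
stop_list = {"the", "a", "on", "were", "near"}

def stem(word):
    return word.rstrip("s")  # toy suffix stripping, standing in for a real stemmer

tokens = [[stem(w) for w in d.split() if w not in stop_list] for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within each (short) document.
counts = np.zeros((len(vocab), len(vocab)))
for doc in tokens:
    for w1, w2 in combinations(doc, 2):
        counts[index[w1], index[w2]] += 1
        counts[index[w2], index[w1]] += 1

# Dimensionality reduction: keep the top-k dimensions of the SVD.
k = 2
u, s, _ = np.linalg.svd(counts)
word_vectors = u[:, :k] * s[:k]   # one row per vocabulary word
print(dict(zip(vocab, np.round(word_vectors, 2).tolist())))
```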
Improving Distributional Similarity with Lessons Learned from Word Embeddings
It is revealed that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations rather than the embedding algorithms themselves, and that these modifications can be transferred to traditional distributional models, yielding similar gains.
Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
An extensive evaluation of context-predicting models against classic, count-vector-based distributional semantic approaches, on a wide range of lexical semantics tasks and across many parameter settings, shows that the buzz around these models is fully justified.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
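A small usage sketch of the word-analogy evaluation mentioned above, using gensim's KeyedVectors; the file name refers to the publicly released word2vec vectors and is assumed to be available locally.

```python
# Hedged sketch: vector-arithmetic analogy query over pretrained word2vec vectors.
from gensim.models import KeyedVectors

# Path to a local copy of the publicly released vectors (assumption).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Classic semantic analogy: king - man + woman is expected to be close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```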
Scale-Invariant Transition Probabilities in Free Word Association Trajectories
It is shown that the memory loss and cycling probabilities of free word association trajectories can be simultaneously accounted for by a model in which transitions are determined by a scale-invariant probability distribution.
Neural Word Embedding as Implicit Matrix Factorization
It is shown that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks, and it is conjectured that this stems from the weighted nature of SGNS's factorization.
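A short numpy sketch of the shifted positive PMI (SPPMI) weighting mentioned above, applied to a tiny word-context count matrix; the matrix and the shift value are illustrative only.

```python
# Hedged sketch: shifted positive PMI over a word-context count matrix.
import numpy as np

def sppmi(counts, k=5):
    """PMI shifted down by log(k) and clipped at zero."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)

counts = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 2.0],
                   [0.0, 2.0, 5.0]])
print(np.round(sppmi(counts, k=2), 2))
```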
Studying dream content using the archive and search engine on DreamBank.net
This paper shows how the dream archive and search engine on DreamBank.net can be used to generate new findings on dream content, some of which raise interesting questions about the relationship between dreaming and various forms of waking thought.
Software Framework for Topic Modelling with Large Corpora
This work describes a natural language processing software framework based on the idea of document streaming, i.e. processing corpora document after document in a memory-independent fashion; it implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
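A brief sketch of the document-streaming idea described above: the corpus is any iterable that yields one bag-of-words document at a time, so the model never holds the whole collection in memory. The file name is a hypothetical one-document-per-line text file.

```python
# Hedged sketch: memory-independent LSA over a streamed corpus with gensim.
from gensim.corpora import Dictionary
from gensim.models import LsiModel

class StreamedCorpus:
    """Yields one bag-of-words document per line of a text file."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                yield self.dictionary.doc2bow(line.lower().split())

path = "documents.txt"  # hypothetical file, one document per line
dictionary = Dictionary(line.lower().split() for line in open(path))
lsa = LsiModel(StreamedCorpus(path, dictionary), id2word=dictionary, num_topics=100)
```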
Semantic Compositionality through Recursive Matrix-Vector Spaces
A recursive neural network model is introduced that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length, and that can learn the meaning of operators in propositional logic and natural language.
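As a very rough illustration of the matrix-vector composition idea (based on the cited work as commonly described, so the details below are assumptions), each word carries both a vector and a matrix, and a parent phrase vector is built from each child's vector transformed by the other child's matrix.

```python
# Hedged sketch: composing two (vector, matrix) word representations.
import numpy as np

n = 4                                            # illustrative embedding size
rng = np.random.default_rng(0)
a, b = rng.normal(size=n), rng.normal(size=n)    # word vectors
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))  # word matrices
W = rng.normal(size=(n, 2 * n))                  # learned composition weights

# Parent vector: nonlinearity over each child's vector mapped by the other's matrix.
parent = np.tanh(W @ np.concatenate([B @ a, A @ b]))
print(parent)
```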
Extracting semantic representations from word co-occurrence statistics: A computational study
This article presents a systematic exploration of the principal computational possibilities for formulating and validating representations of word meanings from word co-occurrence statistics, and finds that, once the best procedures are identified, a very simple approach is surprisingly successful and robust over a range of psychologically relevant evaluation measures.