• Corpus ID: 3996935

Semantic Regularities in Document Representations

Fei Sun, J. Guo, Yanyan Lan, Jun Xu, Xueqi Cheng
Recent work has shown that distributed word representations are good at capturing linguistic regularities in language. This enables vector-oriented reasoning based on simple linear algebra over word vectors. Since many different methods have been proposed for learning document representations, it is natural to ask whether these learned representations also exhibit linear structure that allows similar reasoning at the document level. To answer this question, we design a new document analogy task for…
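The word-level vector reasoning described above can be illustrated with a toy example. The 3-d embeddings below are fabricated for illustration, and cosine similarity is assumed as the ranking measure:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings (hypothetical 3-d vectors, for illustration only).
emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.6, 0.1, 0.1]),
    "woman": np.array([0.6, 0.1, 0.9]),
    "queen": np.array([0.8, 0.7, 0.9]),
}

def analogy(a, b, c, emb):
    """Solve 'a is to b as c is to ?' by ranking vec(b) - vec(a) + vec(c)
    against all vocabulary words except the three inputs."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman", emb))  # -> queen
```

The paper's question is whether the same offset arithmetic holds when the vectors represent whole documents rather than words.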


Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity
This paper proposes a novel approach that estimates a semantic relevance score (SEM) based on document-to-document (D2D) similarity of embeddings and shows that the proposed approach outperforms strong baselines on standard TREC test collections.
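One plausible shape for such a document-to-document relevance score is the mean cosine similarity between a candidate document embedding and a set of pseudo-relevant document embeddings. This is a hypothetical sketch, not the paper's exact SEM formulation:

```python
import numpy as np

def sem_score(candidate, pseudo_relevant):
    """Hypothetical D2D relevance score: mean cosine similarity between a
    candidate document vector and a set of pseudo-relevant document
    vectors. Illustrative only; not the paper's exact SEM definition."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean([cos(candidate, d) for d in pseudo_relevant]))

cand = np.array([1.0, 0.0])
rel = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(sem_score(cand, rel))  # 0.5: one identical vector, one orthogonal
```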
Neural methods for effective, efficient, and exposure-aware information retrieval
This thesis presents novel neural architectures and methods motivated by the specific needs and challenges of IR tasks, and develops a framework to incorporate query term independence into any arbitrary deep model that enables large-scale precomputation and the use of inverted index for fast retrieval.
Neural Models for Information Retrieval
This tutorial introduces basic concepts and intuitions behind neural IR models, and places them in the context of traditional retrieval models, by introducing fundamental concepts of IR and different neural and non-neural approaches to learning vector representations of text.
Finnish resources for evaluating language model semantics
This research presents three resources for evaluating the semantic quality of Finnish language distributional models: a semantic similarity judgment resource, a word analogy and a word intrusion test set.
Combined WSD algorithms with LSA to identify semantic similarity in unstructured textual data
A method for measuring sentence semantic similarity is proposed that combines two knowledge-based Word Sense Disambiguation algorithms with Latent Semantic Analysis to identify the semantic similarity of sentences, and the results are compared with human evaluation.
An Introduction to Neural Information Retrieval
The monograph provides a complete picture of neural information retrieval techniques that culminate in supervised neural learning to rank models including deep neural network architectures that are trained end-to-end for ranking tasks.
A Feedback-Based Approach to Utilizing Embeddings for Clinical Decision Support
This paper proposes a novel feedback-based approach which considers the semantic association between a retrieved biomedical article and a pseudo feedback set and is able to improve over the best runs in the TREC CDS tasks.


Linguistic Regularities in Sparse and Explicit Word Representations
It is demonstrated that analogy recovery is not restricted to neural word embeddings, and that a comparable range of relational similarities can be recovered from traditional distributional word representations.
Linguistic Regularities in Continuous Space Word Representations
The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset.
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
Distributed Representations of Words and Phrases and their Compositionality
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Improving Word Representations via Global Context and Multiple Word Prototypes
A new neural network architecture is presented which learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and accounts for homonymy and polysemy by learning multiple embeddings per word.
Better Word Representations with Recursive Neural Networks for Morphology
This paper combines recursive neural networks, where each morpheme is a basic unit, with neural language models to consider contextual information in learning morphologically-aware word representations, and proposes a novel model capable of building representations for morphologically complex words from their morphemes.
Distributed Representations of Sentences and Documents
Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents, and its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.
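Paragraph Vector trains document vectors with a neural objective. A minimal PV-DBOW-style sketch, in which each document vector is trained to predict the words it contains via negative sampling, can be written in plain NumPy (a toy corpus and hyperparameters chosen for illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; real training uses far more data.
corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the yard".split(),
    "stock markets fell sharply today".split(),
]
vocab = sorted({w for doc in corpus for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

dim, lr, epochs, n_neg = 16, 0.05, 200, 3
D = rng.normal(scale=0.1, size=(len(corpus), dim))  # one vector per document
W = rng.normal(scale=0.1, size=(len(vocab), dim))   # output word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for d, doc in enumerate(corpus):
        for w in doc:
            # One positive target plus a few random negatives
            # (negatives may occasionally collide with the positive;
            # acceptable for a sketch).
            pairs = [(w2i[w], 1.0)] + [
                (int(rng.integers(len(vocab))), 0.0) for _ in range(n_neg)
            ]
            for j, label in pairs:
                p = sigmoid(D[d] @ W[j])
                g = lr * (label - p)   # logistic-loss gradient step
                dD = g * W[j]
                W[j] = W[j] + g * D[d]
                D[d] = D[d] + dD
```

After training, the rows of `D` are fixed-length representations of the variable-length documents; unseen documents get vectors by running the same updates with `W` frozen.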
Document Embedding with Paragraph Vectors
This work observes that the Paragraph Vector method performs significantly better than other methods, proposes a simple improvement to enhance embedding quality, and shows that, much like word embeddings, vector operations on Paragraph Vectors can yield meaningful semantic results.
From Word Embeddings To Document Distances
It is demonstrated on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, that the Word Mover's Distance metric leads to unprecedented low k-nearest neighbor document classification error rates.
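The full Word Mover's Distance requires solving an optimal-transport problem between the two documents' word vectors. A cheaper relaxed lower bound from the same line of work, in which each word sends all of its mass to its nearest counterpart in the other document, can be sketched as follows (assuming uniform word weights):

```python
import numpy as np

def relaxed_wmd(X, Y):
    """Relaxed Word Mover's Distance lower bound (uniform word weights):
    each word in one document moves all its mass to the nearest word in
    the other; taking the max over both directions tightens the bound."""
    # Pairwise Euclidean distances between word vectors of the two docs.
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())

X = np.array([[0.0, 0.0], [1.0, 0.0]])  # word vectors of document 1
Y = np.array([[0.0, 1.0], [1.0, 1.0]])  # word vectors of document 2
print(relaxed_wmd(X, Y))  # 1.0: each word is distance 1 from its nearest match
```

Because it replaces the transport solve with a nearest-neighbor lookup, this bound is fast enough to prune candidates before computing the exact distance.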