Corpus ID: 19631717

Learning Features from Co-occurrences: A Theoretical Analysis

@inproceedings{Li2018LearningFF,
  title={Learning Features from Co-occurrences: A Theoretical Analysis},
  author={Yanpeng Li},
  booktitle={COLING},
  year={2018}
}
  • Yanpeng Li
  • Published in COLING 2018
  • Computer Science, Mathematics
Representing a word by its co-occurrences with other words in context is an effective way to capture the meaning of the word. However, the theory behind it remains a challenge. In this work, taking the example of a word classification task, we give a theoretical analysis of the approaches that represent a word X by a function f(P(C|X)), where C is a context feature, P(C|X) is the conditional probability estimated from a text corpus, and the function f maps the co-occurrence measure to a prediction…
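As an illustration of the representation described in the abstract, the following minimal sketch estimates P(C|X) from a toy corpus by relative frequency and applies a function f to each estimate. It is not the paper's method: the window-based definition of a context feature, the toy corpus, the helper names `conditional_context_probs` and `represent`, and the identity default for f are all assumptions made for this example, since the truncated abstract does not specify them.

```python
from collections import Counter, defaultdict


def conditional_context_probs(sentences, window=1):
    """Estimate P(C | X) by relative frequency.

    Assumption: a context feature C is any word within `window`
    positions of X in the same sentence.
    """
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, x in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[x][tokens[j]] += 1

    probs = {}
    for x, counts in pair_counts.items():
        total = sum(counts.values())
        probs[x] = {c: n / total for c, n in counts.items()}
    return probs


def represent(word, probs, contexts, f=lambda p: p):
    """Return [f(P(C | word)) for C in contexts].

    `f` is a placeholder (identity by default); the source does not
    say which function the paper analyses.
    """
    word_probs = probs.get(word, {})
    return [f(word_probs.get(c, 0.0)) for c in contexts]


# Hypothetical toy corpus, already tokenized.
corpus = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

probs = conditional_context_probs(corpus, window=1)
contexts = ["the", "sat", "ran"]
cat_vector = represent("cat", probs, contexts)
dog_vector = represent("dog", probs, contexts)
print(cat_vector, dog_vector)
```

Passing a different `f` (for example a thresholding or logarithmic transform) changes only the final mapping, which keeps the estimation of P(C|X) separate from the choice of f.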
