Keyword extraction from a single document using word co-occurrence statistical information

Yutaka Matsuo and Mitsuru Ishizuka
Int. J. Artif. Intell. Tools
We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first, then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is…
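
The steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract above is truncated, so the bias measure used here (a chi-square statistic against the pooled co-occurrence distribution) is an assumption of this sketch, as are the naive sentence splitter, the tokenizer, the `num_frequent` parameter, and the function names. The sketch also does no stemming, stop-word removal, or term clustering.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break on . ! ? and keep lowercase words."""
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        tokens = set(re.findall(r"[a-z]+", raw.lower()))
        if tokens:
            sentences.append(tokens)
    return sentences


def cooccurrence_bias(text, num_frequent=10):
    """Score each term by how biased its co-occurrence with frequent terms is.

    Hypothetical helper: the score is a chi-square statistic comparing a
    term's observed co-occurrence counts with the counts expected if it
    co-occurred with frequent terms in the same proportions as all terms do.
    """
    sentences = split_sentences(text)
    # Number of sentences each term appears in.
    freq = Counter(t for s in sentences for t in s)
    frequent = [t for t, _ in freq.most_common(num_frequent)]

    # cooc[(a, g)]: number of sentences containing both term a and frequent term g.
    cooc = Counter()
    for s in sentences:
        for g in frequent:
            if g in s:
                for a in s:
                    if a != g:
                        cooc[(a, g)] += 1

    # Pooled co-occurrence count of every frequent term g.
    total_g = {g: sum(cooc[(a, g)] for a in freq) for g in frequent}

    scores = {}
    for a in freq:
        # A term cannot co-occur with itself, so leave it out of its own test.
        others = [g for g in frequent if g != a]
        n_a = sum(cooc[(a, g)] for g in others)
        norm = sum(total_g[g] for g in others)
        chi2 = 0.0
        if n_a > 0 and norm > 0:
            for g in others:
                expected = n_a * total_g[g] / norm
                if expected > 0:
                    chi2 += (cooc[(a, g)] - expected) ** 2 / expected
        scores[a] = chi2
    return scores


if __name__ == "__main__":
    sample = "cat dog. cat dog. cat bird. dog fish. cat dog fish."
    ranked = sorted(cooccurrence_bias(sample, num_frequent=2).items(),
                    key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        print(f"{term}\t{score:.3f}")
```

In the toy input, "bird" appears only alongside "cat", so its co-occurrence distribution is the most biased and it receives the highest score, while "fish" co-occurs with both frequent terms and scores lower. On real documents a larger `num_frequent` and proper preprocessing would be needed for the scores to be meaningful.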


Keyword Extraction Using Word Co-occurrence
The results show that using word co-occurrence information can improve precision and recall over tf.idf, and some alternative relevance measures that do use relations between words are studied.
Improved algorithm for keywords extraction from documents without corpus
  • J. Chen, J. Wu
  • Economics, Education
    2009 IEEE 10th International Conference on Computer-Aided Industrial Design & Conceptual Design
  • 2009
In this paper, an algorithm for extracting keywords without corpus is described. We use the co-occurrence information of the words and the biases of distribution to extract the more important words
Applying frequency and location information to keyword extraction in single document
  • Ying Qin
  • Economics, Computer Science
    2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems
  • 2012
Experimental results show that the keyword extraction approach based on statistical information of words outperforms TFIDF, TextRank and other unsupervised methods when compared on the same corpus.
A modified approach to keyword extraction based on word-similarity
This paper proposes a new method to build a word similarity thesaurus using the semantic information from the thesaurus, together with TF.IDF and each word's first occurrence; a keyword extraction algorithm is demonstrated, and the results and analysis are given.
A Study on Keyword Extraction From a Single Document Using Term Clustering
  • S. Han
  • Economics, Education
  • 2010
It showed that a new keyword extraction algorithm applied to a single document with term clustering fulfills the necessary conditions which good keywords should have.
Keyword Extraction Based on Lexical Chains and Word Co-occurrence for Chinese News Web Pages
Lexical chains and word co-occurrence are combined in this paper to extract keywords for Chinese news Web pages in the proposed algorithm KELCC, which is not domain-specific and can be applied to a single Web page without corpus.
Keyword Extraction Based on Word Co-Occurrence Statistical Information for Arabic Text
The results of these experiments showed that the χ² method can be applied to Arabic documents and that its performance is acceptable compared with other techniques.
Automatic Keyword Extraction From Any Text Document Using N-gram Rigid Collocation
A fuzzy set theoretic approach, fuzzy n-gram indexing, is used to extract n-gram keywords, which neither requires a dictionary or thesaurus nor depends on the size of the text document.
Keyword Extraction from Short Documents Using Three Levels of Word Evaluation
A novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) is proposed where each document is assessed on three levels: corpus level, cluster level and document level.
A Statistical Approach of Keyword Extraction for Efficient Retrieval
The keyword extraction is improved using a hybrid technique in which the entire document is split into multiple domains using a master keyword and the frequency of all unique words is found in every domain.


Similarity-Based Models of Word Cooccurrence Probabilities
This work proposes a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words, and describes probabilistic word association models based on distributional word similarity, and applies them to two tasks, language modeling and pseudo-word disambiguation.
KeyGraph: automatic indexing by co-occurrence graph based on building construction metaphor
KeyGraph presents an algorithm for extracting keywords representing the asserted main point in a document, without relying on external devices such as natural-language processing tools or a document corpus, based on the segmentation of a graph.
Text Mining at the Term Level
This paper describes the Term Extraction module of the Document Explorer system, and provides experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995–1996.
This paper gives an overview of the principles and methods of automatic term recognition and examines two major trends: studies in automatic recognition of significant elements for indexing, mainly carried out in information-retrieval circles, and current research in automatic term recognition in the field of computational linguistics.
Distributional Clustering of English Words
Deterministic annealing is used to find lowest distortion sets of clusters: as the annealed parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data.
Text Mining, knowledge extraction from unstructured textual data
This paper presents two examples of information that can be automatically extracted from text collections: probabilistic associations of key-words and prototypical document instances and the Natural Language Processing tools necessary for such extractions.
A statistical interpretation of term specificity and its application in retrieval
It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms.
An algorithm for suffix stripping
An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL and performs slightly better than a much more elaborate system with which it has been compared.
Statistical Models for Co-occurrence Data
A statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two finite sets is developed and a novel family of mixture models is proposed.