Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents

Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, Jiawei Han
We introduce a framework for topical keyphrase generation and ranking, based on the output of a topic model run on a collection of short documents. By shifting from the unigram-centric traditional methods of keyphrase extraction and ranking to a phrase-centric approach, we are able to directly compare and rank phrases of different lengths. Our method defines a function to rank topical keyphrases so that more highly ranked keyphrases are considered to be more representative phrases for that topic…

Most Important First - Keyphrase Scoring for Improved Ranking in Settings With Limited Keyphrases

This paper proposes scoring the extracted keyphrases as a post-processing step and reranking them by that score, in order to improve precision and recall, particularly for the top phrases.

TOP-Rank: A Novel Unsupervised Approach for Topic Prediction Using Keyphrase Extraction for Urdu Documents

A novel unsupervised approach to topic prediction for the Urdu language is introduced; it extracts more significant information from the documents, outperforms existing techniques, and produces more meaningful topics.

Scalable Topical Phrase Mining from Text Corpora

This work proposes a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition that discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets.

Scalable and Robust Construction of Topical Hierarchies

A scalable and robust algorithm is proposed for constructing a hierarchy of topics from a text collection based on a tensor orthogonal decomposition technique, which reduces the time of construction by several orders of magnitude and renders it possible for users to interactively revise the hierarchy.

Selecting Article Segment Titles Based on Keyphrase Features and Semantic Relatedness

  • Yuming Guo, M. Iwaihara
  • Computer Science
    2018 7th International Congress on Advanced Applied Informatics (IIAI-AAI)
  • 2018
A method is proposed for selecting titles for segments in long documents; it is evaluated experimentally on Wikipedia articles and combines the features SimPF and Embedding-vector to enhance efficiency and rationality.

CITPM: A Cluster-Based Iterative Topical Phrase Mining Framework

A novel framework CITPM for topical phrase mining is presented, which views a corpus as a mixture of clusters (domains), and each cluster is characterized by documents sharing similar topical distributions.

Exploiting extensible background knowledge for clustering-based automatic keyphrase extraction

SemCluster is introduced, a clustering-based unsupervised keyphrase extraction method that addresses the coverage limitation problem by using an extensible approach that integrates an internal ontology (i.e., WordNet) with other knowledge sources to gain a wider background knowledge.

Extracting representative phrases from Wikipedia article sections

  • Shan Liu, M. Iwaihara
  • Computer Science
    2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS)
  • 2016
This work aims at extracting informative phrases that readers can refer to within the same Wikipedia article, and combines Normalized Google Distance and nDCG to measure semantic relatedness between generated phrases and hidden original section titles.



Automatic Keyphrase Extraction via Topic Decomposition

A Topical PageRank (TPR) is built on a word graph to measure word importance with respect to different topics, and TPR is shown to outperform state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.

Clustering to Find Exemplar Terms for Keyphrase Extraction

This work proposes an unsupervised method for keyphrase extraction that outperforms state-of-the-art graph-based ranking methods (TextRank) by 9.5% in F1-measure and guarantees the document to be semantically covered by these exemplar terms.

Learning Algorithms for Keyphrase Extraction

The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5).

Using Noun Phrase Heads to Extract Document Keyphrases

The simple noun phrase-based system performs roughly as well as a state-of-the-art, corpus-trained keyphrase extractor; ratings for individual keyphrases do not necessarily correlate with ratings for sets of keyphrases for a document.

Extracting key terms from noisy and multitheme documents

Evaluations of the method show that it outperforms existing methods producing key terms with higher precision and recall, and appears to be substantially more effective on noisy and multi-theme documents than existing methods.

Topical Keyphrase Extraction from Twitter

A context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking are proposed.

Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining

A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

This article presents a hierarchical generative probabilistic model of topical phrases that simultaneously infers the location, length, and topic of phrases within a corpus and relaxes the bag-of-words assumption within phrases by using a hierarchy of Pitman-Yor processes.

Cumulated gain-based evaluation of IR techniques

This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
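
As a rough illustration of the cumulated-gain idea summarized above, the sketch below computes a discounted cumulated gain and its normalized form for a ranked list of graded relevance scores. It assumes the widely used log2(rank + 1) discount, which is a common variant and not necessarily the exact formulation in the article; the function names `dcg` and `ndcg` are hypothetical.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulated gain of a ranked list of graded relevance scores.

    Assumption: the common log2(rank + 1) discount is used, so the item at
    rank 1 is not discounted and later ranks contribute progressively less.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # A ranking that places the most relevant item last scores below 1.0.
    print(ndcg([1, 2, 3]))
```

Normalizing by the ideal ordering keeps scores comparable across lists with different numbers of relevant items.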

A Language Model Approach to Keyphrase Extraction

A new approach is proposed that uses pointwise KL-divergence between multiple language models to score both phraseness and informativeness, which can be unified into a single score to rank extracted phrases.
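
The scoring idea in the summary above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes phraseness compares a phrase's probability under a foreground phrase-level model with the product of its unigram probabilities, that informativeness compares the foreground phrase probability with a background one, and that the two are combined by simple addition. The function names and all probability values are hypothetical.

```python
import math


def pointwise_kl(p: float, q: float) -> float:
    """Pointwise KL contribution of a single event: p * log(p / q)."""
    return p * math.log(p / q)


def phrase_score(p_fg_phrase: float, p_fg_unigrams: float, p_bg_phrase: float) -> float:
    """Combine phraseness and informativeness into one ranking score.

    Assumptions (not taken from the summary above):
      p_fg_phrase   - phrase probability under a foreground phrase-level model
      p_fg_unigrams - product of the phrase's unigram probabilities (foreground)
      p_bg_phrase   - phrase probability under a background model
    The two components are simply summed here.
    """
    phraseness = pointwise_kl(p_fg_phrase, p_fg_unigrams)
    informativeness = pointwise_kl(p_fg_phrase, p_bg_phrase)
    return phraseness + informativeness


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    print(phrase_score(0.01, 0.001, 0.0001))
```

A phrase scores highly when it is both more likely as a unit than its words are independently and more likely in the foreground corpus than in the background.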