Untangling Text Data Mining

@inproceedings{Hearst1999UntanglingTD,
  title={Untangling Text Data Mining},
  author={Marti A. Hearst},
  booktitle={ACL},
  year={1999}
}
The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information. In this paper I will… 

Data Mining of Text Files
A Brief Survey of Text Mining
TLDR
The main analysis tasks (preprocessing, classification, clustering, information extraction and visualization) are described, and a number of successful applications of text mining are discussed.
Text analysis and knowledge mining system
TLDR
By applying the prototype system named TAKMI (Text Analysis and Knowledge Mining) to textual databases in PC help centers, the system can automatically detect product failures; determine issues that have led to rapid increases in the number of calls and their underlying reasons; and analyze help center productivity and changes in customers' behavior involving a particular product, without reading any of the text.
Framework for Knowledge Discovery from Journal Articles Using Text Mining Techniques
TLDR
This study discusses text mining as a young interdisciplinary field at the intersection of related areas such as information access (otherwise known as information retrieval), computational linguistics, data mining, statistics and natural language processing.
Information Extraction - a text mining approach
TLDR
This paper presents a framework for text mining, called DISCOTEX (discovery from text extraction), using a learned information extraction system to transform text into more structured data which is then mined for interesting relationships.
MINING : PRINCIPLES AND APPLICATIONS
TLDR
In this paper, some of the many possibilities of Data Mining on text collection are explored and the main ideas about how to pursue Text Mining are briefly outlined.
Pipelines for Ad-hoc Large-scale Text Mining
TLDR
This thesis contributes to the question of how to address information needs from text mining ad-hoc in an efficient and domain-robust manner and shows that text analysis pipelines can be designed automatically, which process only portions of text that are relevant for the information need at hand.
Mining with Information Extraction
TLDR
A framework for text mining is presented, called DISCOTEX (Discovery from Text EXtraction), using a learned information extraction system to transform text into more structured data which is then mined for interesting relationships.
Using Information Extraction to Aid the Discovery of Prediction Rules from Text
TLDR
A system called DiscoTEX is described, that combines IE and KDD methods to perform a text mining task, discovering prediction rules from natural-language corpora by integrating an IE module based on Rapier and a rule-learning module, Ripper.
TEXT MINING ALGORITHM DISCOTEX (DISCOVERY FROM TEXT EXTRACTION) WITH INFORMATION EXTRACTION
TLDR
A framework for text mining is presented, called DISCOTEX (Discovery from Text EXtraction), using a learned information extraction system to transform text into more structured data which is then mined for interesting relationships.

References

SHOWING 1-10 OF 60 REFERENCES
Knowledge Discovery in Textual Databases (KDT)
TLDR
This research combines the KDD and text categorization paradigms and suggests advances to the state of the art in both areas.
Visualization Techniques to Explore Data Mining Results for Document Collections
TLDR
Document Explorer is a system that offers various preprocessing tools to prepare collections of text or multimedia documents which are available in distributed environments and includes data mining methods based on searching for patterns like frequent sets or association rules, which are used in this system as a highly interactive technique to present the mining results.
Learning to Extract Symbolic Knowledge from the World Wide Web
TLDR
The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web, and several machine learning algorithms for this task are described.
Integrated Support for Data Archeology
TLDR
This work describes a system that supports the data archaeologist with a natural, object-oriented representation of an application domain; a powerful query language and database translation routines; and an easy-to-use and flexible user interface that supports interactive exploration.
Keyword-Based Browsing and Analysis of Large Document Sets
TLDR
The KDT system for Knowledge Discovery in Texts is described, built on top of a text-categorization paradigm where text articles are annotated with keywords organized in a hierarchical structure, providing a general framework for knowledge discovery and exploration in collections of unstructured text.
Query expansion using lexical-semantic relations
TLDR
Examination of the utility of lexical query expansion in the large, diverse TREC collection shows this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought even when the concepts to be expanded are selected by hand.
Query expansion using local and global document analysis
TLDR
It is shown that using global analysis techniques, such as word context and phrase structure, on the local set of documents produces results that are both more effective and more predictable than simple local feedback.
Galaxy of news: an approach to visualizing and understanding expansive news landscapes
TLDR
This research has been generalized into a model for news access and visualization to provide automatic construction of news information spaces and derivation of an interactive news experience.
Lexical Discovery with an Enriched Semantic Network
TLDR
This paper introduces a database system called FreeNet that facilitates the description and exploration of finite binary relations and describes the design and implementation of Lexical FreeNet, a semantic network that mixes WordNet-derived semantic relations with data-derived and phonetically-derived relations.
Using large corpora
Introduction to the special issue on computational linguistics using large corpora, Kenneth W. Church and Robert L. Mercer; generalized probabilistic LR parsing of natural language (corpora) with