Learn More
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4(More)
Witten and Frank's textbook was one of two books that I used for a data mining class in the Fall of 2001. The book covers all major methods of data mining that produce a knowledge representation as output. Knowledge representation is hereby understood as a representation that can be studied, understood, and interpreted by human beings, at least in(More)
This book published in 1999 is a revised second edition of the version that first appeared in 1994. The authors have incorporated many changes in the second edition by updating various chapters to include the latest developments. The book is concerned with the task of managing large volumes of text and image data amounting to several gigabytes. The(More)
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with(More)
The recently developed technique of arithmetic coding, in conjunction with a Markov model of the source, is a powerful method of data compression in situations where a linear treatment is inappropriate. Adaptive coding allows the model to be constructed dynamically by both encoder and decoder during the course of the transmission, and has been shown to(More)
Keyphrases provide semantic metadata that summarize and characterize documents. Kea is an algorithm for automatically extracting keyphrases from text. We use a large test corpus to evaluate its effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available. Kea identifies(More)
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for(More)
relatedness between terms using the links found within their corresponding Wikipedia articles. Unlike other techniques based on Wikipedia, WLM is able to provide accurate measures efficiently, using only the links between articles rather than their textual content. Before describing the details, we first outline the other systems to which it can be(More)