David Buttler

This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural…
Despite the existence of several noun phrase coreference resolution data sets as well as several formal evaluations on the task, it remains frustratingly difficult to compare results across different coreference resolution systems. This is due to the high cost of implementing a complete end-to-end coreference resolution system, which often forces…
We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for…
We have created a software infrastructure called Reconcile that is a platform for the development of learning-based noun phrase (NP) coreference resolution systems. Reconcile’s architecture was designed to facilitate the rapid creation of coreference resolution systems, easy implementation of new feature sets and approaches to coreference resolution, and…
This paper presents WebCQ, a continual query system for large-scale Web information monitoring. WebCQ is designed to discover and detect changes to Web pages efficiently, and to notify users of interesting changes with personalized messages. Users' Web page monitoring requests are modeled as continual queries on the Web and referred to as Web page…
We show that users have different reading behavior when evaluating the interestingness of articles, calling for different parameter configurations for information retrieval algorithms for different users. Better recommendation results can be made if parameters for common information retrieval algorithms, such as the Rocchio algorithm, are learned…
We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped…
Conditional random fields (CRFs), which are popular supervised learning models for many natural language processing (NLP) tasks, typically require a large collection of labeled data for training. In practice, however, manual annotation of text documents is quite costly. Furthermore, even large labeled training sets can have arbitrarily limited performance…