This paper presents a fully automated object extraction system, Omini. A distinct feature of Omini is its suite of algorithms and automatically learned information extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web ...
This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural ...
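To make the shingle idea concrete, here is a minimal sketch of structural similarity computed over tag sequences. It is an illustration under stated assumptions, not the surveyed paper's algorithm: the names `tag_sequence`, `shingles`, and `structural_similarity`, the shingle width `k=3`, and the use of plain Jaccard overlap are all choices made for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tags found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def shingles(seq, k=3):
    """Return the set of contiguous k-grams (shingles) of a sequence."""
    if len(seq) < k:
        # Assumption: a short sequence forms a single shingle.
        return {tuple(seq)} if seq else set()
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def structural_similarity(html_a, html_b, k=3):
    """Jaccard overlap of the two documents' tag shingles, in [0, 1]."""
    sa = shingles(tag_sequence(html_a), k)
    sb = shingles(tag_sequence(html_b), k)
    if not sa and not sb:
        return 1.0  # Assumption: two tag-free documents count as identical.
    return len(sa & sb) / len(sa | sb)
```

Because only tag names are compared, two pages generated from the same template but holding different text score 1.0, while pages with no tag triple in common score 0.0.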
Despite the existence of several noun phrase coreference resolution data sets as well as several formal evaluations on the task, it remains frustratingly difficult to compare results across different coreference resolution systems. This is due to the high cost of implementing a complete end-to-end coreference resolution system, which often forces ...
We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for ...
Search engines, such as Google, assign scores to news articles based on their relevancy to a query. However, not all relevant articles for the query may be interesting to a user. For example, if the article is old or yields little new information, the article would be uninteresting. Relevancy scores do not take into account what makes an article ...
In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web ...
We introduce multiple topic tracking (MTT) for iScore to better recommend news articles for users with multiple interests and to address changes in user interests over time. As an extension of the basic Rocchio algorithm, traditional topic detection and tracking, and single-pass clustering, MTT maintains multiple interest profiles to identify interesting ...
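For readers unfamiliar with the basic Rocchio algorithm that MTT is described as extending, a minimal sketch of the standard single-profile update follows. It does not implement MTT itself. The function name `rocchio_update`, the sparse dict representation of term weights, the default coefficients, and the dropping of non-positive weights are assumptions made for illustration.

```python
def rocchio_update(profile, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio update on sparse term-weight vectors (dicts).

    new = alpha * profile
          + beta  * mean(relevant documents)
          - gamma * mean(non-relevant documents)
    """
    updated = {term: alpha * weight for term, weight in profile.items()}

    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # An empty feedback set contributes nothing.
        for doc in docs:
            for term, weight in doc.items():
                # Dividing by len(docs) turns the sum into a centroid.
                updated[term] = updated.get(term, 0.0) + coeff * weight / len(docs)

    # Assumption: negative and zero weights are dropped, a common convention.
    return {term: weight for term, weight in updated.items() if weight > 0}
```

A usage example with hypothetical weights: starting from `{"election": 1.0}`, one interesting article `{"election": 1.0, "senate": 2.0}`, and one uninteresting article `{"sports": 1.0}`, calling the function with `alpha=1.0, beta=0.5, gamma=0.25` moves the profile toward the interesting article and leaves out "sports".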
We show that users have different reading behavior when evaluating the interestingness of articles, calling for different parameter configurations for information retrieval algorithms for different users. Better recommendations can be made if parameters for common information retrieval algorithms, such as the Rocchio algorithm, are learned ...