The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life…
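The idea of deriving a wrapper from the similarities and differences between pages can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes only two sample pages, uses Python's standard `difflib` alignment as a stand-in for the comparison step, and the names `FIELD`, `tokenize`, `infer_wrapper`, and `extract` are hypothetical. Tokens shared by both pages are treated as template, and every mismatching region collapses into one data slot.

```python
import re
from difflib import SequenceMatcher

FIELD = "#FIELD"  # hypothetical placeholder marking a data slot


def tokenize(html):
    """Split HTML into tags and non-empty text chunks."""
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]


def infer_wrapper(page_a, page_b):
    """Align two pages; shared tokens form the template, differences become fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    wrapper = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        elif not wrapper or wrapper[-1] != FIELD:
            # Any replace/insert/delete region is assumed to hold data.
            wrapper.append(FIELD)
    return wrapper


def extract(wrapper, page):
    """Turn the wrapper into a regex and return the field values, or None."""
    pattern = r"\s*".join(
        "(.*?)" if tok == FIELD else re.escape(tok) for tok in wrapper
    )
    match = re.fullmatch(r"\s*" + pattern + r"\s*", page, flags=re.DOTALL)
    return [g.strip() for g in match.groups()] if match else None


page_a = "<html><b>Title:</b> Dune <i>Herbert</i></html>"
page_b = "<html><b>Title:</b> Emma <i>Austen</i></html>"
page_c = "<html><b>Title:</b> Ulysses <i>Joyce</i></html>"

wrapper = infer_wrapper(page_a, page_b)
print(wrapper)
print(extract(wrapper, page_c))  # ['Ulysses', 'Joyce']
```

A flat alignment like this cannot represent repeated or optional structures such as lists of varying length, so it should be read as an illustration of the similarity-and-difference idea only.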
Many Web sites include significant and substantial pieces of information, in a way that is often difficult to share, correlate and maintain. In many cases the management of a Web site can greatly benefit from the adoption of methods and techniques borrowed from the database field. This paper introduces a methodology for designing and maintaining large Web sites…
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However,…
Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper develops two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual…
We develop a new algorithm for clustering search results. Unlike many other clustering systems recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets but instead uses Latent Semantic Indexing on the whole document content. A main contribution of the paper is a…
The paper discusses the issue of views in the Web context. We introduce a set of languages for managing and restructuring data coming from the World Wide Web. We present a specific data model, called the ARANEUS Data Model, inspired by the structures typically present in Web sites. The model allows us to describe the scheme of a Web hypertext, in the spirit…