The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life...
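The idea of deriving a wrapper from the similarities and differences between pages can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes that two pages share one template, uses Python's standard `difflib.SequenceMatcher` as a stand-in for the page comparison, and treats every mismatching region as a data field. The names `tokenize`, `infer_template`, `extract`, and the `#FIELD` marker are hypothetical, chosen only for this illustration.

```python
import re
from difflib import SequenceMatcher

FIELD = "#FIELD"  # hypothetical marker for a data slot in the template


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank ones."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def infer_template(page_a, page_b):
    """Keep tokens common to both pages; mark each differing region as a field."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])   # shared markup: part of the template
        else:
            template.append(FIELD)      # mismatch: assumed to be variable data
    return template


def extract(template, page):
    """Apply the template to a page and return the field values, or None."""
    parts = ["(.*?)" if tok == FIELD else re.escape(tok) for tok in template]
    match = re.fullmatch(r"\s*".join(parts), page.strip(), re.S)
    return [g.strip() for g in match.groups()] if match else None


page_a = "<html><b>Title:</b> Dune <i>1965</i></html>"
page_b = "<html><b>Title:</b> Emma <i>1815</i></html>"
template = infer_template(page_a, page_b)
values = extract(template, "<html><b>Title:</b> Ulysses <i>1922</i></html>")
```

In this toy setting, the text that differs between the two sample pages becomes a field, and the inferred template can then be applied to a third page with the same layout. Optional elements, repeated lists, and nested structure need considerably more machinery than this sketch provides.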
Many Web sites include significant and substantial pieces of information, in a way that is often difficult to share, correlate and maintain. In many cases the management of a Web site can greatly benefit from the adoption of methods and techniques borrowed from the database field. This paper introduces a methodology for designing and maintaining large Web sites...
Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools...
The paper develops Editor, a language for manipulating semi-structured documents, such as the ones typically available on the Web. Editor programs are based on two simple ideas, taken from text editors: "search" instructions are used to select regions of interest in a document, and "cut & paste" to restructure them. We study the expressive power and the...
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However,...
Recent results in schema-mapping and data-exchange research may be considered the starting point for a new generation of systems, capable of dealing with a significantly larger class of applications. In this paper we demonstrate the first of these second-generation systems, called ++Spicy. We introduce a number of scenarios from a variety of data management...
Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper develops two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual...
We develop a new algorithm for clustering search results. Unlike many other clustering systems recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses Latent Semantic Indexing on the whole document content. A main contribution of the paper is a...