Hector Garcia-Molina

Peer-to-peer file-sharing networks are currently receiving much attention as a means of sharing and distributing information. However, as recent experience shows, the anonymous, open nature of these networks offers an almost ideal environment for the spread of self-replicating inauthentic files. We describe an algorithm to decrease the number of downloads of(More)
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be(More)
Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and(More)
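As a rough illustration of what a fixpoint computation over two graphs can look like, the sketch below repeatedly propagates similarity between pairs of nodes whose neighbors are also paired, normalizes the scores, and stops once they no longer change. It is a simplified stand-in under stated assumptions, not the algorithm from the paper: the function name `match_graphs`, the unlabeled edge lists, the uniform initial similarity of 1.0, and the unweighted propagation rule are all hypothetical choices made for this example.

```python
from itertools import product


def match_graphs(edges1, edges2, max_iters=100, eps=1e-6):
    """Score every node pair (n1, n2) by iterating to an approximate fixpoint.

    edges1 / edges2 are lists of (source, target) tuples (assumed format).
    """
    nodes1 = sorted({n for edge in edges1 for n in edge})
    nodes2 = sorted({n for edge in edges2 for n in edge})
    pairs = list(product(nodes1, nodes2))

    # Two node pairs are neighbors when both graphs have an edge joining them.
    nbrs = {p: [] for p in pairs}
    for (a1, a2), (b1, b2) in product(edges1, edges2):
        nbrs[(a1, b1)].append((a2, b2))
        nbrs[(a2, b2)].append((a1, b1))

    initial = {p: 1.0 for p in pairs}  # assumption: no prior knowledge
    sim = dict(initial)
    for _ in range(max_iters):
        # Each pair receives its initial score plus its neighbors' scores.
        new = {p: initial[p] + sum(sim[q] for q in nbrs[p]) for p in pairs}
        top = max(new.values())
        new = {p: v / top for p, v in new.items()}  # normalize to [0, 1]
        delta = max(abs(new[p] - sim[p]) for p in pairs)
        sim = new
        if delta < eps:  # scores stopped changing: fixpoint reached
            break
    return sim


if __name__ == "__main__":
    g1 = [("a", "b"), ("b", "c")]
    g2 = [("x", "y"), ("y", "z")]
    scores = match_graphs(g1, g2)
    best_for_b = max(["x", "y", "z"], key=lambda n: scores[("b", n)])
    print(best_for_b)
```

On two three-node chains, the middle nodes end up as each other's best match because their pair has the most supporting neighbor pairs. The iteration cap guards against inputs where this simplified rule does not settle.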
The goal of the Tsimmis Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and unstructured data. This paper gives an overview of the project, describing components that extract properties from unstructured objects, that translate information into a common object model, that(More)
We address the problem of providing integrated access to diverse and dynamic information sources. We explain how this problem differs from the traditional database integration problem and we focus on one aspect of the information integration problem, namely information exchange. We define an object-based information exchange model and a corresponding query(More)
Managing transactions with real-time requirements presents many new problems. In this paper we focus on two: How can we schedule transactions with deadlines? How do the real-time constraints affect concurrency control? We describe a new group of algorithms for scheduling real-time transactions which produce serializable schedules. We present a model for(More)
Current-day crawlers retrieve content from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable(More)
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this(More)
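The idea of visiting "important" URLs first can be sketched with a priority queue over the crawl frontier. The example below is a minimal sketch, not the paper's method: it assumes importance is approximated by the number of backlinks discovered so far, and the function name `crawl_order`, the in-memory `links` dictionary standing in for the Web, and the `limit` parameter are all hypothetical.

```python
import heapq
from collections import defaultdict


def crawl_order(links, seeds, limit):
    """Return up to `limit` URLs in visit order, preferring the URLs with
    the most backlinks seen so far (an assumed importance metric).

    links maps a URL to the list of URLs it points to (assumed format).
    """
    backlinks = defaultdict(int)
    visited = []
    done = set()
    # Heap entries are (-backlink_count_when_pushed, url); outdated entries
    # are detected and skipped when popped.
    heap = [(0, url) for url in seeds]
    heapq.heapify(heap)

    while heap and len(visited) < limit:
        neg_count, url = heapq.heappop(heap)
        if url in done or -neg_count != backlinks[url]:
            continue  # already visited, or a newer entry exists for this URL
        done.add(url)
        visited.append(url)
        for target in links.get(url, []):
            backlinks[target] += 1
            if target not in done:
                heapq.heappush(heap, (-backlinks[target], target))
    return visited


if __name__ == "__main__":
    web = {
        "a": ["b", "c", "d"],
        "b": ["e"],
        "c": ["e", "f"],
        "d": ["e"],
    }
    print(crawl_order(web, ["a"], limit=4))
```

In this toy graph, page `e` is fetched before `d` even though `d` was discovered earlier, because `e` has accumulated two backlinks by the time the crawler chooses its fourth page. When the crawl budget (`limit`) is small, this kind of ordering decides which pages are obtained at all.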
In a peer-to-peer (P2P) system, nodes typically connect to a small set of random nodes (their neighbors), and queries are propagated along these connections. Such query flooding tends to be very expensive. We propose that node connections be influenced by content, so that, for example, nodes having many “Jazz” files will connect to other similar nodes. Thus,(More)
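One way to picture content-influenced connections is to group nodes by the category that dominates their files, so that a query for that category only needs to reach the matching group. The sketch below is an assumption-laden illustration rather than the paper's design: the names `build_overlays` and `nodes_for_query`, the single-dominant-category rule, and the per-file category labels are all hypothetical.

```python
from collections import Counter, defaultdict


def build_overlays(node_files):
    """Group nodes by their most common file category.

    node_files maps a node id to a list of category labels, one per file
    (assumed format). Nodes without files join no group.
    """
    overlays = defaultdict(set)
    for node, categories in node_files.items():
        if not categories:
            continue
        dominant, _count = Counter(categories).most_common(1)[0]
        overlays[dominant].add(node)
    return overlays


def nodes_for_query(overlays, category):
    """Return the nodes a query for `category` would be sent to."""
    return overlays.get(category, set())


if __name__ == "__main__":
    files = {
        "n1": ["Jazz", "Jazz", "Rock"],
        "n2": ["Jazz"],
        "n3": ["Rock", "Rock"],
        "n4": [],
    }
    overlays = build_overlays(files)
    print(sorted(nodes_for_query(overlays, "Jazz")))
```

Compared with flooding every neighbor, a "Jazz" query here reaches only `n1` and `n2`. The trade-off in this simplified rule is that `n1`'s single "Rock" file becomes invisible to "Rock" queries, which a real design would need to address.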
TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) is a system for integrating information. It offers a data model and a common query language that are designed to support the combining of information from many different sources. It also offers tools for generating automatically the components that are needed to build systems for integrating(More)