The Eigentrust algorithm for reputation management in P2P networks
TLDR
Describes an algorithm that decreases the number of downloads of inauthentic files in a peer-to-peer file-sharing network by assigning each peer a unique global trust value based on the peer's history of uploads.
Similarity flooding: a versatile graph matching algorithm and its application to schema matching
TLDR
This paper presents a matching algorithm, based on a fixpoint computation, that is usable across different scenarios. It also reports a user study in which an accuracy metric was used to estimate the labor that users could save by using the algorithm to obtain an initial matching.
Combating Web Spam with TrustRank
TLDR
This paper proposes techniques to semi-automatically separate reputable, good pages from spam, and shows that they can effectively filter out spam from a significant fraction of the web, based on a good seed set of fewer than 200 sites.
Database Systems: The Complete Book
From the Publisher: This introduction to database systems offers a readable, comprehensive approach with engaging, real-world examples. Users will learn how to successfully plan a database application...
Crawling the Hidden Web
TLDR
Introduces a generic operational model of a hidden Web crawler and describes how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford.
Efficient Crawling Through URL Ordering
TLDR
This paper studies the order in which a crawler should visit the URLs it has seen so as to obtain more "important" pages first, and shows that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
The TSIMMIS Project: Integration of Heterogeneous Information Sources
TLDR
Gives an overview of the TSIMMIS project, describing components that extract properties from unstructured objects, translate information into a common object model, combine information from several sources, allow browsing of information, and manage constraints across heterogeneous sites.
Web Spam Taxonomy
TLDR
This paper presents a comprehensive taxonomy of current spamming techniques, which the authors believe can help in developing appropriate countermeasures.
Swoosh: a generic approach to entity resolution
TLDR
This work formalizes the generic ER problem, treating the functions for comparing and merging records as black boxes, and identifies four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms.
Extracting structured data from Web pages
TLDR
This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.