Alon Y. Halevy

Learn More
We witness a rapid increase in the number of structured information sources that are available online, especially on the WWW. These sources include commercial databases on product information, stock market information, real estate, automobiles, and entertainment. We would like to use the data stored in these databases to answer complex queries that go(More)
Ontologies play a prominent role on the Semantic Web. They make possible the widespread publication of machine understandable data, opening myriad opportunities for automated information processing. However, because of the Semantic Web's distributed nature, data on it will inevitably come from many different ontologies. Information processing across(More)
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to(More)
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem.(More)
Intuitively, data management and data integration tools should be well-suited for exchanging information in a semantically meaningful way. Unfortunately, they suffer from two significant problems: they typically require a comprehensive schema design before they can be used to store or share information, and they are difficult to extend because schema(More)
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational(More)