Interactive Navigation of Open Data Linkages

@article{Zhu2017InteractiveNO,
  title={Interactive Navigation of Open Data Linkages},
  author={Erkang Zhu and Ken Q. Pu and Fatemeh Nargesian and Ren{\'e}e J. Miller},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={1837--1840}
}
We developed Toronto Open Data Search to support the ad hoc, interactive discovery of connections or linkages between datasets. It can be used to efficiently navigate through the open data cloud. Our system consists of three parts: a user-interface provided by a Web application; a scalable backend infrastructure that supports navigational queries; and a dynamic repository of open data tables. Our system uses LSH Ensemble, an efficient index structure, to compute linkages (attributes in two…
RONIN: Data Lake Exploration
RONIN is demonstrated, a tool that enables user exploration of a data lake by seamlessly integrating the two common modalities of discovery: data set search and navigation of a hierarchical structure.
Optimizing Organizations for Navigating Data Lakes
This work presents a new probabilistic model of how users interact with an organization and defines the likelihood of a user finding an attribute using the organization, using attribute values and metadata when available.
Data Lake Organization
Through a formal user study, it is shown that navigation can help users discover relevant tables that cannot be found by keyword search and that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.
Organizing Data Lakes for Navigation
A new probabilistic model of how users interact with an organization is presented and an approximate algorithm for the data lake organization problem is proposed that can help users find relevant tables that cannot be found by keyword search.
Open Data Integration
A new paradigm for thinking about integration is introduced where the focus is on data discovery, but highly efficient internet-scale discovery that is driven by data analysis needs.
KTabulator: Interactive Ad hoc Table Creation using Knowledge Graphs
KTabulator is an interactive system to effectively extract, build, or extend ad hoc tables from large corpora, by leveraging their computerized structures in the form of knowledge graphs.
Auctus: A Dataset Search Engine for Data Discovery and Augmentation
The large volumes of structured data currently available, from Web tables to open-data portals and enterprise data, open up new opportunities for progress in answering many important scientific,…
Auctus: A Dataset Search Engine for Data Augmentation
This demo presents the ongoing efforts to develop a dataset search engine tailored for data augmentation, named Auctus, which automatically discovers datasets on the Web and, different from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries.
Loki: Streamlining Integration and Enrichment
Data scientists frequently transform data from one form to another while cleaning, integrating, and enriching datasets. Writing such transformations, or “mapping functions”, is time-consuming and…
Top-k Queries over Digital Traces
This work proposes a suite of indexing techniques and algorithms to enable fast query processing for top-k entities based on a mobility model and designs a hierarchical indexing structure to organize entities in a way that closely associated entities tend to appear together.

References

Discovering Linkage Points over Web Data
The basic schema-matching step is replaced with a more complex instance-based schema analysis and linkage discovery, and it is shown that even attributes with different meanings can sometimes be useful in aligning data.
The Mannheim Search Join Engine
The Mannheim Search Join Engine is presented, which automatically performs table extension operations based on a large corpus of Web data originating from the Web or corporate intranets. It achieves a coverage close to 100% and a precision around 90% for the tasks of extending tables describing cities, companies, countries, drugs, books, films, and songs.
Finding related tables
This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.
A Large Public Corpus of Web Tables containing Time and Context Metadata
A large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl is presented to provide a common ground for evaluating Web table systems.
Mining of Massive Datasets
Determining relevant data is key to delivering value from massive amounts of data, and big data is defined less by volume, which is a constantly moving target, than by its ever-increasing variety, velocity, variability and complexity.
LSH Ensemble: Internet-Scale Domain Search
This work presents a new index structure, Locality Sensitive Hashing Ensemble, that solves the domain search problem using set containment at Internet scale, and proves that there exists an optimal partitioning for any distribution.
Approximate nearest neighbors: towards removing the curse of dimensionality
Two algorithms for the approximate nearest neighbor problem in high-dimensional spaces are presented, which require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d.
On the resemblance and containment of documents
  • A. Broder
  • Mathematics, History
  • Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
  • 1997
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.