• Publications
Monitoring k-nearest neighbor queries over moving objects
This work proposes two efficient and scalable grid-index algorithms, one based on indexing objects and the other on indexing queries, for monitoring k-nearest neighbor queries over moving objects within a geographic area, and shows that these algorithms significantly outperform R-tree-based solutions.
LSH Ensemble: Internet-Scale Domain Search
It is proved that an optimal partitioning exists for any data distribution and that, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, it can be approximated using equi-depth partitioning.
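As a rough illustration of what equi-depth partitioning means here, the sketch below splits a list of domain sizes into buckets that each hold about the same number of domains. It is not the paper's implementation: the function name `equi_depth_partition` and the sample sizes are assumptions made for illustration only.

```python
from typing import List, Sequence


def equi_depth_partition(sizes: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Split domain sizes into buckets holding roughly equal numbers of domains.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    ordered = sorted(sizes)
    n = len(ordered)
    buckets = []
    for i in range(num_partitions):
        # Integer arithmetic keeps bucket counts within one of each other.
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        buckets.append(ordered[start:end])
    return buckets


# Skewed (power-law-like) sample sizes, invented for this example.
sample_sizes = [1, 1, 2, 2, 3, 5, 8, 40, 100]
partitions = equi_depth_partition(sample_sizes, 3)
```

With skewed sizes, each bucket covers a very different range of values but the same number of domains, which is the defining property of equi-depth (as opposed to equi-width) partitioning.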
Discovering Linkage Points over Web Data
The basic schema-matching step is replaced with a more complex instance-based schema analysis and linkage discovery, and it is shown that even attributes with different meanings can sometimes be useful in aligning data.
Keyword query cleaning
  • K. Pu, Xiaohui Yu
  • Computer Science, Economics
    Proc. VLDB Endow.
  • 1 August 2008
This paper defines a quality metric for keyword queries, proposes a number of algorithms for cleaning keyword queries optimally, and demonstrates that the basic optimal query cleaning problem can be solved using a dynamic programming algorithm.
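To give a flavor of the dynamic programming idea, the sketch below segments a tokenized query into contiguous segments that maximize a total score. It is a generic segmentation recurrence under stated assumptions, not the paper's algorithm or metric: the `best_segmentation` function, the `max_len` limit, and the toy scoring table are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[str, ...]


def best_segmentation(
    tokens: Sequence[str],
    score: Callable[[Segment], float],
    max_len: int = 3,
) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring split of tokens into contiguous segments.

    best[i] holds the best total score for the first i tokens.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(tuple(tokens[start:end]))
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers to recover the chosen segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(tokens[start:end]))
        end = start
    segments.reverse()
    return best[n], segments


# Toy scores, invented for illustration: known phrases score higher.
PHRASE_SCORES = {("new", "york"): 3.0, ("new",): 1.0, ("york",): 1.0, ("hotel",): 1.0}


def toy_score(segment: Segment) -> float:
    return PHRASE_SCORES.get(segment, 0.0)


total, segments = best_segmentation(["new", "york", "hotel"], toy_score)
```

In this toy setting the recurrence prefers grouping "new york" as one segment because the phrase scores higher than its two tokens scored separately.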
Table Union Search on Open Data
This work defines the table union search problem, presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
Concise descriptions of subsets of structured sets
It is shown that simple set cover is enough to model a number of realistic database structures, and the theory is applied to summarization of large result sets, (multi) query optimization for ROLAP queries, and XML queries.
Scalable Distributed Processing of K Nearest Neighbor Queries over Moving Objects
This work presents a new index structure called Dynamic Strip Index (DSI), which can better adapt to different data distributions than existing grid indexes, and proposes a distributed k-NN search algorithm based on DSI, which is more efficient and more predictable than existing approaches.
Making Open Data Transparent: Data Discovery on Open Data
Open Data poses interesting new challenges for data integration research. One of those challenges is data discovery: how can one find new data sets within this ever-expanding sea of Open Data?
Modeling and control of discrete-event systems with hierarchical abstraction
  • K. Pu
  • Computer Science
  • 2000
In this chapter, the abstraction of E=G would show that the given high-level "supervisor" of Figure 2.19 (b) is in fact not controllable, and must be adjusted.
Data Lake Management: Challenges and Opportunities
This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.