Dima: A Distributed In-Memory Similarity-Based Query Processing System

@article{Sun2017DimaAD,
  title={Dima: A Distributed In-Memory Similarity-Based Query Processing System},
  author={Ji Sun and Zeyuan Shang and Guoliang Li and Dong Deng and Zhifeng Bao},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={1925-1928}
}
Data analysts in industry spend more than 80% of their time on data cleaning and integration over the whole data analytics process, because of data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join. Dima extends the SQL… 
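
To make the two operations named in the abstract concrete, here is a minimal single-machine sketch. It is not Dima's implementation: the Jaccard measure over whitespace tokens, the nested-loop evaluation, and the function names `jaccard`, `similarity_search`, and `similarity_join` are all assumptions chosen for illustration, and nothing here is distributed or indexed.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # convention assumed here: two empty records are identical
    return len(sa & sb) / len(sa | sb)


def similarity_search(records, query, threshold):
    """Return every record whose similarity to the query meets the threshold."""
    return [r for r in records if jaccard(r, query) >= threshold]


def similarity_join(left, right, threshold):
    """Return every (l, r) pair across two collections that meets the threshold.

    Hypothetical brute-force version: it compares all pairs, which is the
    cost that signature-based filtering is meant to avoid.
    """
    return [(l, r) for l in left for r in right if jaccard(l, r) >= threshold]


records = ["vldb 2017 demo", "vldb 2017 demo track", "sigmod 2016"]
matches = similarity_search(records, "vldb 2017 demo", 0.7)
pairs = similarity_join(["data cleaning"], ["data cleaning", "data integration"], 0.5)
```

Search compares one query against a collection, while join compares two collections against each other, so the brute-force join grows with the product of the two collection sizes.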

Balance-Aware Distributed String Similarity-Based Query Processing System
TLDR
This paper develops a distributed in-memory similarity-based query processing system called Dima, and proposes balance-aware signatures where two records are similar if they share common signatures, and can adaptively select the signatures to balance the workload.
Supporting Similarity Queries in Apache AsterixDB
TLDR
The support for similarity queries in Apache AsterixDB, a parallel, open-source Big Data management system for NoSQL data, is described, including the support provided at the query language level, indexing, execution plans, plan rewrites to optimize query execution, and so on.
Semi-Stream Similarity Join Processing in a Distributed Environment
TLDR
DSim-Join minimizes the data transmission, reduces database accesses using a cache in a distributed stream processing engine, parallelizes join processing, and balances the load between parallel join threads.
Parallelizing String Similarity Join Algorithms
TLDR
This paper proposes a parallelization framework for string similarity joins that utilizes existing SSJ algorithms, partitions the data using a variety of partitioning strategies, and then executes the SSJ algorithms on the partitions in parallel.
Set Similarity Joins on MapReduce: An Experimental Survey
TLDR
This paper surveys ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm and empirically compares the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications.
Internal and external memory set containment join
TLDR
A novel adaptive data partition method that is designed to fully leverage the available memory and achieve high I/O efficiency, thereby exhibiting outstanding performance for external memory set containment join.
LCJoin: Set Containment Join via List Crosscutting
TLDR
The prefix tree structure is utilized and extended and the novel list intersection method is extended to operate on the prefix tree to improve the efficiency and share computation in set containment join methods.
Human-in-the-loop Data Integration
TLDR
A hybrid human-machine data integration framework that harnesses human ability to address this problem, and applies initially to the problem of entity matching, and develops a crowd-powered database system CDB.

References

Efficient parallel set-similarity joins using MapReduce
TLDR
This paper proposes a 3-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework, and reports results from extensive experiments on real datasets to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
Efficient similarity joins for near-duplicate detection
TLDR
This article proposes new filtering techniques that exploit token ordering information to drastically reduce the candidate sizes, and hence improve the efficiency of existing algorithms for finding pairs of records whose similarities are no less than a given threshold.
MassJoin: A mapreduce-based method for scalable string similarity joins
TLDR
A MapReduce-based framework, called MassJoin, is proposed, which supports both set-based similarity functions and character-based similarity functions, and extends the existing partition-based signature scheme to support set-based similarity functions.
PASS-JOIN: A Partition-based Method for Similarity Joins
TLDR
This paper studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, using a partition-based method called Pass-Join.
Efficient set joins on similarity predicates
TLDR
This paper presents an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance that generalize to several weighted and unweighted measures of partial word overlap between sets.
String Similarity Joins: An Experimental Evaluation
TLDR
This paper provides a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics.
Can we beat the prefix filtering?: an adaptive framework for similarity join and search
TLDR
This paper proposes an adaptive framework to support similarity join and search, together with a cost model that judiciously selects an appropriate prefix for each object.
An Efficient Partition Based Method for Exact Set Similarity Joins
TLDR
A partition scheme to partition the sets into several subsets and guarantee that two sets are similar only if they share a common subset, and an adaptive grouping mechanism that can reduce the complexity to O(s log s).
Efficient exact set-similarity joins
TLDR
This paper proposes new algorithms for SSJoin that are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees, which are believed to be the first to have both features.
Fast-join: An efficient method for fuzzy token matching based string similarity join
TLDR
This paper proposes a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions by allowing fuzzy matches between two tokens, achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.