Corpus ID: 9870161

SWOOP: Top-k Similarity Joins over Set Streams

@article{Mann2017SWOOPTS,
  title={SWOOP: Top-k Similarity Joins over Set Streams},
  author={W. Mann and Nikolaus Augsten and Christian S. Jensen},
  journal={ArXiv},
  year={2017},
  volume={abs/1711.02476}
}
We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. [...] Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs to update the top-$k$ result at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing…
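To make the problem statement concrete, the following is a minimal brute-force baseline for a top-k similarity join over a sliding window. It is not SWOOP: it has none of the indexing, filtering, or stock maintenance the abstract describes, and it recomputes every pair on each query. The choice of Jaccard similarity, the time-based window, and all names (`jaccard`, `NaiveTopKWindow`, `add`, `top_k`) are assumptions made for illustration only.

```python
from collections import deque
from heapq import nlargest
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKWindow:
    """Brute-force top-k similar pairs over a time-based sliding window.

    Hypothetical baseline: every query compares all pairs in the window,
    which is exactly the cost a streaming algorithm tries to avoid.
    """

    def __init__(self, k: int, window: float):
        self.k = k
        self.window = window      # assumed window length, in timestamp units
        self.items = deque()      # (timestamp, set_id, tokens), oldest first

    def add(self, timestamp: float, set_id: str, tokens) -> None:
        """Append a new set and expire sets that fell out of the window."""
        self.items.append((timestamp, set_id, frozenset(tokens)))
        while self.items and self.items[0][0] <= timestamp - self.window:
            self.items.popleft()

    def top_k(self):
        """Return up to k (similarity, id_a, id_b) tuples, most similar first."""
        pairs = (
            (jaccard(a, b), id_a, id_b)
            for (_, id_a, a), (_, id_b, b) in combinations(self.items, 2)
        )
        # Ties on similarity fall back to comparing the ids, so ids should
        # be mutually comparable (e.g. all strings).
        return nlargest(self.k, pairs)
```

A caller would feed sets in timestamp order with `add` and call `top_k` whenever the current result is needed. Each query costs time quadratic in the number of sets in the window, which is why this baseline cannot keep up with fast streams.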
Distributed Streaming Set Similarity Join
TLDR
A novel bundle-based join algorithm is proposed that groups similar records on-the-fly to reduce filtering cost, together with an efficient verification technique that verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually.
SETJoin: a novel top-k similarity join algorithm
TLDR
A novel algorithm, SETJoin, is proposed by combining the existing event-driven framework with three simple yet efficient optimization techniques, viz., reducing the cost of hashing by rearranging the order of the candidate filtering and hash table lookup operations; maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth.
Similarity Search and Applications: 13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30 – October 2, 2020, Proceedings
TLDR
This paper presents a meta-analyses of keynotes interactive exploration using Hypergraphs to explore the role of language in the exploration of graph-based knowledge representation.

References

Continuous monitoring of top-k queries over sliding windows
TLDR
This paper presents two processing techniques: the first one computes the new answer of a query whenever some of the current top-k points expire; the second one partially pre-computes the future changes in the result, achieving better running time at the expense of slightly higher space requirements.
Streaming Similarity Self-Join
TLDR
Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
A Generic Framework for Top-k Pairs and Top-k Objects Queries over Sliding Windows
Top-k pairs and top-k objects queries have received significant attention by the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k…
Time- and Space-Efficient Sliding Window Top-k Query Processing
TLDR
An experimental evaluation systematically compares different top-k/w processing algorithms and shows that while competing algorithms offer either time efficiency at the expense of space efficiency or vice versa, the algorithms based on the probabilistic k-skyband are both time and space efficient.
Leveraging Set Relations in Exact Set Similarity Join
TLDR
This paper explores index-level set relations and derives an algorithm that reduces computational cost by incrementally generating the answer of one set from the already computed answer of another similar set, rather than computing the answer from scratch.
A Generic Framework for Top-k Pairs and Top-k Objects Queries over Sliding Windows
TLDR
This paper presents the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows, and demonstrates the superiority of the algorithm over the state-of-the-art algorithm.
Local Similarity Search for Unstructured Text
TLDR
This paper studies the problem of local similarity search for finding partially replicated text, that is, identifying documents that approximately share a pair of sliding windows which differ by no more than τ tokens, and proposes a cost-aware algorithm to find a good partitioning of the token universe.
Top-k Set Similarity Joins
TLDR
An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently; it is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs.
Generalizing prefix filtering to improve set similarity joins
TLDR
This work drastically decreases the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase, and shows that this trade-off is advantageous: it consistently achieves substantial speed-ups as compared to known algorithms.
An Efficient Partition Based Method for Exact Set Similarity Joins
TLDR
A partition scheme is proposed that partitions the sets into several subsets and guarantees that two sets are similar only if they share a common subset, along with an adaptive grouping mechanism that can reduce the complexity to O(s log s).