Corpus ID: 249191275

Dynamic Thresholding for Online Distributed Data Selection

Mariel A. Werner, Anastasios Nikolas Angelopoulos, Stephen Bates, Michael I. Jordan
The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for… 
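The thresholding idea in the abstract — keep a streamed point only if its marginal value under a submodular utility clears a threshold — can be sketched as follows. This is a minimal illustration, not the paper's algorithm; the coverage utility, the fixed threshold `tau`, and all names are assumptions for the sketch.

```python
def marginal_gain(f, selected, x):
    """Marginal value of adding x to the current selection."""
    return f(selected | {x}) - f(selected)

def select_online(stream, f, tau):
    """Single pass over the stream: keep items whose marginal gain >= tau."""
    selected = set()
    for x in stream:
        if marginal_gain(f, selected, x) >= tau:
            selected.add(x)
    return selected

# Toy submodular utility: coverage of labeled groups (illustrative only).
groups = {1: {"a"}, 2: {"a", "b"}, 3: {"b"}, 4: {"c"}}
coverage = lambda S: len(set().union(*(groups[i] for i in S))) if S else 0

kept = select_online([1, 2, 3, 4], coverage, tau=1)  # item 3 adds nothing new
```

Because coverage is submodular, redundant items (here, item 3, whose label is already covered) fall below the threshold and are discarded online.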



Cardinality constrained submodular maximization for random streams
This work simplifies both the algorithm and the analysis, obtaining an exponential improvement in the ε-dependence, and gives a simple (1/e − ε)-approximation for non-monotone functions in O(k/ε) memory.
SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios
It is argued that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets.
PRISM: A Rich Class of Parameterized Submodular Information Measures for Guided Subset Selection
This work demonstrates the superiority of PRISM over the state of the art in targeted learning and in guided image-collection summarization, and interestingly generalizes some past work, thereby reinforcing its broad utility.
Submodular Combinatorial Information Measures with Applications in Machine Learning
This paper studies combinatorial information measures that generalize independence, (conditional) entropy, (conditional) mutual information, and total correlation defined over sets of (not necessarily random) variables, and shows that, unlike entropic mutual information in general, the submodular mutual information is actually submodular in one argument, holding the other fixed.
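For reference, the submodular mutual information discussed in that paper is defined by analogy with Shannon mutual information, for a submodular function f and sets A, B:

```latex
% Submodular mutual information between sets A and B under f:
I_f(A; B) \;=\; f(A) + f(B) - f(A \cup B)
```

The summary's claim is that the map A ↦ I_f(A; B), with B held fixed, is itself submodular, which its entropic counterpart is not in general.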
An Efficient Framework for Balancing Submodularity and Cost
This paper considers a generalization of the classical selection problem in which the goal is to maximize a submodular function f minus a linear cost function c, and designs algorithms that have provable approximation guarantees, are extremely efficient, and work very well in practice.
A Simple Framework for Contrastive Learning of Visual Representations
It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Submodular Streaming in All its Glory: Tight Approximation, Minimum Memory and Low Adaptive Complexity
This paper proposes Sieve-Streaming++, which requires just one pass over the data, keeps only O(k) elements, and achieves the tight 1/2-approximation guarantee, and demonstrates the efficiency of the algorithms on real-world data summarization tasks for multi-source streams of tweets and of YouTube videos.
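The sieve rule underlying this line of work (introduced by the original Sieve-Streaming; Sieve-Streaming++ runs many such sieves with refined optimum guesses and shared memory) can be sketched for a single guess v of the optimum value. The function names and the toy coverage utility are assumptions for the sketch.

```python
def sieve(stream, f, k, v):
    """One sieve for optimum guess v: while |S| < k, admit x if its
    marginal gain is at least (v/2 - f(S)) / (k - |S|)."""
    S = set()
    for x in stream:
        if len(S) >= k:
            break
        if f(S | {x}) - f(S) >= (v / 2 - f(S)) / (k - len(S)):
            S.add(x)
    return S

# Toy coverage utility over labeled groups (illustrative only).
groups = {1: {"a"}, 2: {"a", "b"}, 3: {"c", "d"}}
f = lambda S: len(set().union(*(groups[i] for i in S))) if S else 0

picked = sieve([1, 2, 3], f, k=2, v=4)  # fills k slots, f(picked) >= v/2
```

The rule guarantees that a sieve whose guess v is close to the true optimum ends with value at least v/2, which yields the 1/2-approximation once the guesses are gridded.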
Accelerated greedy algorithms for maximizing submodular set functions
A family of approximate solution methods is studied: the greedy algorithms for optimal subset problems, given a finite set E and a real-valued function f on P(E) (the power set of E).
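The acceleration in that classic work (often called lazy greedy) exploits the fact that, by submodularity, a cached marginal gain can only shrink as the selection grows, so stale gains are valid upper bounds and only the top of a priority queue needs re-evaluation. A minimal sketch, with an assumed coverage utility standing in for f (and assuming f(∅) = 0):

```python
import heapq

def lazy_greedy(ground, f, k):
    """Accelerated ("lazy") greedy: re-evaluate only the heap's top,
    since cached gains are upper bounds under submodularity."""
    S = set()
    heap = [(-f({x}), x, 0) for x in ground]  # (-cached gain, element, round cached)
    heapq.heapify(heap)
    for rnd in range(1, k + 1):
        while heap:
            neg_gain, x, stamp = heapq.heappop(heap)
            if stamp == rnd:                   # gain is fresh this round: pick it
                S.add(x)
                break
            gain = f(S | {x}) - f(S)           # stale: refresh and reinsert
            heapq.heappush(heap, (-gain, x, rnd))
        else:
            break
    return S

# Toy coverage utility (illustrative only).
groups = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"d"}}
cov = lambda S: len(set().union(*(groups[i] for i in S))) if S else 0

chosen = lazy_greedy([1, 2, 3, 4], cov, k=2)
```

The output matches plain greedy, but elements far down the queue are never re-evaluated, which is where the speedup comes from in practice.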
Submodular Optimization in the MapReduce Model
This paper presents two simple algorithms for cardinality-constrained submodular optimization in the MapReduce model: the first is a (1/2 − o(1))-approximation in 2 MapReduce rounds, and the second is a (1 − 1/e − ε)-approximation in (1 + o(1))/ε MapReduce rounds.
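The classic two-round pattern behind such MapReduce algorithms (cf. GreeDi) has each mapper run greedy on its shard and a reducer run greedy on the union of the local solutions. A sketch under assumptions — the partitioning, function names, and toy utility are illustrative, not the paper's specific algorithms:

```python
def greedy(ground, f, k):
    """Plain greedy: repeatedly add the element with largest marginal gain."""
    S = set()
    for _ in range(k):
        rest = [x for x in ground if x not in S]
        if not rest:
            break
        S.add(max(rest, key=lambda x: f(S | {x}) - f(S)))
    return S

def two_round(partitions, f, k):
    """Round 1 (map): greedy on each shard. Round 2 (reduce): greedy on the
    union of the local picks."""
    local = [greedy(p, f, k) for p in partitions]
    return greedy(sorted(set().union(*local)), f, k)

# Toy coverage utility (illustrative only).
groups = {1: {"a"}, 2: {"a", "b"}, 3: {"b", "c"}, 4: {"c"}}
util = lambda S: len(set().union(*(groups[i] for i in S))) if S else 0

summary = two_round([[1, 2], [3, 4]], util, k=2)
```

Only the k local picks per shard cross the network, which is what makes the pattern fit the MapReduce model's communication constraints.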