• Corpus ID: 182953158

apricot: Submodular selection for data summarization in Python

Jacob M. Schreiber, Jeff A. Bilmes, William Stafford Noble. J. Mach. Learn. Res.
We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less… 
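As a rough illustration of the greedy facility location selection the abstract describes, the sketch below uses plain Python rather than apricot itself. The names `facility_location` and `greedy_select`, the Gaussian similarity, and the toy data are all assumptions made for this example, not the package's API. It stores a full n-by-n similarity matrix, which is where the quadratic memory requirement comes from.

```python
import math

def facility_location(similarity, selected):
    """f(S) = sum over all items i of max_{j in S} similarity[i][j]; f(empty) = 0."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)

def greedy_select(similarity, k):
    """Naive greedy: at each step add the item with the largest marginal gain."""
    n = len(similarity)
    selected = []
    # best_cover[i] = highest similarity between item i and any selected item
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each item's coverage improves
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]
    return selected

# Hypothetical toy data: two tight clusters on a line
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
# Full pairwise similarity matrix (nonnegative, so the function is monotone)
similarity = [[math.exp(-(a - b) ** 2) for b in points] for a in points]
chosen = greedy_select(similarity, 2)
print(sorted(chosen), facility_location(similarity, chosen))
```

On this toy input the two selected indices are the centers of the two clusters, which is the sense in which the selected subset is "representative".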

Submodlib: A Submodular Optimization Library

SUBMODLIB is an open-source, easy-to-use, efficient, and scalable Python library for submodular optimization with a C++ optimization engine, with applications in summarization, data subset selection, hyperparameter tuning, efficient training, and more.

FastClass: A Time-Efficient Approach to Weakly-Supervised Text Classification

The proposed FastClass approach uses dense text representations to retrieve class-relevant documents from an external unlabeled corpus and selects an optimal subset to train a classifier. It is less reliant on initial class descriptions because it no longer needs to expand each class description into a set of class-specific keywords.

Active Learning in Bayesian Neural Networks with Balanced Entropy Learning Principle

This paper designs and proposes a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of underlying softmax probability and the label variable, and demonstrates that it consistently outperforms well-known linearly scalable active learning methods.

MQRetNN: Multi-Horizon Time Series Forecasting with Retrieval Augmentation

The new neural architecture, MQRetNN, leverages encoded contexts from a baseline model pretrained on the entire population to improve forecasting accuracy, and a 3% improvement in test loss is demonstrated from adding a cross-entity attention mechanism.

EigenRank by Committee: A Data Subset Selection and Failure Prediction paradigm for Robust Deep Learning based Medical Image Segmentation

This work proposes a novel algorithm, named Eigenrank, which selects a subset of medical images from a large database for manual labeling, such that a U-Net trained on this subset is superior to one trained on a randomly selected subset of the same size.

Practical selection of representative sets of RNA-seq samples using a hierarchical approach

Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. It can achieve performance close to that of direct representative set selection while largely reducing the runtime and memory requirements of computing the full similarity matrix.

Effective Evaluation of Deep Active Learning on Image Classification Tasks

This work presents a unified re-implementation of state-of-the-art active learning (AL) algorithms in the context of image classification, and shows that AL techniques are 2× to 4× more label-efficient than random sampling (RS) when data augmentation is used.

A learned embedding for efficient joint analysis of millions of mass spectra

This work proposes “GLEAMS,” a deep neural network trained in a supervised fashion on previous assignments of peptides to spectra, which learns to embed spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another.

PEACEPACT: Prioritizing Examples to Accelerate Perturbation-Based Adversary Generation for DNN Classification Testing

This paper proposes a technique to select adversaries more effectively by exploiting their class distinguishability, and results reveal that the vulnerability of examples has a strong relationship with distinguishability.



Submodularity for Data Selection in Machine Translation

By explicitly formulating data selection as a submodular program, this work obtains fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous approaches easily accessible.

An Application of the Submodular Principal Partition to Training Data Subset Selection

The principal partition is applied to the problem of finding a subset of a large training data set (corpus) that is useful for accurately and rapidly prototyping novel and computationally expensive machine learning architectures, formulated as a minimization problem over a weighted sum of modular functions and submodular functions.

Submodularity in Data Subset Selection and Active Learning

The connection of submodularity to the data likelihood functions for Naive Bayes and Nearest Neighbor classifiers is shown, and the data subset selection problems for these classifiers are formulated as constrained submodular maximization.

Auto-Summarization: A Step Towards Unsupervised Learning of a Submodular Mixture

This work introduces an approach that requires the specification of only a handful of hyperparameters to determine a mixture of submodular functions for use in data science applications and introduces a mixture weight learning approach that does not (as is common) directly utilize supervised summary information.

Distributed Submodular Maximization

This paper develops a simple, two-stage protocol GreeDi, that is easily implemented using MapReduce style computations and demonstrates the effectiveness of the approach on several applications, including sparse Gaussian process inference and exemplar based clustering on tens of millions of examples using Hadoop.

Fast Multi-stage Submodular Maximization

It is shown that MULTGREED performs very closely to the standard greedy algorithm given appropriate surrogate functions and it is argued how the framework can easily be integrated with distributive algorithms for further optimization.

Lazier Than Lazy Greedy

The first linear-time algorithm for maximizing a general monotone submodular function subject to a cardinality constraint is developed, and it is shown that the randomized algorithm, STOCHASTIC-GREEDY, can achieve a (1 − 1/e − ε) approximation guarantee, in expectation, to the optimum solution in time linear in the size of the data.
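The idea behind STOCHASTIC-GREEDY can be sketched as follows: at each of the k steps, only a random sample of about (n/k)·ln(1/ε) remaining items is scored, instead of all of them. The code below is a minimal sketch under assumptions: the set-coverage objective, the function names, the default ε, and the toy input are chosen for illustration and are not taken from the paper's own code.

```python
import math
import random

def coverage(sets, selected):
    """Monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for j in selected:
        covered |= sets[j]
    return len(covered)

def stochastic_greedy(sets, k, epsilon=0.1, seed=0):
    """Pick k items, scoring only a random sample of candidates at each step."""
    rng = random.Random(seed)
    n = len(sets)
    # Per-step sample size of roughly (n / k) * ln(1 / epsilon), capped at n
    sample_size = min(n, math.ceil((n / k) * math.log(1.0 / epsilon)))
    selected, covered = [], set()
    remaining = list(range(n))
    for _ in range(min(k, n)):
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        # Marginal gain of item j = number of newly covered elements
        best = max(candidates, key=lambda j: len(sets[j] - covered))
        selected.append(best)
        covered |= sets[best]
        remaining.remove(best)
    return selected

# Hypothetical toy input: 20 overlapping windows of width 3
sets = [set(range(i, i + 3)) for i in range(20)]
picked = stochastic_greedy(sets, k=4)
print(picked, coverage(sets, picked))
```

Across all k steps the number of marginal-gain evaluations is about n·ln(1/ε), independent of k, which is the source of the linear running time.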

SFO: A Toolbox for Submodular Function Optimization

This paper presents SFO, a toolbox for MATLAB or Octave that implements algorithms for minimizing and maximizing submodular functions, allowing one to efficiently find provably (near-)optimal solutions for large problems.

How to select a good training-data subset for transcription: submodular active selection for sequences

Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We…

An analysis of approximations for maximizing submodular set functions—I

It is shown that a “greedy” heuristic always produces a solution whose value is at least 1 − [(K − 1)/K]^K times the optimal value; this bound can be achieved for each K and has a limiting value of (e − 1)/e, where e is the base of the natural logarithm.
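To make the bound concrete, the short sketch below evaluates 1 − [(K − 1)/K]^K for a few values of K and compares it with the limit (e − 1)/e ≈ 0.632. The function name `greedy_bound` and the chosen values of K are illustrative assumptions.

```python
import math

def greedy_bound(k):
    """Worst-case ratio 1 - ((k - 1) / k) ** k for a cardinality budget k."""
    return 1.0 - ((k - 1) / k) ** k

# Limiting value of the bound as k grows
limit = (math.e - 1) / math.e

for k in (1, 2, 5, 10, 100):
    print(k, round(greedy_bound(k), 4), round(greedy_bound(k) - limit, 4))
```

The ratio starts at 1 for K = 1, is 0.75 for K = 2, and decreases toward the limit as K grows, so the guarantee is never worse than (e − 1)/e.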