# apricot: Submodular selection for data summarization in Python

```bibtex
@article{Schreiber2019apricotSS,
  title   = {apricot: Submodular selection for data summarization in Python},
  author  = {Jacob M. Schreiber and Jeff A. Bilmes and William Stafford Noble},
  journal = {J. Mach. Learn. Res.},
  year    = {2019},
  volume  = {21},
  pages   = {161:1-161:6}
}
```

We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less…
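The greedy selection of a facility-location objective described in the abstract can be sketched in plain NumPy. This is an illustrative implementation, not apricot's actual API; the function name and the similarity construction are assumptions made for the example, and it deliberately builds the full similarity matrix to show the quadratic-memory cost the abstract mentions.

```python
import numpy as np

def greedy_facility_location(X, k):
    """Naive greedy maximization of the facility-location function
    f(S) = sum_i max_{j in S} sim(i, j).

    Builds the full n x n similarity matrix, so memory is quadratic
    in the number of examples, as the abstract notes."""
    # Similarity from negative squared Euclidean distance, shifted so
    # all entries are non-negative (an illustrative choice of metric).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = d2.max() - d2

    n = sim.shape[0]
    selected = []
    best = np.zeros(n)  # max similarity of each point to the chosen set
    for _ in range(k):
        # Marginal gain of adding candidate j: total improvement in coverage.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf  # never re-select a chosen point
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected
```

Real implementations avoid recomputing every marginal gain at each step (e.g., via lazy evaluation exploiting submodularity); this sketch omits that for clarity.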

## 21 Citations

### Submodlib: A Submodular Optimization Library

- Computer Science, ArXiv
- 2022

SUBMODLIB is an open-source, easy-to-use, efficient, and scalable Python library for submodular optimization with a C++ optimization engine; it finds application in summarization, data subset selection, hyperparameter tuning, efficient training, and more.

### FastClass: A Time-Efficient Approach to Weakly-Supervised Text Classification

- Computer Science, Economics
- 2022

The proposed FastClass approach uses dense text representations to retrieve class-relevant documents from an external unlabeled corpus and selects an optimal subset to train a classifier; it is less reliant on initial class descriptions because it no longer needs to expand each class description into a set of class-specific keywords.

### Active Learning in Bayesian Neural Networks with Balanced Entropy Learning Principle

- Computer Science
- 2021

This paper designs and proposes a new uncertainty measure, Balanced Entropy Acquisition (BalEntAcq), which captures the information balance between the uncertainty of the underlying softmax probability and the label variable, and demonstrates that it consistently outperforms well-known linearly scalable active learning methods.

### MQRetNN: Multi-Horizon Time Series Forecasting with Retrieval Augmentation

- Computer Science, ArXiv
- 2022

The new neural architecture, MQRetNN, leverages encoded contexts from a baseline model pretrained on the entire population to improve forecasting accuracy, and the authors demonstrate a 3% improvement in test loss from adding a cross-entity attention mechanism.

### EigenRank by Committee: A Data Subset Selection and Failure Prediction paradigm for Robust Deep Learning based Medical Image Segmentation

- Computer Science, ArXiv
- 2019

This work proposes a novel algorithm, named EigenRank, which selects from a large database a subset of medical images for manual labeling, such that a U-Net trained on this subset is superior to one trained on a randomly selected subset of the same size.

### Practical selection of representative sets of RNA-seq samples using a hierarchical approach

- Computer Science, Biology, bioRxiv
- 2021

Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels; it can achieve performance close to that of direct representative set selection while largely reducing the runtime and memory requirements of computing the full similarity matrix.

### Effective Evaluation of Deep Active Learning on Image Classification Tasks

- Computer Science, ArXiv
- 2021

This work presents a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and shows that AL techniques are 2× to 4× more label-efficient than random sampling (RS) when data augmentation is used.

### A learned embedding for efficient joint analysis of millions of mass spectra

- Computer Science, bioRxiv
- 2022

This work proposes to train a deep neural network in a supervised fashion based on previous assignments of peptides to spectra, called “GLEAMS,” which learns to embed spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another.

### PEACEPACT: Prioritizing Examples to Accelerate Perturbation-Based Adversary Generation for DNN Classification Testing

- Computer Science, 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS)
- 2020

This paper proposes a technique to select adversaries more effectively by exploiting their class distinguishability, and results reveal that the vulnerability of examples has a strong relationship with distinguishability.

## References

Showing 1–10 of 23 references.

### Submodularity for Data Selection in Machine Translation

- Computer Science, EMNLP
- 2014

By explicitly formulating data selection as a submodular program, this work obtains fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous approaches easily accessible.

### An Application of the Submodular Principal Partition to Training Data Subset Selection

- Computer Science, Mathematics
- 2010

The principal partition is applied to the problem of finding a subset of a large training data set (corpus) that is useful for accurately and rapidly prototyping novel and computationally expensive machine learning architectures, formulated as a minimization problem over a weighted sum of modular and submodular functions.

### Submodularity in Data Subset Selection and Active Learning

- Computer Science, ICML
- 2015

The connection of submodularity to the data likelihood functions for Naive Bayes and Nearest Neighbor classifiers is shown, and the data subset selection problems for these classifiers are formulated as constrained submodular maximization.

### Auto-Summarization: A Step Towards Unsupervised Learning of a Submodular Mixture

- Computer Science, SDM
- 2019

This work introduces an approach that requires the specification of only a handful of hyperparameters to determine a mixture of submodular functions for use in data science applications and introduces a mixture weight learning approach that does not (as is common) directly utilize supervised summary information.

### Distributed Submodular Maximization

- Computer Science, J. Mach. Learn. Res.
- 2016

This paper develops GreeDi, a simple two-stage protocol that is easily implemented using MapReduce-style computations, and demonstrates the effectiveness of the approach on several applications, including sparse Gaussian process inference and exemplar-based clustering on tens of millions of examples using Hadoop.

### Fast Multi-stage Submodular Maximization

- Computer Science, ICML
- 2014

It is shown that MULTGREED performs very closely to the standard greedy algorithm given appropriate surrogate functions, and it is argued that the framework can easily be integrated with distributed algorithms for further optimization.

### Lazier Than Lazy Greedy

- Computer Science, AAAI
- 2015

The first linear-time algorithm for maximizing a general monotone submodular function subject to a cardinality constraint is developed, and it is shown that the randomized algorithm, STOCHASTIC-GREEDY, can achieve a (1 − 1/e − ε) approximation guarantee, in expectation, to the optimum solution in time linear in the size of the data.
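The STOCHASTIC-GREEDY procedure this abstract describes can be sketched against a generic marginal-gain oracle. This is an illustrative sketch, not the authors' code: the `gain_fn` interface and the sample-size formula (n/k)·log(1/ε) are taken from the algorithm's description, but all names here are assumptions.

```python
import numpy as np

def stochastic_greedy(gain_fn, n, k, eps=0.1, rng=None):
    """STOCHASTIC-GREEDY sketch for a monotone submodular function.

    gain_fn(S, j) must return the marginal gain of adding element j
    to the currently selected set S (a list of indices).  Each step
    picks the best element from a random candidate sample of size
    about (n/k) * log(1/eps) instead of scanning all n elements."""
    rng = np.random.default_rng(rng)
    sample_size = min(n, int(np.ceil((n / k) * np.log(1.0 / eps))))
    selected = []
    remaining = list(range(n))
    for _ in range(k):
        # Sample candidates uniformly without replacement from the
        # unselected elements, then greedily take the best of them.
        candidates = rng.choice(remaining,
                                size=min(sample_size, len(remaining)),
                                replace=False)
        gains = [gain_fn(selected, int(j)) for j in candidates]
        j = int(candidates[int(np.argmax(gains))])
        selected.append(j)
        remaining.remove(j)
    return selected
```

Because each of the k steps evaluates only about (n/k)·log(1/ε) candidates rather than all n, the total number of gain evaluations is linear in n, which is the source of the algorithm's speedup over standard greedy.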

### SFO: A Toolbox for Submodular Function Optimization

- Computer Science, J. Mach. Learn. Res.
- 2010

SFO is presented, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions that allows one to efficiently find provably (near-) optimal solutions for large problems.

### How to select a good training-data subset for transcription: submodular active selection for sequences

- Computer Science, INTERSPEECH
- 2009

Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We…

### An analysis of approximations for maximizing submodular set functions—I

- Mathematics, Math. Program.
- 1978

It is shown that a “greedy” heuristic always produces a solution whose value is at least 1 − [(K − 1)/K]^K times the optimal value; this bound can be achieved for each K and has a limiting value of (e − 1)/e, where e is the base of the natural logarithm.
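Written out, the classical bound from this abstract (with K the cardinality constraint and S* the optimal set) is:

```latex
f(S_{\mathrm{greedy}}) \;\ge\; \left[\,1 - \left(\tfrac{K-1}{K}\right)^{K}\right] f(S^{*}),
\qquad
\lim_{K \to \infty}\left[\,1 - \left(1 - \tfrac{1}{K}\right)^{K}\right]
\;=\; 1 - \tfrac{1}{e} \;=\; \tfrac{e-1}{e} \;\approx\; 0.632 .
```

This 1 − 1/e constant is the guarantee that apricot's greedy selection (and the stochastic variant cited above, up to an ε term) inherits.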