• Corpus ID: 246430305

Flashlight: Enabling Innovation in Tools for Machine Learning

Authors: Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Y. Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert
As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large…


Pseudo-Labeling for Massively Multilingual Speech Recognition

This work proposes a simple pseudo-labeling recipe that works well even with low-resource languages, and can yield a model with better performance for many languages that also transfers well to LibriSpeech.
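The self-training recipe summarized above can be sketched as a single pseudo-labeling round. This is a hedged, generic sketch of the idea, not the paper's pipeline; `train` and `transcribe` are hypothetical placeholders supplied by the caller:

```python
def pseudo_label_round(train, transcribe, labeled, unlabeled):
    """One round of pseudo-labeling (self-training), in sketch form:
    fit a model on labeled audio, transcribe the unlabeled pool with
    it, then retrain on the union of real and machine-made labels."""
    model = train(labeled)  # supervised seed model
    pseudo = [(x, transcribe(model, x)) for x in unlabeled]  # machine labels
    return train(labeled + pseudo)  # retrain on real + pseudo-labeled data
```

In practice the recipe is iterated, and pseudo-labels are typically filtered by confidence before retraining.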

Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

This paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition, by tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.

EURO: ESPnet Unsupervised ASR Open-source Toolkit

EURO adopts the state-of-the-art UASR learning method introduced by Wav2vec-U and originally implemented in FAIRSEQ, which leverages self-supervised speech representations and adversarial training; EURO improves the pipeline's efficiency and can be easily applied to existing datasets in ESPnet.

Blank Collapse: Compressing CTC emission for the faster decoding

This paper deeply analyzes the blank label in CTC beam search and proposes a very simple method to reduce the amount of computation, resulting in faster beam search decoding: up to 78% faster than ordinary beam search decoding, with very small loss of accuracy on the LibriSpeech datasets.
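The idea can be illustrated with a minimal sketch (a simplified rule, not the paper's exact criterion): runs of frames in which the blank label dominates are collapsed before beam search, so the decoder visits fewer time steps:

```python
def blank_collapse(emissions, blank=0, threshold=0.99):
    """Drop runs of frames whose blank probability exceeds `threshold`,
    keeping one representative frame per run, so CTC beam search
    processes fewer frames. Illustrative sketch only."""
    kept = []
    in_blank_run = False
    for frame in emissions:
        if frame[blank] >= threshold:
            if not in_blank_run:
                kept.append(frame)  # keep one blank frame per run
            in_blank_run = True
        else:
            kept.append(frame)  # non-blank frames always survive
            in_blank_run = False
    return kept
```

Since CTC treats repeated blanks as a single separator, discarding the redundant frames leaves the decoded hypothesis essentially unchanged while shortening the sequence the beam search must scan.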



DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters

Microsoft's open-source library DeepSpeed advances large-model training by improving scale, speed, cost, and usability, unlocking the ability to train models with over 100 billion parameters.

Beyond Data and Model Parallelism for Deep Neural Networks

SOAP, a more comprehensive search space of parallelization strategies for DNNs, is defined; it includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine, is then proposed.

Chainer: A Deep Learning Framework for Accelerating the Research Cycle

The Chainer framework is introduced, which intends to provide a flexible, intuitive, and high-performance means of implementing the full range of deep learning models needed by researchers and practitioners.

TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

ZeRO: Memory optimizations Toward Training Trillion Parameter Models

ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency.
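A toy sketch of the core partitioning idea (roughly ZeRO stage 1, greatly simplified): instead of every data-parallel rank replicating all optimizer states, each rank owns only its own shard:

```python
def shard_optimizer_states(states, world_size):
    """Round-robin-partition per-parameter optimizer states across
    data-parallel ranks, so each rank stores roughly 1/world_size of
    the memory a fully replicated optimizer would need. Sketch only;
    the real system also partitions gradients and parameters."""
    shards = [[] for _ in range(world_size)]
    for i, state in enumerate(states):
        shards[i % world_size].append(state)
    return shards
```

Each rank then updates only its shard and exchanges the results collectively, which is how per-device memory can shrink as more devices are added.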

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Machine Learning Systems are Stuck in a Rut

This paper explains how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, and shows how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model.

Learning to Optimize Tensor Programs

A learning-based framework to optimize tensor programs for deep learning workloads that learns domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants and accelerates the search by effective model transfer across workloads.
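The search loop can be sketched abstractly (hypothetical arguments, not the framework's API): a learned cost model ranks candidate implementations cheaply, and only the top-ranked few are actually measured on hardware:

```python
def cost_guided_pick(candidates, predict_cost, measure, top_k=2):
    """Rank candidate tensor-program variants with a cheap learned
    cost model, then benchmark only the top_k on real hardware and
    return the fastest measured one. Illustrative sketch."""
    ranked = sorted(candidates, key=predict_cost)[:top_k]
    timed = [(measure(c), c) for c in ranked]
    return min(timed, key=lambda t: t[0])[1]
```

The measurements gathered this way would, in the full system, also be fed back to retrain the cost model, which is what makes search over billions of variants tractable.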

In-datacenter performance analysis of a tensor processing unit

  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017

This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015, which accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, contemporaries deployed in the same datacenters.