RecSSD: near data processing for solid state drive based recommendation inference

@article{Wilkening2021RecSSDND,
  title={RecSSD: near data processing for solid state drive based recommendation inference},
  author={Mark Wilkening and Udit Gupta and Samuel Hsia and Caroline Trippel and Carole-Jean Wu and David M. Brooks and Gu-Yeon Wei},
  journal={Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
  year={2021}
}
  • Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David M. Brooks, Gu-Yeon Wei
  • Published 29 January 2021
  • Computer Science
  • Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Neural personalized recommendation models are used across a wide variety of datacenter applications including search, social media, and entertainment. State-of-the-art models comprise large embedding tables that have billions of parameters requiring large memory capacities. Unfortunately, large and fast DRAM-based memories levy high infrastructure costs. Conventional SSD-based storage solutions offer an order of magnitude larger capacity, but have worse read latency and bandwidth, degrading…
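To make the bottleneck concrete, the operation at issue is the pooled embedding lookup: a sparse gather of table rows followed by a reduction, which RecSSD moves near the flash. Below is a minimal sketch of that access pattern; the table size, dimension, and indices are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the pooled embedding lookup (a SparseLengthsSum-style
# gather-and-reduce) that dominates memory traffic in these models.
# Table size, dimension, and indices are illustrative assumptions.
NUM_ROWS, DIM = 100_000, 64   # production tables reach billions of rows
table = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

def pooled_lookup(table, indices, lengths):
    """Gather sparse rows and sum-pool them per sample.

    indices: flat array of row ids for the whole batch
    lengths: how many of those ids belong to each sample
    """
    out, offset = [], 0
    for n in lengths:
        out.append(table[indices[offset:offset + n]].sum(axis=0))
        offset += n
    return np.stack(out)

# A batch of two samples with 3 and 2 sparse features respectively.
indices = np.array([12, 7, 99_420, 5, 88])
lengths = np.array([3, 2])
print(pooled_lookup(table, indices, lengths).shape)  # (2, 64)
```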
Citations

Supporting Massive DLRM Inference Through Software Defined Memory
  • E. K. Ardestani, Changkyu Kim, +17 authors Vijay Rao
  • Computer Science
  • ArXiv
  • 2021
TLDR
It is shown how underlying technologies such as NAND Flash and 3DXP differ and relate to real-world scenarios, enabling 5% to 29% power savings.
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
TLDR
RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail latency, and system throughput, is designed specifically to exploit the distinct design space opened via RecPipe.
TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory
TLDR
TRiM is an NDP architecture for accelerating recommendation systems that augments the DRAM datapath with “in-DRAM” reduction units at the DDR4/5 rank/bank-group/bank level; a host-side architecture with hot embedding-vector replication is also proposed to alleviate the load imbalance that arises across the reduction units.
Persia: A Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
TLDR
A novel hybrid training algorithm is designed, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; then a system called Persia is built to support this hybrid training algorithm.
Towards Offloadable and Migratable Microservices on Disaggregated Architectures: Vision, Challenges, and Research Roadmap
TLDR
A critical systems research direction of designing and developing offloadable and migratable microservices on disaggregated architectures is envisioned and a research roadmap is proposed to achieve the envisioned objectives in a promising way.
ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
TLDR
Compared to checkpointing, ECRM reduces training-time overhead for large DLRMs by up to 88%, recovers from failures up to 10.3× faster, and allows training to proceed during recovery, showing the promise of erasure coding in imparting efficient fault tolerance to training current and future DLRMs.
FlashEmbedding: storing embedding tables in SSD for large-scale recommender systems
  • Hu Wan, Xuan Sun, Yufei Cui, Chia-Lin Yang, Tei-Wei Kuo, Chun Jason Xue
  • Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems
  • 2021
TLDR
This work presents FlashEmbedding, a hardware/software co-design solution for storing embedding tables on SSDs for large-scale recommendation inference under memory-capacity-limited systems, which achieves up to 17.44× lower latency in embedding lookups and 2.89× lower end-to-end latency than the baseline solution.
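FlashEmbedding's actual interface is a hardware/software co-design, but the core idea of serving lookups from an SSD-resident table rather than a DRAM-resident array can be sketched with a memory-mapped file. This is a toy illustration under that assumption; the file path and sizes are hypothetical.

```python
import numpy as np

# Toy stand-in for an SSD-resident embedding table: keep the table in a
# file and mmap it, so lookups fetch rows through the page cache rather
# than from a DRAM-resident array. Path and sizes are hypothetical.
NUM_ROWS, DIM = 100_000, 64
PATH = "emb_table.bin"

# One-time setup: write a random table to flash.
np.random.rand(NUM_ROWS, DIM).astype(np.float32).tofile(PATH)

# Inference side: map the file; only the rows actually touched are read.
table = np.memmap(PATH, dtype=np.float32, mode="r", shape=(NUM_ROWS, DIM))
rows = table[np.array([3, 1_024, 77])]   # small random reads hit the SSD
print(rows.shape)                        # (3, 64)
```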
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters
  • Xiangru Lian, Binhang Yuan, +24 authors Ji Liu
  • Computer Science
  • ArXiv
  • 2021
Deep learning based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed an exponential growth of the model scale—from Google's 2016…

References

Showing 1-10 of 61 references
RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
  • Liu Ke, Udit Gupta, +18 authors Xiaodong Wang
  • Computer Science
  • 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
  • 2020
TLDR
RecNMP is proposed, which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models, and is specifically tailored to production environments with heavy co-location of operators on a single server.
The Architectural Implications of Facebook's DNN-Based Personalized Recommendation
TLDR
A set of real-world, production-scale DNNs for personalized recommendation, coupled with relevant performance metrics for evaluation, is presented, and in-depth analysis is conducted that underpins future system design and optimization for at-scale recommendation.
Understanding Capacity-Driven Scale-Out Neural Recommendation Inference
TLDR
This work specifically explores latency-bounded inference systems, compared to the throughput-oriented training systems of other recent works, and finds that the latency and compute overheads of distributed inference are largely attributed to a model's static embedding table distribution and sparsity of inference request inputs.
DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference
TLDR
DeepRecSched is proposed, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems.
Deep Learning Recommendation Model for Personalization and Recommendation Systems
TLDR
A state-of-the-art deep learning recommendation model (DLRM) is developed and its implementation in both PyTorch and Caffe2 frameworks is provided, along with a specialized parallelization scheme that uses model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale out compute for the fully-connected layers.
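Since several of the works above build on this DLRM structure, a tiny PyTorch sketch of it may help: a bottom MLP for dense features, one pooled embedding lookup per sparse feature (the part that is model-parallel at scale), and a top MLP over the combined features. Layer sizes are illustrative assumptions, and plain concatenation stands in for DLRM's pairwise feature interaction.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Minimal DLRM-shaped model; sizes are illustrative assumptions."""

    def __init__(self, num_rows=1000, dim=16, num_tables=3, dense_in=4):
        super().__init__()
        # Sparse side: one sum-pooled embedding table per categorical
        # feature; this is the memory-capacity-bound part of the model.
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(num_rows, dim, mode="sum") for _ in range(num_tables)]
        )
        self.bottom_mlp = nn.Sequential(nn.Linear(dense_in, dim), nn.ReLU())
        # Top MLP consumes the dense output concatenated with the pooled
        # embeddings (real DLRM inserts pairwise feature interaction here).
        self.top_mlp = nn.Sequential(nn.Linear(dim * (num_tables + 1), 1), nn.Sigmoid())

    def forward(self, dense, sparse_ids):
        parts = [self.bottom_mlp(dense)]
        parts += [table(ids) for table, ids in zip(self.tables, sparse_ids)]
        return self.top_mlp(torch.cat(parts, dim=1))

model = TinyDLRM()
dense = torch.rand(2, 4)                                   # batch of 2
sparse = [torch.randint(0, 1000, (2, 5)) for _ in range(3)]
print(model(dense, sparse).shape)                          # torch.Size([2, 1])
```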
Bandana: Using Non-volatile Memory for Storing Deep Learning Models
TLDR
Bandana is presented, a storage system that reduces the DRAM footprint of embeddings by using Non-volatile Memory (NVM) as the primary storage medium, with a small amount of DRAM as cache.
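Below is a hedged sketch of that caching idea, with an in-memory array standing in for the NVM device and an LRU-managed dict as the small DRAM cache; the class, names, and sizes are assumptions for illustration, not Bandana's actual design.

```python
from collections import OrderedDict
import numpy as np

class CachedEmbeddingStore:
    """Serve embedding rows from slow bulk storage through a DRAM LRU cache."""

    def __init__(self, nvm_table, cache_rows=1024):
        self.nvm = nvm_table        # stand-in for the NVM-resident table
        self.cache = OrderedDict()  # small, fast DRAM cache in LRU order
        self.cache_rows = cache_rows

    def lookup(self, row_id):
        if row_id in self.cache:                 # DRAM hit
            self.cache.move_to_end(row_id)
            return self.cache[row_id]
        vec = np.array(self.nvm[row_id])         # miss: copy row out of NVM
        self.cache[row_id] = vec
        if len(self.cache) > self.cache_rows:
            self.cache.popitem(last=False)       # evict least-recently-used
        return vec

nvm = np.random.rand(100_000, 64).astype(np.float32)
store = CachedEmbeddingStore(nvm)
print(store.lookup(42).shape)  # (64,)
```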
TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning
TLDR
This paper presents a vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations, populated inside a GPU-centric system interconnect as a remote memory pool.
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
TLDR
A distributed GPU hierarchical parameter server for massive scale deep learning ads systems that utilizes GPU High-Bandwidth Memory, CPU main memory, and SSD as a 3-layer hierarchical storage; the price-performance ratio of the proposed system is 4-9 times better than an MPI-cluster solution.
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
TLDR
Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.
AIBox: CTR Prediction Model Training on a Single Node
TLDR
AIBox is presented, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs, and a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses.