DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement

@article{Lin2021DREAMPlaceDL,
  title={DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement},
  author={Yibo Lin and Zixuan Jiang and Jiaqi Gu and Wuxi Li and Shounak Dhar and Haoxing Ren and Brucek Khailany and David Z. Pan},
  journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  year={2021},
  volume={40},
  pages={748--761}
}
  • Yibo Lin, Zixuan Jiang, Jiaqi Gu, Wuxi Li, Shounak Dhar, Haoxing Ren, Brucek Khailany, David Z. Pan
  • Published 22 June 2020
  • Computer Science
  • IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Placement for very large-scale integrated (VLSI) circuits is one of the most important steps for design closure. We propose a novel GPU-accelerated placement framework, DREAMPlace, by casting the analytical placement problem equivalently to training a neural network. Implemented on top of a widely adopted deep learning toolkit, PyTorch, with customized key kernels for wirelength and density computations, DREAMPlace can achieve around …
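The abstract's central idea, treating cell coordinates like trainable parameters and the placement objective like a loss, can be illustrated with a small sketch. The code below is not DREAMPlace and does not use PyTorch; it is a pure-Python toy under stated assumptions: cells live on a 1-D line, the wirelength is smoothed with a log-sum-exp approximation chosen here for illustration, the density term is omitted entirely, and the names `lse_wirelength`, `lse_gradient`, and `place` are hypothetical.

```python
import math


def lse_wirelength(x, nets, gamma):
    """Log-sum-exp approximation of sum over nets of (max x - min x)."""
    total = 0.0
    for net in nets:
        hi = max(x[i] for i in net)
        lo = min(x[i] for i in net)
        # Shift by hi/lo so every exponent is <= 0 (numerical stability).
        total += hi + gamma * math.log(sum(math.exp((x[i] - hi) / gamma) for i in net))
        total += -lo + gamma * math.log(sum(math.exp((lo - x[i]) / gamma) for i in net))
    return total


def lse_gradient(x, nets, gamma):
    """Analytic gradient of lse_wirelength with respect to every coordinate."""
    grad = [0.0] * len(x)
    for net in nets:
        hi = max(x[i] for i in net)
        lo = min(x[i] for i in net)
        pos = [math.exp((x[i] - hi) / gamma) for i in net]
        neg = [math.exp((lo - x[i]) / gamma) for i in net]
        pos_sum, neg_sum = sum(pos), sum(neg)
        for k, i in enumerate(net):
            # softmax(x / gamma) minus softmax(-x / gamma) for this pin
            grad[i] += pos[k] / pos_sum - neg[k] / neg_sum
    return grad


def place(x0, nets, movable, gamma=0.5, lr=0.1, steps=200):
    """Plain gradient descent on movable cells; fixed cells never move."""
    x = list(x0)
    for _ in range(steps):
        g = lse_gradient(x, nets, gamma)
        for i in movable:
            x[i] -= lr * g[i]
    return x


# Toy netlist (hypothetical): cells 0 and 3 are fixed pads, 1 and 2 are movable.
nets = [[0, 1], [1, 2], [2, 3]]
x_init = [0.0, 9.0, 1.0, 10.0]
x_final = place(x_init, nets, movable=[1, 2])
print(lse_wirelength(x_init, nets, 0.5), lse_wirelength(x_final, nets, 0.5))
```

In a deep learning toolkit the hand-written `lse_gradient` would typically be replaced by automatic differentiation, and the update loop by an off-the-shelf optimizer; a real placer also needs a density penalty to keep cells from overlapping.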

DREAMPlace 2.0: Open-Source GPU-Accelerated Global and Detailed Placement for Large-Scale VLSI Designs

This work presents an open-source placement framework, DREAMPlace 2.0, with deep learning toolkit-enabled GPU acceleration for both global and detailed placement optimization to tackle the issues of efficiency and development overhead.

CU.POKer: Placing DNNs on Wafer-Scale AI Accelerator with Optimal Kernel Sizing

CU.POKer, a high-performance engine fully customized for the WSE's DNN workload placement challenge, is proposed; it combines a provably optimal search scheme for placeable kernel candidates with a data-flow-aware placement tool to ensure state-of-the-art quality on real industrial benchmarks.

Xplace: an extremely fast and extensible global placement framework

This work develops an extremely fast GPU-accelerated global placer, Xplace, which achieves around 2x speedup with better solution quality compared to DREAMPlace, and plugs a novel Fourier neural network into Xplace as an extension to further improve solution quality.

ABCDPlace: Accelerated Batch-Based Concurrent Detailed Placement on Multithreaded CPUs and GPUs

This article presents a concurrent detailed placement framework, ABCDPlace, that exploits multithreading and graphics processing unit (GPU) acceleration, and proposes batch-based concurrent algorithms for widely adopted sequential detailed placement techniques, such as independent set matching, global swap, and local reordering.

Towards Machine Learning for Placement and Routing in Chip Design: a Methodological Overview

This survey starts with an introduction to the basics of placement and routing, with a brief description of classic learning-free solvers, and presents a detailed review of recent advances in machine learning for placement and routing.

Opportunities for RTL and Gate Level Simulation using GPUs (Invited Talk)

This talk presents the idea that coding frameworks usually used for popular machine learning topics, such as PyTorch/DGL.ai, can also be used for exploring simulation, and demonstrates a crude oblivious two-value cycle gate-level simulator that exhibits >20X speedup despite its simplistic construction.

On Joint Learning for Solving Placement and Routing in Chip Design

A joint learning method termed DeepPlace is proposed for the placement of macros and standard cells, integrating reinforcement learning with a gradient-based optimization scheme; a further joint learning approach via reinforcement learning, called DeepPR, handles both macro placement and routing.

CNN-inspired analytical global placement for large-scale heterogeneous FPGAs

This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant frequency translation problem for large-scale FPGAs, and formulates a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution.

DrPlace: A Deep Learning Based Routability-Driven VLSI Placement Algorithm

A deep learning based routability-driven VLSI placement algorithm named DrPlace is proposed, which adds a pin density function to the global placement model and provides an efficient GPU implementation of the key pin density kernel.

Ultrafast CPU/GPU Kernels for Density Accumulation in Placement

This paper proposes efficient CPU/GPU kernels for density accumulation by decomposing the problem into two phases: constant-time density collection for each instance and a linear-time prefix sum.
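One common way to realize a "constant-time collection per instance, then linear-time prefix sum" split is a 2-D difference array. The sketch below is an illustration under assumptions, not the paper's kernels: rectangles are assumed to be aligned to bin boundaries and given as half-open integer ranges, and the function name `accumulate_density` is hypothetical.

```python
def accumulate_density(rects, nx, ny):
    """Count how many rectangles cover each bin of an nx-by-ny grid.

    rects: iterable of (x0, y0, x1, y1) in bin units, covering the
           half-open range [x0, x1) x [y0, y1). Bin alignment is an
           assumption of this sketch.
    """
    # One extra row and column so corner updates at x1 == nx or y1 == ny stay in range.
    diff = [[0] * (ny + 1) for _ in range(nx + 1)]

    # Phase 1: constant work per rectangle (four corner updates).
    for x0, y0, x1, y1 in rects:
        diff[x0][y0] += 1
        diff[x1][y0] -= 1
        diff[x0][y1] -= 1
        diff[x1][y1] += 1

    # Phase 2: prefix sums along y, then along x (linear in the number of bins).
    for i in range(nx + 1):
        for j in range(1, ny + 1):
            diff[i][j] += diff[i][j - 1]
    for i in range(1, nx + 1):
        for j in range(ny + 1):
            diff[i][j] += diff[i - 1][j]

    # Drop the padding row and column.
    return [row[:ny] for row in diff[:nx]]


density = accumulate_density([(0, 0, 2, 2), (1, 1, 3, 3)], 3, 3)
print(density)
```

The total cost is proportional to the number of instances plus the number of bins, independent of how many bins each instance covers, which is what makes this decomposition attractive for large cells and macros.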

References


DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement

A novel GPU-accelerated placement framework DREAMPlace is proposed, by casting the analytical placement problem equivalently to training a neural network, to achieve over 30 times speedup in global placement without quality degradation compared to the state-of-the-art multi-threaded placer RePlAce.

ABCDPlace: Accelerated Batch-Based Concurrent Detailed Placement on Multithreaded CPUs and GPUs

This article presents a concurrent detailed placement framework, ABCDPlace, that exploits multithreading and graphics processing unit (GPU) acceleration, and proposes batch-based concurrent algorithms for widely adopted sequential detailed placement techniques, such as independent set matching, global swap, and local reordering.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Parallel multi-level analytical global placement on graphics processing units

  • J. Cong, Yi Zou
  • Computer Science
  • 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers
  • 2009
This paper describes the implementation of a state-of-the-art academic multi-level analytical placer mPL on Nvidia's massively parallel GT200 series platforms, and details the efforts on performance tuning and optimizations.

GDP: GPU accelerated Detailed Placement

  • Shounak Dhar, D. Pan
  • Computer Science
  • 2018 IEEE High Performance extreme Computing Conference (HPEC)
  • 2018
This paper demonstrates GPU acceleration of a dynamic programming based detailed placement algorithm that solves a generalized version of the Linear Arrangement Problem, achieving up to 7x speedup in runtime over a multi-threaded CPU implementation without any loss of QoR.

Accelerate analytical placement with GPU: A generic approach

This paper presents a generic approach to exploiting GPU parallelism to speed up the essential computations in VLSI nonlinear analytical placement. It uses the sparsity of the circuit graph to transform the compute-intensive portions into sparse matrix multiplications, which optimizes the memory access pattern and mitigates workload imbalance.

UTPlaceF 3.0: A parallelization framework for modern FPGA global placement: (Invited paper)

A parallelization framework for modern FPGA global placement, UTPlaceF 3.0 is proposed and two major techniques are presented to boost the performance of a state-of-the-art quadratic placer with only small quality degradation.

High-quality, deterministic parallel placement for FPGAs on commodity hardware

This paper describes the application of two parallelization strategies to the Quartus II FPGA placer and a process to quantify multi-core performance effects, such as memory subsystem limitations and explicit synchronization overhead, and fully describes these effects on a CAD tool for the first time.

MAPLE: multilevel adaptive placement for mixed-size designs

We propose a new multilevel framework for large-scale placement called MAPLE that respects utilization constraints, handles movable macros and guides the transition between global and detailed

elfPlace: Electrostatics-based Placement for Large-Scale Heterogeneous FPGAs

  • Wuxi Li, Yibo Lin, D. Pan
  • Computer Science
  • 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
  • 2019
Besides pure wirelength minimization, this work proposes a unified instance area adjustment scheme to simultaneously optimize routability, pin density, and downstream clustering compatibility, together with an augmented Lagrangian formulation, a preconditioning technique, and a normalized subgradient-based multiplier updating scheme.