Publications
Scaling Distributed Machine Learning with the Parameter Server
TLDR
Views on the newly identified challenges are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.
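The architecture this paper scales is the parameter-server push/pull pattern: workers pull only the parameter keys they need, compute gradients, and push them back asynchronously. Below is a minimal single-process sketch of that pattern; the class and method names are illustrative, not the paper's API.

```python
# Minimal sketch of the parameter-server push/pull pattern.
# All class and method names are illustrative, not the paper's API.
from collections import defaultdict

class ParameterServer:
    """Holds the model; workers push gradients and pull weights."""
    def __init__(self, lr=0.1):
        self.weights = defaultdict(float)
        self.lr = lr

    def push(self, grads):
        # Apply a worker's (possibly stale) gradients.
        for key, g in grads.items():
            self.weights[key] -= self.lr * g

    def pull(self, keys):
        # Return current values for the requested keys only (sparse access).
        return {k: self.weights[k] for k in keys}

class Worker:
    def __init__(self, server):
        self.server = server

    def step(self, example):
        # Pull just the parameters this example touches, compute a toy
        # gradient, and push it back to the server.
        keys = list(example)
        w = self.server.pull(keys)
        grads = {k: w[k] - example[k] for k in keys}  # toy least-squares gradient
        self.server.push(grads)

server = ParameterServer()
workers = [Worker(server) for _ in range(3)]
for worker in workers:
    worker.step({"feature_a": 1.0, "feature_b": -2.0})
print(server.pull(["feature_a", "feature_b"]))
```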
clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs
TLDR
This paper proposes a new sparse matrix format, the Cocktail Format, to take advantage of the strengths of many different sparse matrix formats, and develops clSpMV, a framework that analyzes arbitrary sparse matrices at runtime and recommends the best representation of each matrix on different platforms.
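The runtime analysis the summary describes can be pictured as a small format selector: inspect the sparsity pattern, then pick a storage format. The thresholds and candidate formats below are illustrative assumptions, not clSpMV's actual decision rules or the full Cocktail Format.

```python
# Toy runtime format selector in the spirit of the Cocktail Format idea:
# inspect the sparsity pattern, then pick a storage format. The thresholds
# and the candidate formats here are illustrative assumptions only.
def choose_format(rows, cols, entries):
    """entries: list of (row, col, value) triples for the nonzeros."""
    nnz = len(entries)
    per_row = [0] * rows
    on_diagonals = 0
    for r, c, _ in entries:
        per_row[r] += 1
        if abs(r - c) <= 1:          # on or next to the main diagonal
            on_diagonals += 1
    max_row = max(per_row)
    avg_row = nnz / rows
    if on_diagonals / nnz > 0.9:
        return "DIA"                  # dominated by a few diagonals
    if max_row <= 2 * avg_row:
        return "ELL"                  # uniformly short rows -> padded format
    return "CSR"                      # irregular rows -> general-purpose format

entries = [(0, 0, 1.0), (1, 1, 2.0), (2, 2, 3.0), (2, 3, 4.0)]
print(choose_format(4, 4, entries))  # -> "DIA"
```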
Efficient, high-quality image contour detection
TLDR
This work examines efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts, and proposes a contour detector that provides uncompromised contour accuracy.
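The generalized eigensolver mentioned here solves the Normalized Cuts eigenproblem (D - W)x = λDx for the smallest nontrivial eigenvectors. A small dense sketch with NumPy/SciPy follows; the affinity matrix is a made-up four-pixel example, and a real contour detector would use large sparse solvers instead.

```python
# Dense sketch of the Normalized Cuts eigenproblem (D - W) x = lambda * D x,
# the step the paper accelerates. The affinity matrix W is a toy example.
import numpy as np
from scipy.linalg import eigh

# Toy symmetric affinity matrix over 4 "pixels" (two loosely coupled pairs).
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.9],
              [0.0, 0.1, 0.9, 0.0]])
D = np.diag(W.sum(axis=1))
L = D - W                      # unnormalized graph Laplacian

# Generalized symmetric eigenproblem; eigh returns eigenvalues ascending.
vals, vecs = eigh(L, D)
fiedler = vecs[:, 1]           # second-smallest eigenvector partitions the graph
print(np.sign(fiedler))        # the two tightly coupled pairs land on opposite sides
```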
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
TLDR
Zion, Facebook's next-generation large-memory training platform consisting of both CPUs and accelerators, is presented, and the design requirements of future scale-out training systems are discussed.
Parallelizing CAD: A timely research agenda for EDA
TLDR
This work proposes that a key area of CAD research is to identify the design patterns underlying CAD applications and then build CAD application frameworks that aid efficient parallel software implementations of these design patterns.
Routability-driven analytical placement by net overlapping removal for large-scale mixed-size designs
TLDR
A new technique, called net overlapping removal, is proposed to optimize routability during placement, and a Gaussian smoothing technique is introduced to handle the challenging macro-porosity issue that arises in modern mixed-size designs with large macros.
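The Gaussian smoothing idea can be pictured as blurring the placement density map so an analytical placer sees a gradual slope, rather than a cliff, at macro boundaries. A minimal sketch with SciPy follows; the bin grid and sigma are illustrative choices, not the paper's settings.

```python
# Minimal sketch of smoothing a placement density map, in the spirit of the
# Gaussian smoothing technique the paper proposes for macro porosity.
# The bin grid and sigma are illustrative choices, not the paper's settings.
import numpy as np
from scipy.ndimage import gaussian_filter

density = np.zeros((16, 16))
density[4:12, 4:12] = 1.0          # a large macro occupying the center bins

smoothed = gaussian_filter(density, sigma=2.0)

# After smoothing, the sharp macro edge becomes a gradual slope, so an
# analytical placer gets a usable nonzero gradient near (and on) the macro.
gy, gx = np.gradient(smoothed)
print(smoothed[8, 8], gx[8, 2])    # flat interior vs. nonzero slope near edge
```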
CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
TLDR
CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads; results suggest that CPR can speed up training on a real production-scale cluster without notably degrading accuracy.
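The partial-recovery idea is that only the failed shard rolls back to its checkpoint while healthy shards keep their newer in-memory state, trading a little staleness for much faster recovery. The toy sketch below illustrates this; the classes and the injected failure are hypothetical, not CPR's implementation.

```python
# Toy sketch of partial recovery: on failure, only the failed shard reloads
# its checkpoint; healthy shards continue from their current (newer) state.
# Names and the failure injection are hypothetical, not CPR's implementation.
class Shard:
    def __init__(self, name):
        self.name, self.step, self.checkpoint = name, 0, None

    def save_checkpoint(self):
        self.checkpoint = self.step

    def recover(self):
        self.step = self.checkpoint   # roll back only this shard

shards = [Shard(f"emb{i}") for i in range(3)]
for t in range(10):
    for s in shards:
        s.step += 1                   # one training iteration
        if t % 4 == 0:
            s.save_checkpoint()
    if t == 6:
        shards[1].recover()           # simulate a failure on one shard only

# Healthy shards are ahead; the recovered shard is slightly stale, which the
# paper's evaluation suggests costs little accuracy in recommendation models.
print([s.step for s in shards])       # -> [10, 8, 10]
```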
Parallel Application Library for Object Recognition
TLDR
This dissertation exhaustively examines the design space for three application patterns and achieves significant speedups on them: 280× on the eigensolver application pattern, 12-33× on the breadth-first-search graph traversal application pattern, and 5-30× on the contour histogram application pattern.
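Of the three application patterns, the breadth-first-search traversal is the easiest to sketch. Below is the level-synchronous frontier formulation that parallel BFS implementations typically build on, shown serially for clarity.

```python
# Frontier-based (level-synchronous) BFS, the formulation parallel graph
# traversals build on: each level's frontier can be expanded in parallel.
def bfs_levels(adj, source):
    """adj: dict mapping node -> list of neighbors."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:            # this loop is the parallel region
            for v in adj[u]:
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))   # -> {0: 0, 1: 1, 2: 1, 3: 2}
```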
Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford
The ParLab at Berkeley, UPCRC-Illinois, and the Pervasive Parallel Laboratory at Stanford are studying how to make parallel programming succeed given industry's recent shift to multicore computing.
An Optimal Jumper-Insertion Algorithm for Antenna Avoidance/Fixing
TLDR
This paper presents an O(V)-time optimal jumper-insertion algorithm that uses the minimum number of jumpers to avoid/fix antenna violations in a routing tree with V vertices, and gives the first optimal algorithm for the tree-cutting problem.
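For intuition, a jumper cuts the charge-collecting wire area seen by a gate during manufacturing. The sketch below is a simplified greedy illustration on a routing tree; it is not the paper's optimal O(V)-time algorithm, and the tree and antenna limit are made up.

```python
# Greedy illustration of jumper insertion on a routing tree: walk up from the
# leaves and place a jumper on a child edge whenever absorbing that subtree's
# accumulated antenna (wire) charge would exceed the limit. A simplified
# heuristic to show the idea; NOT the paper's optimal O(V) algorithm.
def insert_jumpers(children, wirelen, root, limit):
    """children: node -> list of child nodes; wirelen: node -> length of the
    wire segment up to its parent. Returns the set of nodes whose parent
    edge receives a jumper."""
    jumpers = set()

    def charge(node):
        total = wirelen.get(node, 0)
        for c in children.get(node, []):
            sub = charge(c)
            if total + sub > limit:
                jumpers.add(c)        # jumper on edge (node, c) isolates c's charge
            else:
                total += sub
        return total

    charge(root)
    return jumpers

children = {0: [1, 2], 1: [3, 4]}
wirelen = {1: 3, 2: 2, 3: 4, 4: 5}
print(insert_jumpers(children, wirelen, 0, limit=6))   # -> {3, 4}
```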
...