Corpus ID: 220265858

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

@article{Lepikhin2021GShardSG,
  title={GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding},
  author={Dmitry Lepikhin and H. Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Y. Huang and M. Krikun and Noam M. Shazeer and Z. Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2006.16668}
}
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
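The annotation-based approach the abstract describes can be illustrated with a short sketch. The snippet below is not the paper's TensorFlow/Lingvo API; it uses JAX, whose sharding machinery is built on the XLA SPMD partitioner that this line of work led to (see the GSPMD follow-up listed under Citations below). The mesh axis name "model", the tensor shapes, and the feed-forward example are illustrative choices, not taken from the paper; the point is only that the user annotates how a few weights are split across devices, and the compiler propagates shardings and inserts the cross-device communication.

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Logical 1-D mesh over whatever devices are available (TPU cores, GPUs,
    # or a single CPU when run locally).
    devices = np.array(jax.devices())
    mesh = Mesh(devices, axis_names=("model",))

    batch, d_model, d_ff = 8, 64, 256
    x = jnp.ones((batch, d_model))
    w_in = jnp.ones((d_model, d_ff))
    w_out = jnp.ones((d_ff, d_model))

    # The annotation step: declare how the two feed-forward weights are split
    # along the "model" mesh axis. Nothing else in the model is annotated.
    w_in = jax.device_put(w_in, NamedSharding(mesh, P(None, "model")))
    w_out = jax.device_put(w_out, NamedSharding(mesh, P("model", None)))

    @jax.jit
    def ffn(x, w_in, w_out):
        # Written as ordinary dense ops; the SPMD partitioner shards the
        # intermediate activations to match the weights and inserts the
        # reduction needed to combine partial results of the second matmul.
        h = jax.nn.relu(x @ w_in)
        return h @ w_out

    y = ffn(x, w_in, w_out)
    print(y.shape, y.sharding)

Only the two device_put calls mention the mesh; the model code itself stays as plain dense ops, which is the "minimal changes to the existing model code" property the abstract claims for the annotation APIs.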
Citations

CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning
Automatic Graph Partitioning for Very Large-scale Deep Learning
GSPMD: General and Scalable Parallelization for ML Computation Graphs
Doing more with less: training large DNN models on commodity servers for the masses
