Corpus ID: 14525579

Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

@article{Merchant2016AcceleratingBO,
  title={Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design},
  author={Farhad Merchant and Tarun Vatwani and Anupam Chattopadhyay and Soumyendu Raha and S. K. Nandy and Ranjani Narayan},
  journal={ArXiv},
  year={2016},
  volume={abs/1610.06385}
}
  • Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S. K. Nandy, Ranjani Narayan
  • Published in ArXiv 2016
  • Computer Science
  • Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, earlier-generation multicore processors and General Purpose Graphics Processing Units (GPGPUs) achieve 15% to 57% of the theoretical peak performance at 65W to 240W, respectively, for compute-bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth-bound operations like Single/Double Precision Matrix-vector Multiplication…

    Citations

    Publications citing this paper.

    Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

    Efficient Realization of Householder Transform Through Algorithm-Architecture Co-Design for Acceleration of QR Factorization

    References

    Publications referenced by this paper.

    Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

    Efficient Realization of Table Look-Up Based Double Precision Floating Point Arithmetic

    Efficient QR Decomposition Using Low Complexity Column-wise Givens Rotation (CGR)

    Efficient and scalable CGRA-based implementation of Column-wise Givens Rotation
