Program Analysis and Source-Level Communication Optimizations for High-Level Synthesis

Abstract

The use of hardware accelerators, e.g., GPGPUs or customized circuits on FPGAs, is particularly interesting for accelerating data- and compute-intensive applications. However, to get high performance, it is mandatory to restructure the application code, to generate adequate communication mechanisms, and to compile the different communicating processes so that the resulting application is highly optimized, with full usage of the memory bandwidth. In the context of the high-level synthesis (HLS) of hardware accelerators, we show how to automatically generate such an optimized organization for an accelerator communicating with an external DDR memory. Our technique relies on loop tiling, the generation of pipelined processes (overlapping communications and computations), and the automatic design (synchronizations and sizes) of local buffers. Our first contribution is a program analysis that specifies the data to be read from and written to the external memory so as to reduce communications and reuse data as much as possible in the accelerator. This specification, which can be used in different contexts, handles the cases where data can be redefined in the accelerator and/or approximations are needed because of non-analyzable data accesses. Our second contribution is an optimized code generation scheme, entirely at source level, that allows us to compile all the necessary glue (the communication processes) with the same HLS tool as for the computation kernel. Both contributions use advanced polyhedral techniques for program analysis and transformation. Experiments with Altera's C2H HLS tool show the correctness and efficiency of our technique.

Key-words: Polyhedral optimizations, communication optimizations, pipelined processes, DDR memory, hardware accelerators, HLS.
∗ Compsys, LIP, UMR 5668 CNRS, INRIA, ENS-Lyon, UCB-Lyon

inria-00601822, version 1, 25 Jul 2011
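
The organization summarized in the abstract (loop tiling, local buffers, and pipelined load/compute/store stages) can be pictured with a small C sketch. This is only an illustration under stated assumptions, not the code produced by the authors' method: the kernel `2*x + 1`, the tile size `TILE`, and the names `load_tile`, `compute_tile`, `store_tile`, and `run_tiled` are all hypothetical, `memcpy` stands in for burst transfers to the external DDR memory, and the three stages run sequentially here, whereas in an HLS setting they would be separate communicating processes with explicit synchronization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4 /* hypothetical tile size: elements moved per burst */

/* Stand-in for a burst read from external memory into a local buffer. */
static void load_tile(const int *ddr, size_t tile, int *local) {
    memcpy(local, ddr + tile * TILE, TILE * sizeof(int));
}

/* Stand-in for a burst write from a local buffer to external memory. */
static void store_tile(int *ddr, size_t tile, const int *local) {
    memcpy(ddr + tile * TILE, local, TILE * sizeof(int));
}

/* Hypothetical computation kernel; it touches local buffers only. */
static void compute_tile(const int *in, int *out) {
    for (size_t i = 0; i < TILE; i++) {
        out[i] = 2 * in[i] + 1;
    }
}

/*
 * Software-pipelined schedule over tiles with double buffering.
 * At step s: load tile s, compute tile s-1, store tile s-2.
 * Within one step the three stages use disjoint buffers, so they
 * could overlap if mapped to concurrent processes.
 */
void run_tiled(const int *ddr_in, int *ddr_out, size_t n) {
    assert(n % TILE == 0); /* sketch assumes full tiles only */
    size_t ntiles = n / TILE;
    int in_buf[2][TILE];
    int out_buf[2][TILE];

    for (size_t s = 0; s < ntiles + 2; s++) {
        if (s < ntiles) {
            load_tile(ddr_in, s, in_buf[s % 2]);
        }
        if (s >= 1 && s - 1 < ntiles) {
            compute_tile(in_buf[(s - 1) % 2], out_buf[(s - 1) % 2]);
        }
        if (s >= 2) {
            store_tile(ddr_out, s - 2, out_buf[s % 2]);
        }
    }
}
```

Calling `run_tiled(in, out, 8)` on two 8-element arrays processes two tiles in four pipeline steps. Two buffers per direction suffice in this sketch because each buffer is reused only after the stage that consumed it has finished; sizing such buffers automatically is the kind of design decision the abstract attributes to the proposed technique.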

Cite this paper

@inproceedings{Alias2011ProgramAA, title={Program Analysis and Source-Level Communication Optimizations for High-Level Synthesis}, author={Christophe Alias and Alexandru Plesco and Alain Darte}, year={2011} }