Paper Title
High-Performance High-Order Stencil Computation on FPGAs Using OpenCL
Paper Authors
Abstract
In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model predicts the performance of high-order stencils with the same accuracy as it does for first-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.
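
To make the term "star-shaped stencil with a parameterized radius" concrete, the following is a minimal reference sketch in plain Python. It is not the authors' OpenCL kernel and includes no spatial or temporal blocking. The function names, the coefficient layout (one coefficient per distance from the center), the pass-through boundary handling, and the FLOP-counting convention (one multiply per point plus the additions that combine them) are all assumptions made for illustration.

```python
def star_stencil_2d(grid, radius, center_coeff, arm_coeffs):
    """Apply one time step of a 2D star-shaped stencil (hypothetical helper).

    grid         -- list of equal-length rows of floats
    radius       -- stencil radius (order); 1 gives the classic 5-point star
    center_coeff -- coefficient of the center cell
    arm_coeffs   -- arm_coeffs[d - 1] is the coefficient shared by the four
                    neighbors at distance d (assumed layout)

    Cells closer than `radius` to an edge are copied unchanged (assumed
    boundary handling).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(radius, rows - radius):
        for x in range(radius, cols - radius):
            acc = center_coeff * grid[y][x]
            for d in range(1, radius + 1):
                c = arm_coeffs[d - 1]
                acc += c * grid[y - d][x]  # north
                acc += c * grid[y + d][x]  # south
                acc += c * grid[y][x - d]  # west
                acc += c * grid[y][x + d]  # east
            out[y][x] = acc
    return out


def flops_per_cell_2d(radius):
    """FLOPs per cell update under the assumed counting convention:
    one multiply per stencil point plus (points - 1) additions."""
    points = 4 * radius + 1
    return 2 * points - 1
```

A radius of four touches 17 points per cell in 2D, so each update reads four times as many neighbors as a first-order star; this is the growth in computation intensity and on-chip buffering that the abstract refers to.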
