Volkov’s article is made up of a set of presentation slides, 53 in number, and specifically deals with performance bottlenecks for LU factorization and matrix multiplication. To the extent that these computations share numerical operations in common with other compute intensive tasks, the advice given in the article provides a general overview of performance bottlenecks on GPUs.
Of special mention is the fact that article deals only with single precision computational bottlenecks. Even on Nvidia’s latest generation of GPUs (the “Fermi” series), double precision lags single precision performance by a factor of two or so. That said, much of what Volkov has to say is equally applicable to double precision.
Volkov’s motivation for the article — if any is ever needed for seeking improved computational performance — are comparative benchmarks from 2007, in which an Intel Core2 Quad 2.4GHz and a GeForce 8800 GTX are compared for matrix multiplication and LU factorization. The comparison is between Intel’s MKL (Math Kernel Library) and Nvidia’s CUBLAS (CUDA Basic Linear Algebra Subroutines). On a raw computational basis, the GPU has a roughly 5:1 advantage over the CPU. In practice, Volkov shows the advantage to be between 2:1 and zero! A rather disappointing result, but one that motivated Volkov to dig deeper into the problem.
GPUs obtain their raw computational power advantage over CPUs by having vastly more ALUs (Arithmetic Logic Units). However, this advantage in raw compute power can only be realized if the ALUs are all kept busy. Failure to fully occupy the ALUs will throw away this raw compute advantage; but to fully occupy the ALUs with a given computational task requires a good understanding of the GPU architecture.
Using the coding techniques suggested by Volkov yields impressive performance improvements over code written in a way less “sympathetic” to the low-level architecture of GPU. Moreover, the techniques appear to yield a higher fraction of theoretical peak performance over a range of GPU architectures from old to new (GeForce 8600 GTS, 8800 GTX, 9800 GTX, and GTX 280). The net result is about a doubling of performance over Nvidia’s CULAS code. And the same techniques suggested by Volkov have also yielded performance gains for other numerical kernels, such as for the FFT (Fast Fourier Transform).
Some of the bottlenecks pinpointed by Volkov will no doubt be improved in future GPU architectures. That will go some way to relieving the burden of optimizing performance from the application programmer. But anyone wanting to make their computations truly fly on GPUs would do well to be mindful of the how instructions and memory interact on GPUs at a very low level as discussed in Volkov’s article.
Volkov’s BLAS algorithms are now incorporated into newer versions of CULBAS.
Key takeaways:
- Raw computational power of a GPU scales with the number of cores.
- Memory bandwidth on a GPU scales with the number of memory controllers.
- GPUs have no scalar capability in that scalar and pointer operations use full vector length.
- GPU supports variable vector length, with 512 being the maximum. Often, maximum performance can be achieved with vectors of shorter length; 64 elements, for example.
- There is a hard ceiling of 2/3 of peak raw performance when using an operand with shared memory. Volkov contends that this is an inherent bottleneck in the GPU architecture.
- Row pivoting in column-major layout on GPU is slow.
- Don’t run small tasks (vectors and matrices with “N” and “M” dimensions less than several thousand) on GPU because the overhead isn’t worth it.
- Large matrix operations can benefit significantly if spread across multiple GPUs.
- Advice: Keep as much data as possible in registers, but bear in mind that doing so only helps with vector operations.
- Advice: Use shared memory, which is smaller and slower than register memory, as a communication device only. It is not only an issue of memory access (fetch and store) because some instructions run slower (require more clock cycles) when using shared memory.
- Advice: For scalar and pointer operations use shorter vectors, not longer. And, more generally, use shorter vectors not longer; indeed, divide longer vectors and their operations into shorter ones in application code, if necessary.
- Advice: It is sometimes the case that it is better to reduce memory traffic by recomputing results rather than storing and retrieving them from memory.
- Advice: Try fast algorithms that might fail occasionally; check for failure, and recompute if needed using a slower algorithm.
- Advice: Use heterogeneous algorithms (of the type that can run on either the CPU or GPU) and make the switch to GPU only for problems that exceed a certain size, or based on other problem characteristics.
- My advice: Question all assumptions, and all coding advice! If the cost of doing experiments with your code is low, do them.
Notes:
- [1] “Understanding performance bottlenecks in numerical kernels on GPUs” by Vasily Volkov, May 21, 2010, can be found here: http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/volkov2010-NTU1.pdf