Cache and Bandwidth Aware Matrix Multiplication on the GPU

Tech Report UIUCDCS-R-2003-2328, University of Illinois Dept. of Computer Science, Mar. 2003

Recent advances in the speed and programmability of consumer-level graphics hardware have sparked a flurry of research that goes beyond the realm of image synthesis and computer graphics. We examine the use of the GPU (graphics processing unit) as a tool for scientific computing by analyzing techniques for performing large matrix multiplies in GPU hardware. An earlier method for multiplying matrices on the GPU suffered from memory bandwidth problems. This paper examines more efficient algorithms that make large matrix multiplication on upcoming GPU architectures more competitive, using only 25% of the memory bandwidth and instructions required by previous GPU algorithms.
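Bandwidth savings in matrix multiplication generally come from reuse. If each unit of work computes a small tile of the output instead of a single element, every value it loads contributes to several multiply-adds. The sketch below is a CPU-side Python illustration of that general principle, usually called output tiling or register blocking. It is not the report's GPU algorithm and does not reproduce its 25% figure. The function name `blocked_matmul` and the `block` parameter are hypothetical. The load counter models an idealized memory rather than GPU texture fetches.

```python
def blocked_matmul(a, b, block=2):
    """Multiply square matrices a and b (lists of lists).

    Each block x block tile of the output is computed as one unit of work,
    mimicking a fragment that produces several outputs. Returns the product
    and the number of scalar loads from a and b (an idealized cost model).
    """
    n = len(a)
    if n % block != 0:
        raise ValueError("matrix size must be a multiple of block")
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            # Accumulators for one output tile, kept local ("in registers").
            acc = [[0.0] * block for _ in range(block)]
            for bk in range(0, n, block):
                # Load one tile of A and one tile of B once per step...
                ta = [row[bk:bk + block] for row in a[bi:bi + block]]
                tb = [row[bj:bj + block] for row in b[bk:bk + block]]
                loads += 2 * block * block
                # ...then reuse each loaded value `block` times.
                for i in range(block):
                    for j in range(block):
                        for k in range(block):
                            acc[i][j] += ta[i][k] * tb[k][j]
            # Write the finished tile back to the output matrix.
            for i in range(block):
                for j in range(block):
                    c[bi + i][bj + j] = acc[i][j]
    return c, loads
```

Under this model the total load count is 2n³/block. For a 4×4 input, block sizes 1, 2, and 4 report 128, 64, and 32 loads, so doubling the tile size halves the traffic. Real GPUs cap the useful tile size because of limits on registers, texture formats, and instruction counts.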

