The main contribution of this paper is to present an efficient GPU implementation of bulk computation of eigenvalues for many small, non-symmetric, real matrices. This work is motivated by the necessity of such bulk computation in designing of control systems, which requires to compute the eigenvalues of hundreds of thousands non-symmetric real matrices of size up to 30x30. Several efforts have been devoted to accelerating the eigenvalue computation including computer languages, systems, environments supporting matrix manipulation offering specific libraries/function calls. Some of them are optimized for computing the eigenvalues of a very large matrix by parallel processing. However, such libraries/function calls are not aimed at accelerating the eigenvalues computation for a lot of small matrices. In our GPU implementation, we considered programming issues of the GPU architecture including warp divergence, coalesced access of the global memory, utilization of the shared memory, and so forth. In particular, we present two types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. Furthermore, to hide CPU-GPU data transfer latency, overlapping computation on the GPU with the transfer is employed. Experimental results on NVIDIA TITAN~X show that our GPU implementation attains a speed-up factor of up to 83.50 and 17.67 over the sequential CPU implementation and the parallel CPU implementation with eight threads on Intel Core i7-6700K, respectively.
GPGPU; CUDA; eigenvalue problem; bulk computation; QR algorithm