Performance Portability of Sparse Tensor Decomposition Operations
Sean Isaac Geronimo Anderson, Jee Choi
Committee: Jee Choi (chair), Boyana Norris, Brittany Erickson
Directed Research Project (Mar 2022)
Keywords: tensorial array, tensor, sparse, decomposition, performance portability

We leverage the Kokkos library to study the performance portability of parallel sparse tensor decomposition on CPU and GPU architectures. Real-world multi-way data can be represented as a multi-dimensional array, or tensor, and tensor rank decomposition can reveal latent information within the data. Tensors storing real-world data are often large and sparse, necessitating space-efficient storage and time-efficient parallel algorithms. CANDECOMP/PARAFAC via Alternating Poisson Regression with Multiplicative Updates (CP-APR MU) is a memory bandwidth-bound algorithm that computes the tensor rank decomposition of count data and is composed of simple array operations. We compare the performance of Kokkos implementations of three kinds of kernels (simple array operations, the matricized tensor times Khatri-Rao product (MTTKRP), and CP-APR MU) against platform-specific implementations on CPUs and GPUs.
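
As an illustration rather than code from this study, a minimal sketch of the kind of simple array operation referred to above (an element-wise multiply) could be written once in Kokkos and compiled for either a CPU backend (e.g., OpenMP) or a GPU backend (e.g., CUDA); the kernel labels and problem size here are hypothetical:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;  // illustrative array length

    // Views are allocated in the default execution space's memory
    // (host memory for CPU backends, device memory for GPU backends).
    Kokkos::View<double*> x("x", n);
    Kokkos::View<double*> y("y", n);
    Kokkos::View<double*> z("z", n);

    // Fill the inputs in parallel on the default execution space.
    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0 * i;
      y(i) = 2.0 * i;
    });

    // Element-wise multiply: a single source expression that Kokkos maps
    // to host threads on CPUs or device threads on GPUs at compile time.
    Kokkos::parallel_for("ewise_mult", n, KOKKOS_LAMBDA(const int i) {
      z(i) = x(i) * y(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

The same source builds against different Kokkos backends, which is the single-implementation portability the study evaluates on larger kernels such as MTTKRP and CP-APR MU.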

Our results show that, with a single implementation, Kokkos delivers performance comparable to hand-tuned code for the simple array operations that make up tensor decomposition kernels across a wide range of CPU and GPU systems, superior performance for the MTTKRP kernel on CPUs, and comparable or lower performance for the CP-APR MU kernel on CPU systems.