Accelerating Sparse Matricized Tensor Times Khatri-Rao Product on Massively Parallel Architectures
Andy Nguyen
Committee: Jee Choi
Honors Bachelor's Thesis (March 2023)
Keywords: high performance computing, mttkrp, tensor decomposition, gpu

Tensor decomposition is a popular data analysis technique that extracts latent information from multi-dimensional data. High-performance implementations of tensor decomposition face a number of challenges, such as efficiently storing sparse tensor data, achieving optimally load-balanced parallelism, and addressing the resource limitations of GPU accelerators. This thesis presents Blocked Linearized CoOrdinates (BLCO), a novel sparse tensor format built to accelerate tensor-related computations on massively parallel GPU architectures. In contrast to prior approaches, BLCO enables efficient out-of-memory computation of tensor algorithms using a unified implementation that operates on only a single in-memory tensor copy. Our linearization and adaptive blocking strategies not only meet the resource constraints of target GPU devices but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launch overhead, all while maintaining a high compression rate. To address the substantial synchronization costs on GPUs, we introduce an opportunistic conflict resolution algorithm in which threads collaborate to discover and resolve conflicting updates on the fly, at successive levels of the memory hierarchy. Overall, our tensor decomposition framework delivers superior in-memory performance compared to prior state-of-the-art GPU implementations, and it is the only framework currently capable of processing large-scale, out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves a 2.12x to 2.6x geometric mean speedup over the prior state of the art (Mixed-Mode Compressed Sparse Fiber) and demonstrates consistent performance across a gamut of representative real-world tensor data.
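The two core ideas named above — linearizing multi-mode coordinates into single keys, and the MTTKRP operation itself — can be illustrated with a small sketch. This is a toy model under assumed parameters, not the thesis's BLCO implementation: the 8x8x8 tensor shape, the 3-bit-per-mode packing, the rank R = 2, and the function names `linearize`/`delinearize` are all choices made here for illustration only.

```python
import numpy as np

def linearize(coords, bits):
    """Bit-pack per-mode indices (i, j, k) into one 64-bit key per nonzero."""
    keys = np.zeros(coords.shape[0], dtype=np.uint64)
    shift = 0
    # Pack the last mode into the lowest bits, earlier modes above it.
    for mode in reversed(range(coords.shape[1])):
        keys |= coords[:, mode].astype(np.uint64) << np.uint64(shift)
        shift += bits[mode]
    return keys

def delinearize(keys, bits):
    """Recover the per-mode indices from the packed keys."""
    coords = np.empty((keys.size, len(bits)), dtype=np.uint64)
    shift = 0
    for mode in reversed(range(len(bits))):
        mask = np.uint64((1 << bits[mode]) - 1)
        coords[:, mode] = (keys >> np.uint64(shift)) & mask
        shift += bits[mode]
    return coords

# Toy sparse tensor: 4 nonzeros of an 8x8x8 tensor (3 bits per mode suffice).
coords = np.array([[1, 2, 3], [4, 0, 7], [1, 2, 5], [7, 7, 7]])
keys = linearize(coords, bits=[3, 3, 3])
assert (delinearize(keys, [3, 3, 3]) == coords).all()  # packing round-trips

# Mode-0 MTTKRP over the nonzero list: for each nonzero value v at (i, j, k),
# accumulate M[i, :] += v * (B[j, :] * C[k, :]), where B and C are the
# factor matrices of the other two modes.
R = 2
rng = np.random.default_rng(0)
vals = rng.standard_normal(coords.shape[0])
B = rng.standard_normal((8, R))
C = rng.standard_normal((8, R))
M = np.zeros((8, R))
for (i, j, k), v in zip(coords, vals):
    M[i] += v * B[j] * C[k]
```

On a GPU, the serial accumulation loop above becomes the hard part: many threads may update the same row `M[i]` concurrently, which is the conflicting-update problem the opportunistic conflict resolution algorithm targets.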