High-performance computing (HPC) platforms are increasingly heterogeneous, with upcoming exascale platforms using heterogeneous processors (e.g., Intel Sapphire Rapids, Nvidia Grace, AMD EPYC) that include vector engines, matrix engines, and heterogeneous cores, coupled with compute accelerators from a variety of vendors (e.g., Intel Xe HPC/Ponte Vecchio, Nvidia A100, AMD MI100). Developing applications that compile and run efficiently across this wide range of compute environments is a massive challenge: it is impractical to maintain a distinct optimized version of each application for every possible hardware architecture. Vendors often provide optimized matrix libraries to efficiently utilize their hardware (e.g., Intel oneAPI MKL, Nvidia cuSPARSE, AMD hipSPARSE).

A tensor is a multi-dimensional generalization of a matrix. Tensors are fundamental building blocks in a wide range of HPC applications, including artificial intelligence (e.g., deep neural networks) and numerical modeling and simulation (e.g., finite element codes). While well-tuned, ultra-high-performance matrix operations enjoy broad support (including vendor, commercial, and research implementations), tensor operations do not enjoy the same breadth of highly tuned, architecture-aware implementations: standardized libraries for high-performance tensor computations are not available, and there is no AMD or Intel GPU functionality comparable to Nvidia's cuTensor library for tensor contractions. Therefore, high-productivity, performance-portable software development frameworks are critical to leveraging upcoming HPC architectures.

Kokkos is a popular production-level framework for developing performance-portable codes, developed at the U.S. Department of Energy's Sandia National Laboratories. It is used in over 100 software components and applications to achieve performance portability on at least 5 of the top 10 supercomputers. Users write their applications using Kokkos abstractions (e.g., patterns, policies, and spaces), and performance portability is automatically enabled across Kokkos-supported hardware platforms, including multi-core CPUs and GPUs; users thus do not have to be concerned with portability across different compute architectures. The base Kokkos framework is complemented by the KokkosKernels library, which provides a collection of key sparse matrix operations, again offering performance portability across hardware platforms. However, Kokkos/KokkosKernels currently does not provide support for efficient tensor operations, such as that provided by Nvidia's cuTensor library. Adding tensor support to KokkosKernels will benefit many DOE and commercial applications, such as the higher-order finite element discretizations used in hypersonic reentry, mechanics, and fire simulations at Sandia. Other examples include solvers for matrix-free multigrid methods and several further use cases across the broader DOE complex. As heterogeneous compute architectures permeate industry and developer workstations, the need for industry codes to follow suit is expected to increase.

Therefore, there is an opportunity to provide a KokkosTensor distribution that includes a tensor API and architecture-aware optimizations. This research is developing an optimized KokkosTensor API that supports tensor transpose and tensor contraction, as well as optimization of tensor expressions involving contractions and other element-wise tensor operators. The interface will support expressions defined in Einstein notation. The implementations will use Kokkos primitives to enable performance-portable code, will leverage high-performance vendor libraries where they exist (e.g., cuTensor), and will include architecture-aware tuned implementations where appropriate (e.g., for Nvidia, AMD, and/or Intel accelerators).