Computational chemistry codes such as NWChem provide vital simulation capabilities to a wide range of scientists. These codes require significant compute resources and are among the key applications targeted for the upcoming Exascale systems at the DOE. Optimizing tensor operations is key to reducing the computational requirements of these codes.

Tensor libraries are needed that provide tensor operation minimization (OpMin) techniques, including specializations and usability support for computational chemistry codes (e.g., auxiliary indices, projected atomic orbitals, density-fitting bases), as well as optimized code generation for efficient heterogeneous computing with small-sized tensors.

TensorOpt is a tensor optimization framework that optimizes tensor computations to improve the productivity of computational chemistry software developers in implementing new quantum chemistry models for efficient execution on emerging Extreme Scale architectures.

The two primary features of TensorOpt are:

  • An OpMin tool for operation minimization of tensor contraction expressions, and
  • A collection of tools for efficient execution of batches of small-sized tensor contractions.

The new OpMin tool will be structured in three layers as shown in the figure. The innermost layer, OpMin Term, will optimize a single product term of tensor contractions by rewriting it as a sequence of binary tensor contractions. The middle layer, OpMin Exp, will optimize a collection of tensor expressions involving tensor products and tensor sums. The outer layer, OpMin Iter, will optimize iterative loops containing tensor contraction expressions, where some of the tensors are invariant and others are modified in each iteration. Because some tensors are invariant in an iterative computation, the optimal sequence of binary tensor operations for the iterative context can be quite different from the sequence that would be optimal for a single, non-iterative evaluation.
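As a minimal illustration of the kind of rewriting OpMin Term performs (the tensor names, extents, and index labels below are hypothetical and not part of the OpMin interface), consider a three-tensor product term. Evaluating it as a single unfactored summation costs O(N^4) multiply-adds, whereas rewriting it as a sequence of two binary contractions reduces the cost to 2*N^3:

```python
import numpy as np

# Hypothetical tensors; N is a small extent chosen only for illustration.
N = 16
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.random.rand(N, N)

# Unfactored single term: R[i,j] = sum_{k,l} A[i,k] * B[k,l] * C[l,j]
# Evaluated directly as one four-index summation, this costs O(N^4) multiply-adds.
R_direct = np.einsum('ik,kl,lj->ij', A, B, C, optimize=False)

# The same term rewritten as a sequence of two binary contractions,
# each costing O(N^3), for a total of roughly 2*N^3 multiply-adds.
T = np.einsum('ik,kl->il', A, B)         # intermediate T[i,l]
R_binary = np.einsum('il,lj->ij', T, C)

assert np.allclose(R_direct, R_binary)
```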

The batched small tensor framework addresses efficient computation on small-sized tensors. In contrast to the large dense tensors used in current implementations of accurate quantum chemistry methods such as coupled cluster methods, the recently developed, computationally efficient linear-scaling models operate on block-sparse tensors with relatively small block sizes. Therefore, instead of efficient routines to contract large dense tensors, we need efficient routines to contract large batches of small tensor blocks. For matrices, batched-BLAS routines have been implemented in vendor numerical libraries such as Intel's MKL and Nvidia's cuBLAS. Higher performance can be achieved when a batch of independent small-matrix multiplications is performed collectively, compared to invoking individual function calls for each matrix product. Similar software is needed for batched contractions of small-sized tensors, but such functionality is not available in any vendor library, or, to our knowledge, even in any research prototype.
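The batched-matrix case can be sketched with NumPy's broadcasting matmul, which dispatches an entire batch of small products in one call; it serves here only as a stand-in for vendor batched-GEMM routines (e.g., MKL's cblas_dgemm_batch or cuBLAS's cublasDgemmBatched), and the batch and block sizes are illustrative:

```python
import numpy as np

# A batch of 10,000 independent 8x8 matrix products, representative of the
# small block sizes arising in block-sparse tensor methods (sizes illustrative).
batch, m, k, n = 10_000, 8, 8, 8
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)

# One call performs the entire batch; vendor batched-GEMM routines expose the
# same pattern with far lower per-call overhead than looping over small GEMMs.
C = np.matmul(A, B)                                  # shape: (batch, m, n)

# Equivalent result, but with one function call per small matrix pair.
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_loop)
```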

As we elaborate later in the proposal, there are two broad approaches to performing a tensor contraction: i) a direct approach, in which nested-loop code performs the collection of elementary arithmetic operations that produce the output tensor, and ii) the indirect TTGT (Transpose-Transpose-GEMM-Transpose) approach, which applies index permutations (transpositions) to the input tensors so that a tuned vendor matrix-multiplication (GEMM) routine can perform all the needed arithmetic operations. The former approach requires a code generator that synthesizes customized, efficient contraction code for the target platform for each tensor contraction in a tensor contraction expression. The latter approach can be implemented effectively by developing a library function for batched tensor transposition. We plan to leverage both approaches by developing BatchTT tools for batched tensor transposition and TenGen for generating direct loop code for contractions. In conjunction with the TenGen and BatchTT tools to enable high-performance computing on batches of tensor contractions, we will also develop tailored performance-prediction models for the two modes of batched tensor contractions (for rapid evaluation in online usage modes). We are exploring different machine learning approaches, including simple regression, decision trees, and neural networks.
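A minimal sketch of the two modes on a single contraction follows; the contraction, tensor names, and extents are hypothetical and chosen only to show where the transpositions, reshapes, and GEMM fall in the TTGT pipeline:

```python
import numpy as np

# Hypothetical contraction C[a,c,b,d] = sum_e A[a,e,b] * B[c,e,d]
# (names and extents are illustrative only).
na, nb, nc, nd, ne = 6, 5, 4, 3, 7
A = np.random.rand(na, ne, nb)
B = np.random.rand(nc, ne, nd)

# Direct approach: nested-loop semantics, expressed here via einsum.
C_direct = np.einsum('aeb,ced->acbd', A, B)

# TTGT approach: transpose inputs, reshape to matrices, GEMM, transpose result.
A_t = A.transpose(0, 2, 1).reshape(na * nb, ne)               # [ab, e]
B_t = B.transpose(1, 0, 2).reshape(ne, nc * nd)               # [e, cd]
C_mat = A_t @ B_t                                             # GEMM: [ab, cd]
C_ttgt = C_mat.reshape(na, nb, nc, nd).transpose(0, 2, 1, 3)  # -> [a, c, b, d]

assert np.allclose(C_direct, C_ttgt)
```

The transposition steps dominate the overhead of TTGT for small blocks, which is why an efficient BatchTT-style batched transposition capability matters when many such small contractions must be performed.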