Outside of efficient data structure and general algorithm selection, performance optimization and tuning is almost universally treated as a one-time, last-minute event to be completed prior to production. However, with the complexity and heterogeneity of HPC resources on the rise, performance optimization of numerical simulations is quickly emerging as a significant bottleneck in the application development life-cycle for large-scale scientific research. At the core of this issue is the fact that performance optimization is an inherently unintuitive task that requires in-depth knowledge of the inner workings of each application.

Optimization is further complicated by the continual modernization of high-end compute resources, which must often be leveraged by legacy codes that were originally developed before the modern era of multi-core, accelerator-based computing. To fully exploit these new architectures, such codes must be optimized specifically for them.

New tools are required to support porting and optimizing existing NASA applications and to develop new optimized NASA applications for these highly parallel and massively heterogeneous systems.

The PerfDev software development and optimization framework enables stateful runtime decomposition and passive exploration of code blocks. PerfDev will enable advanced parameter space exploration using deep learning. Instead of waiting to evaluate application performance at the end of a full development cycle, PerfDev will enable real-time performance feedback loops that facilitate the optimization of individual blocks of code as the application is developed or ported. The production PerfDev APIs will be integrated into multiple production environments (starting with Jupyter in Phase I) and will support a range of performance metric tools and hardware (e.g., PAPI/perfcntr, SONAR, Caliper) as well as multiple compute languages (e.g., C/C++, Fortran, Python). Toward that goal, and in an effort to demonstrate the importance, feasibility, and usability of the proposed framework, the Phase I effort will focus on the following objectives in order to develop a functional PerfDev prototype:

  • Code Cell Identification and Automatic Modularization for NASA Applications: Defining application "code cells" is key to enabling a PerfDev optimization analysis. The code cells are the fundamental unit of evaluation and analysis.
  • Distributed PerfDev Checkpoint-Restart: Checkpoint-restart is a fundamental utility that enables the efficient execution and evaluation of system and application parameters during code development. During Phase I, we will integrate support for MPI-supported, application-level checkpointing into the C/C++ backend of the PerfDev prototype.
  • PerfDev API for Automated Parameter Space Exploration: We will develop an API to enable passive, automatic parameter space exploration of code cells.
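
To make the three objectives above concrete, the following is a minimal Python sketch of how a code cell, a checkpoint, and a parameter sweep might fit together. It is an illustration under stated assumptions, not the PerfDev API: the names `explore_cell` and `blocked_sum_cell`, the `block_size` parameter, and the dictionary used as application state are all hypothetical. An in-memory deep copy stands in for distributed checkpoint-restart, wall-clock time stands in for hardware performance counters, and an exhaustive grid search stands in for the deep-learning-guided exploration described above.

```python
import copy
import itertools
import time


def explore_cell(cell, state, param_grid, repeats=3):
    """Time `cell` for every parameter combination, restoring state each run.

    Hypothetical helper: `cell` is any callable taking (state, **params).
    """
    # In-memory stand-in for a checkpoint of the application state.
    checkpoint = copy.deepcopy(state)
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        best = float("inf")
        for _ in range(repeats):
            # "Restart" from the checkpoint so every trial sees the same input.
            trial_state = copy.deepcopy(checkpoint)
            start = time.perf_counter()
            cell(trial_state, **params)
            best = min(best, time.perf_counter() - start)
        results.append((params, best))
    # Fastest configuration first.
    results.sort(key=lambda r: r[1])
    return results


def blocked_sum_cell(state, block_size):
    """Toy code cell: sums state['data'] in blocks and stores the total."""
    data = state["data"]
    total = 0.0
    for i in range(0, len(data), block_size):
        total += sum(data[i:i + block_size])
    state["total"] = total


if __name__ == "__main__":
    state = {"data": [float(i) for i in range(100_000)]}
    ranking = explore_cell(blocked_sum_cell, state, {"block_size": [16, 256, 4096]})
    for params, seconds in ranking:
        print(params, f"{seconds:.6f} s")
```

Because each trial runs on a fresh copy of the checkpointed state, the caller's state is never modified and the timings for different parameter values are directly comparable. A real implementation would replace the deep copy with application-level checkpoint files and the timer with counter-based metrics.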