The observational and experimental data collected by research agencies have grown dramatically in recent years. The ever-expanding breadth and fidelity of these scientific datasets allow scientists to tackle real-world physical problems in unprecedented ways. While high-fidelity data from sources such as satellites, telescopes, spacecraft, and numerical simulations is a valuable resource for research and scientific exploration, it also presents a number of challenges, the most significant of which are the sheer volume and complexity of the data.

State-of-the-art data compression algorithms are not sufficient to enable efficient long-term storage of the petabyte-sized datasets produced by numerical simulations. DLS is a novel data compression framework that enables compression ratios of over 100:1 for scientific datasets while still maintaining suitable error bounds.
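DLS's internal algorithm is not described here, but the core idea behind error-bounded lossy compression in general can be sketched with a simple uniform quantizer: values are snapped to a grid whose spacing guarantees a user-specified absolute error bound, and the resulting small integer codes compress far better than raw floats. The function names and the quantization scheme below are illustrative assumptions, not DLS itself.

```python
import numpy as np

def quantize(data, abs_error):
    """Snap each value to a grid of spacing 2*abs_error, so the
    reconstruction error is at most abs_error (a generic error-bounded
    lossy step, not DLS's actual algorithm)."""
    step = 2.0 * abs_error
    codes = np.round(data / step).astype(np.int64)  # small ints entropy-code well
    return codes, step

def dequantize(codes, step):
    return codes * step

# Smooth field, like simulation output
data = np.sin(np.linspace(0, 10, 1000))
codes, step = quantize(data, abs_error=1e-3)
recon = dequantize(codes, step)
# Every reconstructed value is within the requested error bound of the original.
```

In a full pipeline the integer codes would then be fed to a lossless entropy coder; the error bound is what distinguishes this class of compressors from generic lossless tools.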

This research is developing an optimized KokkosTensor API to support tensor transposes, tensor contractions, and the optimization of tensor expressions involving contractions and other element-wise tensor operators. The implementations will utilize Kokkos primitives to enable performance-portable code, will leverage high-performance vendor libraries where they exist (e.g., cuTensor), and will include architecture-aware tuned implementations where appropriate (e.g., for Nvidia, AMD, and/or Intel accelerators).
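To make the target operations concrete: a tensor contraction sums over shared indices of multi-dimensional arrays, and high-performance libraries typically map it onto a matrix multiply after transposing/reshaping. The NumPy sketch below shows only the reference semantics such an API must reproduce; it says nothing about the KokkosTensor implementation itself.

```python
import numpy as np

# Contraction C[a,b,i,j] = sum over p,q of A[a,b,p,q] * B[p,q,i,j].
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3, 4, 4))
B = rng.standard_normal((4, 4, 5, 5))

# Index-notation form (the specification of the operation):
C = np.einsum('abpq,pqij->abij', A, B)

# Equivalent reshape + GEMM form, which is how optimized tensor libraries
# commonly realize contractions on CPUs and GPUs:
C_gemm = (A.reshape(9, 16) @ B.reshape(16, 25)).reshape(3, 3, 5, 5)
```

The transpose-and-GEMM mapping is why optimized tensor-transpose kernels matter: for contractions whose indices are not already contiguous, the transpose cost can dominate.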

The project is designing a flexible Kalman filter code generator that produces highly optimized Kalman filter code given only a set of physics equations, a target architecture, and a description of the use case. Using operator fusion, operator specialization, optimizations for problem size and sparsity, and support for batch processing and distributed architectures, the framework will generate Kalman filtering code tailored to the specific features and requirements of the given problem.
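For reference, the standard linear Kalman filter predict/update step that such a generator would specialize looks as follows. The model matrices below (a 1-D constant-velocity system with position measurements) are purely illustrative; a generated version would fuse and specialize these operations for the concrete problem.

```python
import numpy as np

def kf_step(x, P, F, Q, H, R, z):
    """One generic linear Kalman filter step: the hand-written baseline
    that generated, problem-specialized code would replace."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative 1-D constant-velocity model, measuring position only.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-4 * np.eye(2)
R = np.array([[1e-2]])
x, P = np.zeros(2), np.eye(2)
x, P = kf_step(x, P, F, Q, H, R, z=np.array([1.0]))
```

A code generator can exploit the structure of F, H, Q, and R (sparsity, symmetry, fixed small sizes) to eliminate the generic matrix inverse and temporaries above.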

This software framework will enable developers to use standard Machine Learning Frameworks (MLFs) to create no-compromise neural network architectures that can be applied to entire Whole Slide Images (WSIs) without regard to GPU memory limitations. The software will analyze the graphs produced by MLFs, partition and allocate portions of the input images and neural network layers to distributed GPUs, and create a workflow and data movement plan to optimally leverage the compute power of the GPUs for deep learning networks that process very large images. This will allow developers to create more natural designs that achieve better accuracy than patch-based approaches, and to avoid creating human markups of images, relying instead on large pathology archives that have abundant diagnostic and outcome labels.

PerfDev is an intuitive, full featured and collaborative automated optimization and code analysis framework that promotes and enables on-the-fly performance optimization of advanced scientific applications to maximize code development and application efficiency.

ES-SciLA (Extreme Scale Linear Algebra for Science Applications) targets high performance and scalability on a range of high-performance (and exascale) compute architectures for diverse science applications (e.g., machine learning and PDE solvers). The ES-SciLA library will be implemented and optimized for multiple compute architectures, e.g., clusters, multi-core processors, many-core processors/accelerators, and GPU accelerators. ES-SciLA will be a sparsity-adaptive library, using an adaptive tiled implementation to optimize performance across a range of application types, architectures, and matrix sparsity patterns.
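The kernel at the heart of such a library is sparse matrix-vector multiply (SpMV), whose performance depends strongly on the storage format and tiling that a sparsity-adaptive library selects automatically. As a point of reference, here is a plain CSR (compressed sparse row) SpMV; the helper names are illustrative, not ES-SciLA APIs.

```python
import numpy as np

def dense_to_csr(A):
    """Build CSR arrays (row pointers, column indices, values) from a dense matrix."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.flatnonzero(row)
        indices.extend(nz.tolist())
        data.extend(row[nz].tolist())
        indptr.append(len(indices))
    return indptr, indices, data

def spmv_csr(indptr, indices, data, x):
    """Reference y = A @ x over the CSR arrays; an adaptive library would
    replace this loop with a format- and architecture-specific tiled kernel."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

rng = np.random.default_rng(0)
A_dense = rng.standard_normal((40, 40)) * (rng.random((40, 40)) < 0.1)
x = rng.standard_normal(40)
y = spmv_csr(*dense_to_csr(A_dense), x)
```

The adaptive-tiling idea is to inspect the nonzero pattern (banded, blocked, random) and choose a blocked or vector-friendly layout per tile rather than committing to one global format.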

Computational chemistry codes such as NWChem provide vital simulation capabilities to a wide range of scientists. These codes require significant compute resources and are among the key applications being targeted for upcoming exascale systems at the DOE. Optimizing tensor operations is a key aspect of reducing the computational requirements of these codes.

Tensor libraries are required that provide Tensor Operation Minimization (OpMin) techniques, specializations and usability support for computational chemistry codes (e.g., auxiliary indices, projected atomic orbitals, density-fitting bases), and optimized code generation for efficient heterogeneous computing with small tensors.

TensorOpt is a tensor optimization framework that optimizes tensor computations to improve the productivity of computational chemistry software developers in implementing new quantum chemistry models for efficient execution on emerging extreme-scale architectures.

The two primary features of TensorOpt are:

  • An OpMin tool for operation minimization of tensor contraction expressions, and
  • A collection of tools for efficient execution of batches of small-sized tensor contractions.
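Operation minimization can be illustrated with a small example: rewriting a multi-tensor contraction as a sequence of pairwise contractions can reduce the asymptotic operation count. The snippet below shows the kind of rewrite OpMin automates, not OpMin itself.

```python
import numpy as np

# T[i,j] = sum over p,q of A[i,p] * B[p,q] * C[q,j], all dimensions N.
# Evaluated as a single fused summation this costs O(N^4) multiplies;
# contracting pairwise ((A @ B) @ C) costs only O(N^3).
N = 20
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((N, N)) for _ in range(3))

# Unfactored evaluation (optimize=False forbids intermediate factoring):
T_naive = np.einsum('ip,pq,qj->ij', A, B, C, optimize=False)  # O(N^4)

# Operation-minimized evaluation via pairwise contraction:
T_fact = (A @ B) @ C                                          # O(N^3)
```

For the many-index contractions of coupled-cluster methods, choosing the cheapest pairwise contraction order (and the best shared intermediates across terms) is a nontrivial combinatorial problem, which is what makes an automated OpMin tool valuable.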

CloudBench is a hosted simulation environment for large-scale numerical simulations. CloudBench:NE is the application of CloudBench to nuclear engineering simulation tools. CloudBench will augment existing simulation, Integrated Development Environment (IDE), and workbench tools being developed by the DOE and industry. It offers a complete set of simulation management features not available in existing tools:

  • sharing of configurations, simulation output, and provenance on a per-simulation or per-project basis (ensuring that export control and license restrictions are maintained),
  • hosted versions of advanced simulation codes (removing the need for end user installation),
  • multi-simulation provenance history to allow simulations to be reconstructed, verified, or extended,
  • remote access to simulation tools installed on Cloud and HPC resources, and
  • commercial support for a workbench that supports open-source codes, government-sourced codes, and multiple vendor workflows.

Verification and validation (V&V) is a discrete process that cannot account for each and every possibility. This raises issues in the development of general-purpose numerical simulation packages because, while it is the simulation software developer's responsibility to ensure the product is mathematically correct, it is ultimately the responsibility of the end user to ensure the solution is a suitable representation of their physical model. After all, the direct costs of a design failure (be it time, money, or loss of life) fall squarely on the shoulders of the end user; any attempt to shift the blame to the developers of simulation library X will certainly fall on deaf ears.

The importance of graph applications is growing as society becomes more interconnected. Many real-world datasets are best modeled as graphs, e.g., road networks, the Internet, social networks, protein-protein interactions, utility grids, and communication. Graph analytics can be used to answer many different categories of questions, including traversal, querying, and data mining. It is often desirable to execute graph analytics applications on a wide range of hardware platforms (from supercomputers to mobile devices). The equivalence between a graph and the sparse matrix encoding its adjacency list has prompted considerable interest in casting various graph algorithms in terms of sparse matrix operations. We propose to identify a set of high-level matrix and graph primitives and to create efficient implementations of these primitives for multiple platforms. This will serve as the basis for a performance-portable graph framework. A runtime graph analysis engine will also be developed and integrated into the framework (to select the best algorithms and data structures based on the graph and compute hardware). This will allow graph analytics experts to develop performance-portable graph applications without concern for the low-level optimizations required to target their application to a particular hardware platform for multiple graph layouts.
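The graph-as-matrix equivalence is easiest to see with breadth-first search: each BFS level is one matrix-vector product of the adjacency matrix with the current frontier. The sketch below uses a small dense matrix for clarity; a real framework would use sparse formats and semiring operations.

```python
import numpy as np

# Adjacency matrix: A[i, j] = 1 means edge i -> j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

frontier = np.array([1, 0, 0, 0], dtype=bool)  # start BFS at vertex 0
visited = frontier.copy()
level = np.full(4, -1)
level[0] = 0

d = 0
while frontier.any():
    d += 1
    reached = (A.T @ frontier) > 0     # one matrix-vector product = one BFS level
    frontier = reached & ~visited      # mask out already-visited vertices
    visited |= frontier
    level[frontier] = d
# Resulting levels: vertex 0 -> 0, vertices 1 and 2 -> 1, vertex 3 -> 2.
```

Because the whole traversal reduces to masked matrix-vector products, a single tuned SpMV-style primitive per platform makes the algorithm performance-portable.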

SolverSelector is an add-on feature for large numerical software packages that improves the performance of simulations in terms of both execution time and reliability, and promotes portability across a range of computing platforms. It enables numerical software to automatically choose the optimal solvers at runtime, with minimal overhead, based on the evolving problem characteristics and the underlying compute platform. The optimal solver selection is based on machine learning models developed using supervised learning techniques.
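The shape of such a selector can be sketched as: extract cheap features from the current linear system, then ask a model trained on labeled benchmark runs which solver to use. Everything below is a toy stand-in; the actual feature set, model, and solver labels used by SolverSelector are not described here.

```python
import numpy as np

def features(A):
    """Cheap structural features of a matrix (hypothetical feature set)."""
    diag = np.abs(np.diag(A))
    offsum = np.abs(A).sum(axis=1) - diag
    return np.array([
        float(np.allclose(A, A.T)),       # symmetric?
        float(np.all(diag >= offsum)),    # diagonally dominant?
    ])

# Training pairs (feature vector -> best solver), as would come from
# offline supervised benchmarking. Labels are illustrative.
train_X = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
train_y = ['cg', 'gmres', 'gmres', 'jacobi']

def select_solver(A):
    """Nearest-neighbor lookup standing in for the trained ML model."""
    f = features(A)
    i = np.argmin(((train_X - f) ** 2).sum(axis=1))
    return train_y[i]

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric, diagonally dominant
choice = select_solver(A)                # a CG-style label for SPD systems
```

The key design constraint is that feature extraction must cost far less than a solve, so only properties computable in a pass or two over the matrix are worth using.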

MapReduce is a very popular data analytics framework that is widely used in both industry and scientific research. Despite its popularity, there are several obstacles to applying it to some commercial and scientific data analysis applications. iNFORMER is “A MapReduce-like Data-Intensive Processing Framework for Native Data Storage and Formats” that addresses many of the drawbacks of using MapReduce on scientific data.

The framework allows MapReduce-like applications to be executed over data stored in a native data format, without first loading the data into the framework. This addresses a major limitation of existing MapReduce-like implementations that require the data to be loaded into specialized file systems, e.g., the Hadoop Distributed File System (HDFS). The overheads and additional data management processes required for this translation can prevent MapReduce from being used in many commercial and scientific environments. 


The goal of this project was to develop optimizations to improve the performance and scalability of government codes, particularly Kestrel (a CREATE 3D code developed by the U.S. Air Force). Kestrel is a CREATE simulation tool for virtual fixed-wing aircraft simulation. To leverage the computational capabilities of new and emerging architectures, such codes must be redesigned to exploit distributed-memory, shared-memory, and SIMD parallelism. This effort focused on optimizations of the CFD solver and the fluid-structure interaction (FSI) in Kestrel.

The primary accomplishments of this work included:

  • The development and implementation of a multi-tiered Gauss-Seidel kernel that increases the number of operations performed between cache misses. This addresses low intra-node shared-memory performance caused by memory bandwidth limits. The scheme is also used to improve scalability across multiple distributed-memory nodes.
  • SIMD optimizations to the CFD solver to further improve the performance of Kestrel. This includes refactoring data structures and applying the optimizations within the Gauss-Seidel kernel.
  • The exploration of GPU and Xeon Phi kernels for the Gauss-Seidel (or similar) solver in Kestrel. Host/accelerator bandwidth is a major limiting concern for the existing Gauss-Seidel implementation.
  • The exploration of the expected performance of KSP solvers from alternate linear algebra libraries. These could benefit from increased local work to improve performance. However, for the example Kestrel use cases, these schemes do not exhibit a significant performance improvement because Gauss-Seidel already converges well for these test cases.
  • Exploration of an ADI-like scheme for unstructured solvers that reduces the number of iterations required for convergence compared to Gauss-Seidel.
  • Implementation and demonstration of a sub-iteration-free scheme for the FSI interaction. This addresses a significant bottleneck in the coupled Kestrel solver.
  • SIMD optimization of the FSI and CSD implementations to further improve the SMP performance of Kestrel.
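For readers unfamiliar with the kernel at the center of this work, a plain Gauss-Seidel sweep for a linear system Ax = b is shown below. This is only the textbook single-sweep baseline; the multi-tiered kernel described above reorders this work so that several sweeps' worth of operations reuse data while it is still in cache.

```python
import numpy as np

def gauss_seidel(A, b, x, sweeps):
    """Textbook Gauss-Seidel: update each unknown in place using the most
    recent values of its neighbors (the baseline the tiered kernel accelerates)."""
    n = len(b)
    for _ in range(sweeps):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])  # diagonally dominant, so the sweep converges
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b, np.zeros(3), sweeps=50)
```

The in-place dependency of each row on the rows updated before it is what makes Gauss-Seidel both fast to converge and hard to parallelize, motivating the tiering, coloring, and SIMD refactoring discussed above.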

The VERA Workbench is a Python-based platform that offers an IDE-like environment for using the various VERA tools. RNET explored the workbench under a Phase I SBIR from the Department of Energy. The VERA Workbench would be a standalone platform supporting automatic code setup, installation, and execution for the computational codes in the VERA suite of tools. We will add support for integrating with cloud-based services such as AWS and DOE clouds (e.g., ORNL’s CADES). The cloud integration will allow commercial, government, and academic nuclear engineering researchers to run coupled-physics simulations with little effort.

OrFPGA is an empirical performance optimization tool that efficiently explores the “user-tunable” parameter space of an FPGA design and assists in deducing the near-optimal design in terms of timing score, device utilization, and power consumption.

HDTomoGPR was developed under a Department of Energy (DOE) Phase II STTR entitled “Ground Penetrating Radar System (GPR) and Algorithms for Fine Root Analysis.” GPR technology is an excellent choice for non-destructive imaging and analysis of roots. Current GPR-based root analysis techniques are designed for coarse root analysis and are unable to provide an accurate image of the root structure. The focus of this project was to develop an unconventional GPR system whose greater accuracy, penetration depth, and imaging capabilities will extend GPR technology to more accurate root analysis.

RNET’s PowerAware platform provides fine-grained power monitoring capabilities to application developers. The platform performs high-frequency, fine-grained power and energy profiling for all system components and correlates the profiles to fine-grained application phases. This allows an application to monitor and tune its power and energy footprint on modern power-constrained large-scale computing systems.

RAPID Radar is a wide-area aerial surveillance tool for identifying and mapping the power lines in a power grid. The tool can be used to scan a healthy system or, more importantly, to map damage in the aftermath of a catastrophic event.

RNET Technologies has developed a video compression analysis system named Virtual Object Based Compression (VOBC) that includes many attractive options, e.g., real-time compression, AES encryption, a streaming video mode, synchronized audio and video, object tracking, and 3-D stereo capabilities. The compression performance of this system is much better than that achievable with MPEG-2 DVD technology and comparable to the MPEG-4 AVC standard, but at a relatively lower implementation cost in both hardware and software.

RNET had worked on chip- and board-level research and development projects, but is no longer actively pursuing work in this area. Work included research into the development of novel FPGAs (radiation-hardened and ultra-low-power designs), read-out integrated circuits (radiation and temperature hardening, and developments for strained-layer superlattice photodetectors), and a fault-tolerant Mid-Wave Infrared (MWIR) detector.