Autonomic Performance Environment for eXascale (APEX)
Version 2.3.1
*************************************************************************

APEX - Autonomic Performance Environment for eXascale

sub-project of:

  Phylanx - A Distributed Array Toolkit
    http://phylanx.stellar-group.org

  HPX - High Performance ParalleX
    http://stellar.cct.lsu.edu/projects/hpx/

  eXascale PRogramming Environment and System Software (XPRESS)
    http://xstack.sandia.gov/xpress/

*************************************************************************

Copyright 2013-2020 Department of Computer and Information Science,
University of Oregon and partner institutions:

  Sandia National Laboratories (XPRESS)
  Indiana University (XPRESS)
  Lawrence Berkeley National Laboratory (XPRESS)
  Louisiana State University (XPRESS, HPX, Phylanx)
  Oak Ridge National Laboratory (XPRESS)
  University of Arizona (Phylanx)
  University of Houston (XPRESS)
  University of North Carolina (XPRESS)

All Rights Reserved.

*************************************************************************

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted, provided
that the above copyright notice appear in all copies and that both that
copyright notice and this permission notice appear in supporting
documentation, and that the name of University of Oregon (UO) and all
partner institutions not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior permission.
The University of Oregon and partner institutions make no representations
about the suitability of this software for any purpose. It is provided
"as is" without express or implied warranty.

THE UNIVERSITY OF OREGON AND PARTNER INSTITUTIONS DISCLAIM ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE UNIVERSITY OF OREGON OR
PARTNER INSTITUTIONS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL
DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.

*************************************************************************
One of the key components of the XPRESS project is a new approach to performance observation, measurement, analysis, and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX applications require a new approach to parallel performance observation. The standard model, in which multiple operating system processes and threads observe themselves in a first-person manner while writing out performance profiles or traces for offline analysis, will not adequately capture the full execution context, nor provide opportunities for runtime adaptation within OpenX. The approach taken in the XPRESS project is a new performance measurement system called APEX (Autonomic Performance Environment for eXascale). APEX includes methods for information sharing between the layers of the software stack, from the hardware through the operating and runtime systems, all the way to domain-specific or legacy applications. The performance measurement components incorporate relevant information across stack layers, merging third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads.
Essentially, APEX is both a measurement system for introspection and a policy engine for modifying runtime behavior based on those observations. While APEX can generate profile data for post-mortem analysis, the key purpose of its measurement is to support policy enforcement. To that end, APEX is designed to have very low overhead and to minimize perturbation of runtime worker thread productivity. APEX supports both start/stop timers and either event-based or periodic counter samples. Measurements are taken synchronously, but profiling statistics and internal state management are handled by (preferably lower-priority) threads distinct from the running application. The heart of APEX is an event handler that dispatches events to registered listeners within APEX. Policy enforcement can be triggered synchronously when events are raised by the OS/RS or application, or asynchronously on a periodic basis.
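The following sketch shows how an application typically starts and stops a timer and samples a counter. The header name and exact signatures (apex::init, apex::start, apex::stop, apex::sample_value, apex::finalize) are taken from recent APEX releases and may differ between versions:

```cpp
#include "apex_api.hpp"  // C++ interface; header name assumed from recent releases

void compute() {
    // Start a named timer; APEX returns a handle for this measurement.
    apex::profiler* p = apex::start("compute");

    double queue_length = 42.0;  // stand-in for a real runtime metric
    // Record an event-based counter sample.
    apex::sample_value("queue length", queue_length);

    // Stop the timer; the measurement is processed off the critical path
    // by APEX's (preferably lower-priority) internal threads.
    apex::stop(p);
}

int main() {
    apex::init("apex example", 0, 1);  // thread name, comm rank, comm size
    compute();
    apex::finalize();
    return 0;
}
```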
APEX is a library written in C++ with both C and C++ external interfaces. While the C interface can be used from either language, some C++ applications prefer to work with namespaces (i.e. apex::*) rather than prefixes (i.e. apex_*). All functionality is supported through both interfaces, and the C interface contains inlined implementations of the C++ code.
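For comparison, here is the same timer written against the C interface. The apex_profiler_handle type and the APEX_NAME_STRING tag used to identify timers by name are taken from recent APEX headers and should be treated as assumptions:

```c
#include "apex.h"  /* C interface; header name assumed from recent releases */

void compute(void) {
    /* Timers are identified by a type tag plus an identifier, here a name string. */
    apex_profiler_handle p = apex_start(APEX_NAME_STRING, "compute");
    /* ... work to be measured ... */
    apex_stop(p);
}

int main(void) {
    apex_init("apex example", 0, 1);  /* thread name, comm rank, comm size */
    compute();
    apex_finalize();
    return 0;
}
```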
While APEX was designed to support the current and future needs of ParalleX runtimes within the XPRESS project (such as HPX3 and HPX5), experimental support is also available for introspection of existing runtimes such as OpenMP. APEX could potentially be integrated into other runtime systems, such as any of a number of lightweight task-based systems. The introspection provided by APEX follows the third-person model, rather than the traditional first-person, per-thread/per-process application profiling or tracing measurement. APEX is designed to combine information from the OS, runtime, hardware, and application in order to guide policy decisions.
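To illustrate how observations feed policy decisions, the sketch below registers one synchronous and one periodic policy. The names involved (apex::register_policy, apex::register_periodic_policy, APEX_STOP_EVENT, APEX_NOERROR, apex::get_profile) are drawn from recent APEX releases and are assumptions here, not a definitive API reference:

```cpp
#include "apex_api.hpp"
#include <iostream>

void register_example_policies() {
    // Synchronous policy: invoked by the event handler each time a timer stops.
    apex::register_policy(APEX_STOP_EVENT,
        [](apex_context const& /*context*/) {
            // Inspect state and steer the runtime here.
            return APEX_NOERROR;
        });

    // Asynchronous policy: invoked once per second (period in microseconds)
    // on an APEX thread, off the critical path of the worker threads.
    apex::register_periodic_policy(1000000,
        [](apex_context const& /*context*/) {
            // Query the accumulated profile for a timer by name.
            apex_profile* profile = apex::get_profile("compute");
            if (profile != nullptr) {
                std::cout << "compute called " << profile->calls
                          << " times so far\n";
            }
            return APEX_NOERROR;
        });
}
```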
For distributed communication, APEX defines an API that must be implemented with the communication mechanism required by a given application. An MPI implementation is provided as a reference, and both HPX3 and HPX5 implementations exist. In this way, APEX is integrated into the observed runtime, and asynchronous communication is performed at a lower priority in order to minimize perturbation of the application.
For direct links to each API, as well as a complete user manual, please see the APEX documentation.
Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program, funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics), under award numbers DE-SC0008638, DE-SC0008704, DE-FG02-11ER26050, and DE-SC0006925.