Difference between revisions of "ENZO"

(Experiment Scalability)
(Enzo Version 1.5)
 
(30 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
=ENZO Performance Study Summary=
 
=ENZO Performance Study Summary=
  
This is a short overview of the performance results from the ENZO application. For each experiment we used these inits/param files:
+
This page shows the performance results from ENZO (svn repository version). We chose this version in part to see the effects of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].
  
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]
+
==Enzo Version 1.5==
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]
 
  
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance; in particular, we were interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).
+
Following the release of Enzo 1.5 in November '08, we have done some follow-up performance studies. Our initial findings are similar to what we found for version 1.0.1.
  
==TAU Measurement overhead==
+
The configuration files used were like these:
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).
 
  
{|
+
* [http://nic.uoregon.edu/~scottb/point.inits.large inits]
|-
+
* [http://nic.uoregon.edu/~scottb/point.param.large param]
! Run Type
 
! Runtime (seconds)
 
! Overhead %
 
|-
 
|Uninstrumented runtime
 
|1072
 
|NA
 
|-
 
|Trace of only MPI event
 
|1085
 
|4.8%
 
|-
 
|Profile of all significant events
 
|1136
 
|6.0%
 
|-
 
|Profile with Call-path information
 
|1196
 
|11.6%
 
|-
 
|Profile of each Phase of execution
 
|1208
 
|12.7%
 
|}
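
The overhead column can be recomputed from the runtimes as (instrumented runtime - uninstrumented runtime) / uninstrumented runtime. The sketch below is illustrative only: the function name `overhead_percent` is our own, and the runtimes are copied from the table above. Note that the MPI-only trace runtime of 1085 s works out to about 1.2%, not the 4.8% listed, so one of those two table entries may be mistyped; the other three rows agree with the table.

```python
# Runtimes in seconds, copied from the table above (64 processors, 8 nodes).
BASELINE_S = 1072  # uninstrumented run

INSTRUMENTED_S = {
    "Trace of only MPI event": 1085,
    "Profile of all significant events": 1136,
    "Profile with Call-path information": 1196,
    "Profile of each Phase of execution": 1208,
}


def overhead_percent(instrumented_s, baseline_s):
    """Extra runtime caused by instrumentation, as a percentage of the baseline."""
    return 100.0 * (instrumented_s - baseline_s) / baseline_s


overheads = {
    name: round(overhead_percent(seconds, BASELINE_S), 1)
    for name, seconds in INSTRUMENTED_S.items()
}

for name, pct in overheads.items():
    print(f"{name}: {pct}%")
```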
 
  
==Runtime Breakdown on 64 processors==
+
(The grid and particle sizes change between experiments).
  
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time takes over 60% of the total runtime.  
+
This chart shows the scaling behavior of Enzo 1.5 on Kraken:
  
[[Image:MeanFunctionLinux.png]]
+
[[Image:EnzoScalingKraken.png]]
  
==Experiment Scalability==
+
Scaling behavior was very similar on Ranger:
  
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is than it would be under ideal scaling. In this case, ideal efficiency would mean that doubling the processor count cuts the runtime in half.
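
The page does not give a formula for relative efficiency, so the following is a minimal sketch under one common definition: the ideal runtime (the baseline runtime scaled by the ratio of processor counts) divided by the measured runtime. The function name and the timings are hypothetical and are not measured ENZO results.

```python
def relative_efficiency(base_procs, base_runtime_s, procs, runtime_s):
    """Ratio of the ideal runtime to the measured runtime (1.0 means ideal scaling)."""
    # Under ideal scaling, doubling the processor count halves the runtime.
    ideal_runtime_s = base_runtime_s * base_procs / procs
    return ideal_runtime_s / runtime_s


# Hypothetical timings for illustration only; these are not measured ENZO results.
perfect = relative_efficiency(64, 1000.0, 128, 500.0)
degraded = relative_efficiency(64, 1000.0, 128, 800.0)
print(perfect, degraded)
```

With this definition, a value of 1.0 corresponds to the ideal case described above, and smaller values indicate that added processors are not being used effectively.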
+
[[Image:EnzoScalingRanger.png]]
  
[[image:Scaling128.png]] [[image:Scaling256.png]]
+
This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):
  
 +
[[Image:EnzoMeanBreakdown.png]]
  
This chart shows the breakdown of runtime by function across different numbers of processors. MPI communication time, as in the 64-processor case, continues to dominate the runtime, and to an even greater extent when a larger number of processors is involved.
+
With this much time spent in MPI communication, increasing the number of processors beyond 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Most likely ENZO is experiencing a load imbalance, causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.
  
[[image:MeanRuntineAtScale.png]]
+
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced simulation (blue) and the load-balanced simulation (red):
  
==Experiment Trace==
+
[[Image:EnzoMeanComp.png]]
This graphic shows how load imbalances cause long wait times in MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduce.
 
  
[[Image:trace.png|1000px]]
+
Time spent in MPI_Barrier decreased, but this was mostly offset by an increase in time spent in MPI_Recv.
  
==Experiment Call-Paths==
+
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.
We observe the following relationships in the experiment call-path:
 
  
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.
+
[[Image:EnzoCallpathMpiRecv.png]]
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.
 
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.
 
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.
 
  
This chart shows the details:
+
[[Image:EnzoCallpathMpiBarrier.png]]
  
[[Image:CallpathRuntime3.png]]
+
The MPI_Barrier calls take place in EvolveLevel(), and the MPI_Recv calls take place in grid::CommunicationSendRegions().
  
==Experiment Phases==
+
==Snapshot profiles==
We also looked at ENZO's runtime through each iteration of the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function, representing the computational work done during consecutive iterations (time is in microseconds):
 
  
[[Image:CommunicationShareGrids2.png]]
+
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer.  
  
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).
+
(The time spent before the first load balancing step, mostly initialization, has been removed.)
  
Here is a breakdown, by function, of the time spent over the course of the experiment. The y-axis is the exclusive time spent in each function, and the x-axis is the overall elapsed runtime:
+
For MPI_Recv:
 +
[[Image:EnzoSnapMpiRecvPercent.png|600px]]
  
[[Image:snapshot.png|1000px]]
+
For MPI_Barrier:
 +
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]

Latest revision as of 20:32, 14 July 2009

ENZO Performance Study Summary

This page shows the performance results from ENZO (svn repository version). We chose this version in part to see the effects of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are here.

Enzo Version 1.5

Following the release of Enzo 1.5 in November '08, we have done some follow-up performance studies. Our initial findings are similar to what we found for version 1.0.1.

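The configuration files used were like these:

* inits: http://nic.uoregon.edu/~scottb/point.inits.large
* param: http://nic.uoregon.edu/~scottb/point.param.large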

(The grid and particle sizes change between experiments).

This chart shows the scaling behavior of Enzo 1.5 on Kraken:

EnzoScalingKraken.png

Scaling behavior was very similar on Ranger:

EnzoScalingRanger.png

This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):

EnzoMeanBreakdown.png

With this much time spent in MPI communication, increasing the number of processors beyond 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Most likely ENZO is experiencing a load imbalance, causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.
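
To illustrate why a load imbalance shows up as time inside MPI_Barrier, here is a minimal sketch that models a single barrier: every rank must wait until the slowest rank arrives. It does not use MPI, the function names are our own, and the per-rank work times are invented for illustration rather than taken from the ENZO runs.

```python
def barrier_wait_times(work_ms):
    """Time each rank idles at a barrier, waiting for the slowest rank to arrive."""
    slowest = max(work_ms)
    return [slowest - t for t in work_ms]


def imbalance_factor(work_ms):
    """Slowest rank's work divided by the mean work; 1.0 means perfectly balanced."""
    return max(work_ms) / (sum(work_ms) / len(work_ms))


# Hypothetical per-rank compute times (ms) before one barrier; not ENZO data.
work_ms = [1000, 1200, 900, 2500]
waits = barrier_wait_times(work_ms)
mean_wait_ms = sum(waits) / len(waits)
print(waits, mean_wait_ms, round(imbalance_factor(work_ms), 2))
```

In this toy model a single slow rank makes the other three ranks spend more time waiting than working, which is the kind of behavior that would inflate the mean time per MPI_Barrier call far beyond the interconnect latency.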

Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced simulation (blue) and the load-balanced simulation (red):

EnzoMeanComp.png

Time spent in MPI_Barrier decreased, but this was mostly offset by an increase in time spent in MPI_Recv.

Callpath profiling gives us an idea where most of the costly MPI communications are taking place.

EnzoCallpathMpiRecv.png

EnzoCallpathMpiBarrier.png

The MPI_Barrier calls take place in EvolveLevel(), and the MPI_Recv calls take place in grid::CommunicationSendRegions().

Snapshot profiles

Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer.

(The time spent before the first load balancing step, mostly initialization, has been removed.)

For MPI_Recv: EnzoSnapMpiRecvPercent.png

For MPI_Barrier: EnzoSnapMpiBarrierPercent.png
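
As a rough illustration of how per-phase percentages like those in the two charts above could be derived, the sketch below assumes that each snapshot records cumulative values, so that one phase is the difference between two consecutive snapshots. The dictionary layout, the function name, and the numbers are hypothetical; this is not TAU's snapshot file format and not ENZO data.

```python
def per_phase_percent(snapshots, function):
    """Percent of each phase's elapsed time spent exclusively in `function`.

    Each snapshot is assumed to hold cumulative values, so a phase is the
    difference between two consecutive snapshots.
    """
    percents = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        elapsed = curr["elapsed_s"] - prev["elapsed_s"]
        in_function = curr["exclusive_s"][function] - prev["exclusive_s"][function]
        percents.append(100.0 * in_function / elapsed)
    return percents


# Hypothetical cumulative snapshots, one per load balancing step; not ENZO data.
snapshots = [
    {"elapsed_s": 0.0, "exclusive_s": {"MPI_Recv": 0.0}},
    {"elapsed_s": 10.0, "exclusive_s": {"MPI_Recv": 2.0}},
    {"elapsed_s": 30.0, "exclusive_s": {"MPI_Recv": 12.0}},
]
recv_percent = per_phase_percent(snapshots, "MPI_Recv")
print(recv_percent)
```

Expressing each phase as a percentage rather than as absolute time makes short early phases and long late phases directly comparable.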