The Cray Linux Environment sometimes uses the Torque scheduler and the ALPS launcher for spawning jobs. In this environment, you typically write a batch script and submit it with the qsub command. The script invokes the ALPS tool aprun to launch your binary across the cluster.
By default, aprun optimizes the launch by pushing the application binary you supply to a RAM disk on each compute node. This staging mechanism improves the launch time for normal runs.
When aprun launches the Freja sampler, the primary sampler binary is staged in the same way. Unfortunately, ALPS does not know about the other Freja binaries, or even about your application, so these binaries are not staged. Consequently, execution fails because these binaries cannot be located.
In the Cray environment, you must rewrite your batch script to invoke aprun with the -b flag. This inhibits the staging mechanism altogether, and the normal rules for launching applications are used again.
The -b option is described in the aprun(1) manual page.
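The requirement that aprun is invoked with -b can be checked mechanically before a job is submitted. The following is a minimal sketch, not part of Freja or ALPS: check_aprun_bypass is a hypothetical helper name, and the check is a plain text match that assumes each aprun invocation starts its own line and passes -b as a separate word.

```shell
# Hypothetical helper (not part of Freja or ALPS): succeed only if every
# line of a batch script that invokes aprun also carries the -b flag.
check_aprun_bypass() {
    script=$1
    # Keep lines that start with aprun, then drop those that contain -b.
    # "|| true" keeps an empty result from being treated as an error.
    missing=$(grep -E '^[[:space:]]*aprun([[:space:]]|$)' "$script" \
        | grep -Ev '[[:space:]]-b([[:space:]]|$)' || true)
    if [ -n "$missing" ]; then
        printf 'aprun without -b in %s:\n%s\n' "$script" "$missing" >&2
        return 1
    fi
    return 0
}

# Example use before submitting:
#   check_aprun_bypass my-job.pbs && qsub my-job.pbs
```

Combined short options and aprun commands embedded in longer pipelines are not recognized by this simple match, so treat a passing result as a hint rather than a proof.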
The proper way to launch a Freja sampling run in a Cray Linux Environment with Torque and ALPS is to create a batch file, my-job.pbs:

    #!/bin/bash
    #PBS -N my-job
    #PBS -l mppwidth=64
    #PBS -l mppnppn=32
    #PBS -l walltime=00:10:00

    # set PATH to include the Freja bin directory
    PATH=$PATH:installation_directory/bin

    # change to the directory the job was submitted from
    cd $PBS_O_WORKDIR

    aprun -b -n 64 -N 32 sample -g 1 -o my-job-samplefiles/process-%r.smp \
        -r ./my-job arg1 arg2

and invoke this script using:

    $ qsub my-job.pbs

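Because the run fails when the Freja binaries cannot be located, it may help to let the batch script stop early if sample is not reachable through PATH. The sketch below assumes a hypothetical helper, require_on_path, called before the aprun line. It only inspects the node that executes the batch script, so it is a cheap sanity check and not a guarantee for the compute nodes.

```shell
# Hypothetical helper (not part of Freja or ALPS): report every command
# name that cannot be resolved through PATH and fail if any is missing.
require_on_path() {
    missing=0
    for name in "$@"; do
        if ! command -v "$name" >/dev/null 2>&1; then
            echo "not found on PATH: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use inside the batch script, before the aprun line:
#   require_on_path sample || exit 1
```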
Sample files appear in the directory $PBS_O_WORKDIR/my-job-samplefiles.
If you launch only one rank, omit -g 1, or change it to -g 0.
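After the job has finished, a quick way to confirm that sampling worked is to count the sample files. The sketch below rests on an assumption that is not stated explicitly above: that %r in the -o pattern expands to a per-rank value, so a 64-rank run leaves 64 files named process-*.smp. Both function names are hypothetical.

```shell
# Hypothetical helpers (not part of Freja). Assumption: %r in the -o
# pattern expands to a per-rank value, giving one process-*.smp per rank.
count_sample_files() {
    dir=$1
    count=0
    for f in "$dir"/process-*.smp; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Fail with a message when the number of files differs from the rank count.
check_sample_files() {
    dir=$1
    expected=$2
    found=$(count_sample_files "$dir")
    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected sample files in $dir, found $found" >&2
        return 1
    fi
    return 0
}

# Example use after the job has finished:
#   check_sample_files "$PBS_O_WORKDIR/my-job-samplefiles" 64
```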