Filesystem and Storage
Compute Node Partitions
Submitting Jobs
Monitoring Jobs
Intrepid Details
Argonne National Lab's Intrepid supercomputer is an IBM BG/P system with 40,960 quad-core compute nodes. Each compute node has 2GB of memory shared among its four 850MHz PowerPC 450 cores, and how that memory is divided depends on the mode in which the job is executed: 1) SMP mode runs a single task with 4 threads per node and 2GB of memory per task, 2) DUAL mode runs two tasks with 2 threads each per node and 1GB of memory per task, and 3) VN (Virtual Node) mode runs four single-threaded tasks per node with 512MB of memory per task.
BG machine login nodes run a Linux-like kernel, while the compute nodes run a very lightweight kernel called the Compute Node Kernel (CNK). Each time a partition is selected for a run it is rebooted and the CNK is reloaded. The CNK supports only a limited set of Linux-like system calls. Because of this disparity in kernels, code intended to run on the compute nodes must be compiled with a cross-compiler on the login nodes. The cross-compilers are recognized by the bg prefix at the beginning of their names, and there are MPI wrappers for all of these compilers.
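As a minimal sketch of the compile step (the wrapper names below are assumptions based on typical BG/P installations and may differ on Intrepid; check the login nodes for the exact names):
# IBM XL cross-compiler invoked directly
bgxlc -O3 -o hello hello.c
# MPI wrapper around the cross-compiler
mpixlc -O3 -o hello_mpi hello_mpi.c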
Filesystem and Storage
The /gpfs1/user_name/ directory is currently what I have been using as the equivalent of the $SCRATCH directory on other systems. I have yet to find any information about quotas on the Intrepid system. There is also a /scratch directory where users may create their own user_name directory for storing files. Note that this directory (as far as I can tell) lives on the login nodes, and therefore executables cannot be run from it.
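A minimal sketch of setting up these locations (user_name is a placeholder for your login name, and /gpfs1/user_name is assumed to already exist for your account):
# personal directory under /scratch on the login nodes (not for running executables)
mkdir -p /scratch/user_name
# do production runs from GPFS instead
cd /gpfs1/user_name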
Compute Node Partitions and Charging
The compute nodes on Intrepid are divided into partitions, the smallest of which contains 64 compute nodes (i.e. 256 cores). Every larger partition is a power-of-two multiple of the smallest. When users submit a job they do not need to specify which partition to run on; the submission script determines the partition size from the number of nodes requested (which in turn follows from the process count and the mode: SMP, DUAL, or VN). This means that if you wanted to run on only 30 cores in VN mode (i.e. 7.5 compute nodes' worth), the smallest partition you could use would be the 64-node partition, and you would be charged for all 64 nodes. Additionally, partitions cannot be combined: if a user needs a total of 768 VN cores (192 nodes), the system will not combine a 64-node partition with a 128-node partition. Instead, the smallest single partition that contains the required number of nodes is selected, in this case a 256-node partition, and the user is still charged for the unused 64 nodes. As an example, the 30-core job mentioned above, running for 5 hours, would cost:
64(nodes) * 4(cores/node) * 5(hours) = 1,280 core hours.
As another example, suppose a user needs to run a 1500-process job for 20 hours and each process requires 900MB of RAM. The memory constraint implies that the user should run in DUAL mode (2 processes per node, 1GB per process), for a total of 750 nodes. Since 750 does not match any available partition size, the smallest partition that fits the job contains 1024 nodes, and the user would therefore be charged:
1024(nodes) * 4(cores/node) * 20(hours) = 81,920 core hours.
The key point of these examples is that the core-hour cost of a job is always (# nodes in the selected partition) * 4 * (# hours), and that memory requirements constrain the number of nodes the job must run on.
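The rounding and charging rules above can be written as a short shell sketch (a hypothetical helper, not an official tool; the 64-node minimum and power-of-two partition sizes follow the description above):
# core-hour estimate for the second example: 1500 processes, DUAL mode (2 per node), 20 hours
nprocs=1500; procs_per_node=2; hours=20
nodes=$(( (nprocs + procs_per_node - 1) / procs_per_node ))                 # 750 nodes needed
partition=64
while [ $partition -lt $nodes ]; do partition=$(( partition * 2 )); done    # rounds up to 1024
echo "$(( partition * 4 * hours )) core hours"                              # prints: 81920 core hours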
Submitting Jobs
Currently, there is no batch-script submission on Intrepid; all job submissions are made from the command line through a Python script called qsub. The typical calling sequence is:
qsub -t time -n num_nodes [more options] name_of_executable [executable's arguments]
The following table lists a few of the important arguments to this script:
option    argument      description
-t        time          the number of minutes to run the executable; alternatively, the HH:MM:SS format can be used
-n        num_nodes     the number of nodes requested
--mode    mode_type     the node mode to use; mode_type can be either dual or vn; omitting this option defaults to SMP mode
-q        queue_name    the queue to submit to; queue_name is commonly prod-devel for debugging or prod for a production run
The combination of the -n and --mode options determines the total number of cores used in the run. The examples given in the Compute Node Partitions and Charging section above would be submitted to the production queue as
qsub -t 300 -n 64 -q prod name_of_executable [executable's arguments]
and
qsub -t 1200 -n 750 --mode dual -q prod name_of_executable [executable's arguments]
respectively. Note that in the second example, one does not need to specify 1024 as num_nodes; the qsub script automatically selects the 1024-node partition, as it is the smallest partition that contains the requested 750 nodes. There are also special queue names for when a user has arranged a reserved time session with ANL; these queues are given unique names which are sent to the user who made the request.
Currently, the prod-devel queue is restricted to a maximum run time of 1 hour. Also, any job wanting to use the prod queue must use a partition size of 512 nodes or greater. I'll update these queue restrictions as they become known.
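For a short debugging run, a submission to the prod-devel queue might look like the following (the flags mirror the table above; the executable name is a placeholder):
qsub -t 00:30:00 -n 64 --mode vn -q prod-devel name_of_executable [executable's arguments]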
Monitoring Jobs
To view the queued jobs, issue the qstat command. The user can change the output of the qstat command by setting the QSTAT_HEADER environment variable. I added
export QSTAT_HEADER=JobId:User:WallTime:RunTime:Nodes:Mode:State:Queue
to my .bashrc file to limit the output. Additionally, users can view how much compute time has been used from their allocation by issuing the cbank command.
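Since QSTAT_HEADER is an ordinary environment variable, it can also be set for a single invocation instead of in .bashrc; a small sketch (column names taken from the list above):
# override the qstat columns for one command only
QSTAT_HEADER=JobId:User:WallTime:State qstat
# check how much of the allocation has been used
cbank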