This page collects some of the information I have been able to glean from the ALCF websites regarding Intrepid. If you find any other useful information, please pass it my way so that I can keep this page up to date.

Intrepid Details
Filesystem and Storage
Compute Node Partitions and Charging
Submitting Jobs
Monitoring Jobs

Intrepid Details

Argonne National Lab's Intrepid supercomputer is an IBM BG/P system with 40,960 quad-core compute nodes. Each compute node has 2GB of memory shared among its four 850MHz PowerPC 450 processors. How the memory is divided depends on the node mode in which the job is executed:

1) SMP mode allows a single task per node with 4 threads, with 2GB of memory per task.
2) DUAL mode allows two tasks per node with 2 threads each (4 threads total), with 1GB of memory per task.
3) VN (Virtual Node) mode allows four tasks per node with 1 thread each (4 threads total), with 512MB of memory per task.

The Blue Gene login nodes run Linux, while the compute nodes run a very lightweight kernel called the Compute Node Kernel (CNK), which supports only a limited set of Linux-like system calls. Each time a partition is selected for a job it is rebooted and the CNK is reloaded. Because of this disparity in kernels, code intended to run on the compute nodes must be built with a cross-compiler on the login nodes. The cross-compilers can be recognized by the bg at the beginning of their names, and there are MPI wrappers for all of them.
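
As a quick sketch (the compiler and wrapper names here are my assumption based on typical BG/P installations; check the ALCF documentation for the exact names), building code for the compute nodes from a login node might look like:

bgxlc -O3 -o serial.exe serial.c      # serial code, IBM XL cross-compiler
mpixlc -O3 -o my_app.exe my_app.c     # MPI code, using the corresponding MPI wrapper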

Filesystem and Storage

The /gpfs1/user_name/ directory is currently what I have been using as the equivalent of the $SCRATCH directory on other systems; I have yet to find any information about quotas on Intrepid. There is also a /scratch directory where users may create their own user_name directory for storing files. Note that this directory (as far as I can tell) lives on the login nodes, so executables cannot be run from it.
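
For example (your_user_name and the my_run subdirectory are just placeholders), setting up both locations from a login node might look like:

mkdir -p /gpfs1/your_user_name/my_run    # working directory on GPFS; launch jobs from here
mkdir -p /scratch/your_user_name         # personal directory under the login-node /scratch; do not launch executables from here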

Compute Node Partitions and Charging

The compute nodes on Intrepid are divided into partitions, the smallest of which contains 64 compute nodes (i.e. 256 cores). All larger partitions are power-of-two multiples of this smallest size. When users submit a job they do not need to specify which partition to run on; the submission script determines the partition size from the specified number of processes and the specified mode (SMP, DUAL, or VN). This means that if you wanted to run on only 30 cores in VN mode (i.e. 7.5 compute nodes' worth), the smallest partition you could use is the 64-node partition, and you would be charged for all 64 nodes. Additionally, partitions cannot be combined: if a user needs a total of 768 VN cores (192 nodes), the system will not combine a 64-node partition with a 128-node partition. Instead, the smallest single partition containing the required number of nodes is selected, in this case a 256-node partition, and the user is charged for the 64 unused nodes as well. As an example, the 30-core job mentioned above, running for 5 hours, would cost:

64(nodes) * 4(cores/node) * 5(hours) = 1,280 core hours.

As another example, suppose a user needs to run a 1500-process job for 20 hours, where each process requires 900MB of RAM. The memory constraint implies that the user should run in DUAL node mode (2 processes per node, 1GB per process), for a total of 750 nodes. Since 750 falls between the 512-node and 1024-node partition sizes, the smallest usable partition contains 1024 nodes. The user would therefore be charged:

1024(nodes) * 4(cores/node) * 20(hours) = 81,920 core hours.

A key point of these examples is that the core-hour cost of a job is always (# nodes in the charged partition) * 4 * (# hours), and that memory requirements constrain the number of nodes the job must run on.
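
As a minimal shell sketch of this formula (illustrative only; the first argument must be the size of the partition actually allocated, not the number of nodes you use):

core_hours() { echo $(( $1 * 4 * $2 )); }    # usage: core_hours <partition_nodes> <hours>
core_hours 64 5        # 1280, the 30-core VN example above
core_hours 1024 20     # 81920, the 1500-process DUAL example above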

Submitting Jobs

Currently, there is no batch-script submission system on Intrepid - all job submissions occur through a Python command-line script called qsub. The typical calling sequence is:

qsub -t time -n num_nodes [more options] name_of_executable [executable's arguments]

The following table lists a few of the important arguments to this script:

option   argument    description
-t       time        specifies the number of minutes to run the executable;
                     alternatively, the HH:MM:SS format may be used
-n       num_nodes   specifies the number of nodes requested
--mode   mode_type   specifies the node mode to use; mode_type can be either
                     dual or vn, and omitting this option defaults to SMP mode
-q       queue_name  specifies which queue to use; queue_name is commonly
                     prod-devel for debugging or prod for a production run

The combination of the -n and --mode options determines the total number of cores used in the run. The examples given in the Compute Node Partitions and Charging section above would be submitted to the production queue as

qsub -t 300 -n 64 -q prod name_of_executable [executable's arguments]

and

qsub -t 1200 -n 750 --mode vn -q prod name_of_executable [executable's arguments]

respectively. Note that in the second example, one does not need to specify 1024 as num_nodes; the qsub script automatically selects the 1024-node partition, as it is the smallest partition that can hold the 750 nodes. There are also special queue names for use when a user has arranged a reserved time session with ANL; these queues are given unique names, which are sent to the user who made the request.

Currently, the prod-devel queue is restricted to a maximum run time of 1 hour. Also, any job wanting to use the prod queue must use a partition size of 512 nodes or greater. I'll update these queue restrictions as more information becomes available.
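
For example, a one-hour debugging run on the smallest (64-node) partition in VN mode could be submitted as

qsub -t 60 -n 64 -q prod-devel --mode vn name_of_executable [executable's arguments]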

Monitoring Jobs

To view the queued jobs, issue the qstat command. The output of qstat can be customized by setting the QSTAT_HEADER environment variable; I added

export QSTAT_HEADER=JobId:User:WallTime:RunTime:Nodes:Mode:State:Queue

to my .bashrc file to limit the output. Additionally, users can check how much time has been charged against their allocation by issuing the cbank command.
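
For example, to see only your own jobs (filtering on the User column included in the header above) and then check the allocation balance:

qstat | grep your_user_name
cbank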