Local documentation
This document describes the use of NQS on the Cray. Other documentation on the Cray system is available from UWO ITS.
Introduction
The batch system is called the Network Queueing System, or NQS. On the Cray, you can find detailed information about NQS in the Network Queueing System (NQS) User's Guide, which is available in printed format and in docview. See Cray Documentation for details on how to access documentation on the Cray. You can also get information via the man pages, e.g. man qsub.
To run a job in the batch system, you prepare a script file, as described in Making a Script File. You then submit this file to NQS with the qsub command, using the options described in the sections below. You can check the status of your job with the qstat command, described in the section The qstat Command.
Making a Script File
A script file contains the commands needed to execute your program. The following is a simple script file that executes the program doit, which resides in your subdirectory progs:
cd progs
./doit
Note that you have to cd to the right directory, as the job will start in your home directory. If the name of the script file is myjob, you submit it with the qsub command:
[12:42pm hpc] qsub myjob
nqs-181 qsub: INFO
Request <3116.hpc.uwo.ca>: Submitted to queue <small_short> by <gerard>.
This job will produce two files, with names like myjob.o1234 and myjob.e1234, where 1234 is some number. The .o1234 file will contain the output of your job, as would normally be written to standard output, i.e. your screen when running the job interactively. The .e1234 file will contain the error messages generated by your job, as written to standard error, which interactively is also your screen.
Creating a Log-File
Continuing with the example above, it would be preferable to do the following regarding the output:
- Write the output to a file with a name of your choosing
- Write standard output and standard error to the same file
- Write this file as the job progresses, so that you can look at it even before the job has finished.
You accomplish this as follows:
#QSUB -o myjob.log -eo -re -ro
cd progs
./doit
Any line starting with #QSUB is taken to contain directives for NQS. An option has the form -xxx, possibly followed by a value.
The meaning of the options above is as follows:
-o myjob.log Write the output to the file you specify, in this case myjob.log. If you do not specify a directory, the file will end up in the directory you were in when you gave the qsub command.
-eo Combine standard output and standard error into one file.
-re Write standard error as the job progresses.
-ro Write standard output as the job progresses.
On which Queue Does the Job Run?
You do not explicitly state on which queue a job should run: this is determined by the resources, namely CPU time and memory, that you request.
There are three memory sizes (small, medium and large) and three CPU-time spans (short, regular and long), defined as follows (note that on the Cray we use megawords for memory):

Memory (mw = megaword, where one word is 8 bytes):
    small      8mw
    medium     64mw
    large      128mw

CPU-time:
    short      3600 sec (1 hour)
    regular    28800 sec (8 hours)
    long       unlimited
If you do not specify any resources, the job ends up in small_short, which means it will be aborted if it takes more than 300 CPU-seconds. It is therefore a good idea to specify resource requirements:
#QSUB -o myjob.log -eo -re -ro
#QSUB -lT 350 -lM 20mw
cd progs
./doit
This sets the CPU-time limit to 350 seconds and the memory limit to 20 megawords on the Cray (if you do not specify mw as the unit, bytes will be used).
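Since one Cray word is 8 bytes, 20mw corresponds to roughly 20 x 8 = 160 megabytes. As noted above, a limit given without the mw suffix is taken to be in bytes, so (assuming here that one megaword is 2^20 words; check the NQS User's Guide for the exact units accepted) the same request could also be written as:

#QSUB -lT 350 -lM 167772160

The mw form is obviously easier to read and less error-prone.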
How Many Jobs will Run at the Same Time?
There are limits to the number of jobs that can run concurrently in the batch system. These limits are based on the available memory (256mw on the Cray) and the available swap space. If the sum of the memory used by all running jobs is greater than the available memory, one or more jobs have to be swapped out.
One should remember that the Cray is not a virtual-memory machine, so the option of paging out just parts of a job does not exist on that machine: the whole job has to be swapped out. This is obviously an I/O-bound process, and too much swapping will seriously degrade system performance. The limits on the number of concurrently running batch jobs ensure that this "oversubscription" is kept to an acceptable level.
Another consideration is that long jobs should not be able to exclude short jobs from running. In other words, the total number of slots for long jobs should be less than the total for short jobs. This will ensure that one or more short jobs will always be able to run, thus providing a continuous throughput of short jobs.
To achieve this, the total number of jobs that can run in each of the memory categories (small, medium, large) has been set as follows:
    small      10 jobs total
    medium      4 jobs total
    large       3 jobs total
On the Cray this means that the total amount of memory required will be at most 10*8mw + 4*64mw + 3*128mw = 720mw, which should be an acceptable oversubscription rate given our swap space.
To make sure that one or more short jobs can always run, the limits for each individual queue have been set as follows:
                short   regular   long   total
    small         6        3        2      10
    medium        3        2        1       4
    large         2        1        1       3
Notice that in each row regular + long is less than the category total, so there is always room left for short jobs.
This can be exemplified as follows. If there are 6 jobs running in small_short, 3 in small_regular and one in small_long, then another job submitted to small_long will not start until one of the already running jobs has finished, even though the run limit of small_long has not been exceeded. Additionally, at most 5 non-short jobs can run in small, so that 5 small_short jobs can always run. In medium and large, at least one short job can always run.
The qstat Command
You can check the status of the queues in general and of your jobs in particular with the qstat command. You check the general status of the batch queues as follows:
[12:05pm hpc] qstat -b
---------------------------
NQS 1.1 BATCH QUEUE SUMMARY
---------------------------
QUEUE NAME LIM TOT ENA STS QUE RUN WAI HLD ARR EXI
----------------------- --- --- --- --- --- --- --- --- --- ---
small_short 6 0 yes on 0 0 0 0 0 0
small_regular 3 0 yes on 0 0 0 0 0 0
small_long 2 0 yes on 0 0 0 0 0 0
medium_short 3 0 yes on 0 0 0 0 0 0
medium_regular 2 0 yes on 0 0 0 0 0 0
medium_long 1 1 yes on 0 1 0 0 0 0
large_short 2 0 yes on 0 0 0 0 0 0
large_regular 1 0 yes on 0 0 0 0 0 0
large_long 1 0 yes on 0 0 0 0 0 0
dedicated 1 0 no off 0 0 0 0 0 0
----------------------- --- --- --- --- --- --- --- --- --- ---
hpc.uwo.ca 20 1 0 1 0 0 0 0
----------------------- --- --- --- --- --- --- --- --- --- ---
In this example, there is only one batch job running, in medium_long. For a summary of the status of your jobs, do:
[12:42pm hpc] qstat -a
-----------------------------
NQS 1.1 BATCH REQUEST SUMMARY
-----------------------------
IDENTIFIER NAME USER QUEUE JID PRTY REQMEM REQTIM ST
------------- ------- -------- --------------------- ---- ---- ------ ------ ---
3116.hpc.uwo.ca myjob gerard small_short@hpc.uwo.ca 27398 20 222 291
To get an overwhelming amount of detail about your jobs, use qstat -af:
[12:53pm hpc] qstat -af
-----------------------------------
NQS 1.1 BATCH REQUEST: q.hpc.uwo.ca Status: RUNNING
----------------------------------- 2 Processes Active
NQS Identifier: 3120.hpc.uwo.ca User: gerard
Group: cast
Account:
Priority: ---
URM Priority Increment: 1
Job Identifier: 27414 Nice Value: 20
Created: Fri Jul 12 1996 Queued: Fri Jul 12 1996
12:53:36 EDT 12:53:38 EDT
Name: small_short@hpc.uwo.ca Priority: 60
PROCESS LIMIT REQUEST LIMIT REQUEST USED
 CPU Time Limit                 <300sec>       <20sec>       11sec
 Memory Size                      <8mw>         <8mw>        498kw
[plus much more]
If you just want to see how much CPU time has been used, do qstat -af | grep CPU.
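With the qstat -af output shown above, that pipeline would print just the CPU line, something like:

[12:53pm hpc] qstat -af | grep CPU
 CPU Time Limit                 <300sec>       <20sec>       11sec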
Checkpointing
A checkpoint is an image of a job that reflects its current state. If the batch queues (or the whole system) are shut down while jobs are running, checkpoints of all running jobs are taken. This allows the jobs to be restarted from where they left off when things come up again.
Even if the system crashes, you should never lose more than 30 minutes of CPU time. If the system crashes and there is no checkpoint yet, e.g. because the job has run for less than 30 CPU-minutes, the job is restarted from the beginning when things come up again.
You can disable checkpointing with the #QSUB -nc option, and you can disable automatic restart with the #QSUB -nr option.
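For example, a job that should run without checkpointing and should not be restarted automatically could use a script like this (a minimal sketch based on the earlier example):

#QSUB -o myjob.log -eo -re -ro
#QSUB -nc -nr
cd progs
./doit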
You can change the checkpoint interval with the qalter command:
[3:55pm hpc] qalter -c c=15 3520
nqs-2700 qalter: INFO
Request <3520.hpc.uwo.ca>: Altered by <gerard>.
Here 3520 is the request id. It shows up in qstat -a as 3520.hpc.uwo.ca (it is also displayed when you qsub the job). -c c=15 means checkpoint every 15 CPU-minutes. If you had specified -c w=15, the job would have been checkpointed every 15 wall-clock (i.e. real, elapsed) minutes.
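For example, to switch the same job to wall-clock-based checkpointing every 15 minutes:

[3:55pm hpc] qalter -c w=15 3520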
You can force a checkpoint with the command qchkpnt. This command is given at the shell level, i.e. not in your Fortran or C program. There is also a routine chkpnt which you can call from inside a program. On the Cray you can do man 2 chkpnt for details. This man page also gives details about the circumstances under which a job can or cannot be restarted from a checkpoint.
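As an illustration, qchkpnt can be used inside a batch script to force a checkpoint at a convenient point, e.g. between two phases of a job (phase1 and phase2 are hypothetical programs here):

#QSUB -o myjob.log -eo -re -ro
cd progs
./phase1
# force a checkpoint now, so a restart does not redo phase1
qchkpnt
./phase2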
Removing a Job from the Queues
You remove a job from the queues with the command qdel:
[12:50pm hpc] qdel 3618
This will remove the job if it is not already running. If it is running, you get a warning message:
nqs-462 qdel: WARNING
Request <3618>: is running on local host.
In that case you have to send a signal to kill the job. Usually a HUP signal will do; in emergencies you may have to use signal 9 (but try HUP first):
[12:50pm hpc] qdel -s HUP 3618
nqs-464 qdel: INFO
Request <3618>: has been signalled at local host.
or
[12:50pm hpc] qdel -s 9 3618
nqs-98 qdel: INFO
Request <3618.hpc.uwo.ca>: Deleted by <gerard>.
This document is originally from The University of Western Ontario.