Running Jobs on the Frontera Compute Nodes

Frontera's job scheduler is the Slurm Workload Manager. Slurm commands enable you to submit, manage, monitor, and control your jobs. Jobs submitted to the scheduler are queued, then run on the compute nodes. Each job consumes Service Units (SUs) which are then charged to your allocation.

Job Accounting

Frontera's accounting system is based on node-hours: one unadjusted Service Unit (SU) represents a single compute node used for one hour (a node-hour). We then multiply by a charge rate that reflects supply and demand for the particular queue or the type of node you use. For any given job, the total cost in SUs is:

SUs billed (node-hrs) = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)

For example, a job that runs in the normal queue for two hours using four nodes will cost 8 SUs:

4 nodes x 2 hours x 1.0 = 8 SUs

The system tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you actually use, not those you request. In general, your queue wait time will be less if you request only the time you need: the scheduler will have an easier time finding a slot for the 2 hours you really need than for the 24 hours you request in your job script.
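As a rough illustration of the formula above, the following sketch computes a job's charge from its node count, elapsed wall clock hours, and charge rate. The su_cost function is a hypothetical helper written for this example, not a TACC utility, and the rate passed in is simply the normal-queue figure quoted above.

```shell
# su_cost NODES HOURS RATE
# Hypothetical helper: SUs = nodes x wall clock hours x charge rate.
su_cost() {
    awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}

su_cost 4 2 1.0     # four nodes for two hours in the normal queue: 8.00
su_cost 4 0.5 1.0   # same nodes, but the job finished after 30 minutes: 2.00
```

The second call reflects the point that the charge follows the time the job actually ran, not the time requested.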

Principal Investigators can monitor allocation usage via the TACC User Portal under "Allocations->Projects and Allocations". Be aware that the figures shown on the portal may lag behind the most recent usage. Projects and allocation balances are also displayed upon command-line login.

To display a summary of your TACC project balances and disk quotas at any time, execute:
login1$ /usr/local/etc/taccinfo    # more current than balances displayed on the portal.

Requesting Resources

Be sure to request computing resources (e.g., number of nodes, number of tasks per node, maximum time per job) that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single core on a single node, and will only see that node's memory.
  • A threaded program (e.g. one that uses OpenMP) employs a shared memory programming model and is also restricted to a single node, but the program's individual threads can run on multiple cores on that node.
  • An MPI (Message Passing Interface) program can exploit the distributed computing power of multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network. The tasks on a given node, however, can only directly access the memory on that node. Depending on the program's memory requirements, it may not be possible to run a task on every core of every node assigned to your job. If it appears that your MPI job is running out of memory, try launching it with fewer tasks per node to increase the amount of memory available to individual tasks.
  • A popular type of parameter sweep (sometimes called high throughput computing) involves submitting a job that simultaneously runs many copies of one serial or threaded application, each with its own input parameters ("Single Program Multiple Data", or SPMD). The launcher tool is designed to make it easy to submit this type of job. For more information:
$ module load launcher
$ module help launcher
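To make the "fewer tasks per node" advice for MPI jobs concrete, the sketch below estimates the memory available to each task when a node's memory is shared evenly. The mem_per_task function is hypothetical, and the 192 GB node size is an assumption used only for illustration; the 56 cores per node follows from the core counts in the queue table of this guide. The result is an upper bound, since the operating system also needs memory.

```shell
# mem_per_task NODE_MEM_GB TASKS_PER_NODE
# Hypothetical helper: even split of one node's memory among its tasks.
mem_per_task() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", m / t }'
}

mem_per_task 192 56   # one task per core: about 3.4 GB per task
mem_per_task 192 28   # half as many tasks per node: about 6.9 GB per task
```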

Frontera Production Queues

Frontera's Slurm partitions (queues), maximum node limits and charge rates are summarized in the table below. Queues and limits are subject to change without notice. Execute qlimits on Frontera for real-time information regarding limits on available queues. See Job Accounting to learn how jobs are charged to your allocation.

Frontera's new flex queue offers a low-cost option for lower-priority or low node-count jobs and for jobs running software with checkpointing capabilities. Jobs in the flex queue are scheduled with lower priority and become eligible for preemption after running for one hour. That is, if jobs in other queues are waiting for nodes while flex jobs are running, the Slurm scheduler will cancel any flex job that has run for more than one hour in order to return its resources to the higher priority jobs. Any job started in the flex queue is guaranteed to run for at least one hour (assuming the requested wall clock time was at least one hour). If no requests from other queues are outstanding, flex jobs continue to run until they reach their requested wall clock time. This flexibility in runtime is rewarded by a reduced charge rate of 0.8 SUs per node-hour. The maximum total node count for one user with many jobs in the flex queue is 6400 nodes.

Table 5. Frontera Production Queues

Queue status as of November 11, 2019.

Users are limited to a maximum of 50 running and 200 pending jobs in all queues at one time.

Queue Name | Max Nodes per Job (assoc'd cores) | Pre-empt Exempt Time | Max Job Duration | Max Jobs | Max Nodes | Charge Rate per node-hour
--- | --- | --- | --- | --- | --- | ---
flex* | 128 nodes (7,168 cores) | 1 hour | 48 hrs | 50 jobs | 6400 nodes | 0.8 SU
development | 40 nodes (2,240 cores) | N/A | 2 hrs | 1 job | 40 nodes | 1 SU
normal | 512 nodes (28,672 cores) | N/A | 48 hrs | 50 jobs | 1024 nodes | 1 SU
large** | 513-2048 nodes (114,688 cores) | N/A | 48 hrs | 5 jobs | 2048 nodes | 1 SU

* Jobs in the flex queue are charged less than jobs in other queues but are eligible for preemption after running for more than one hour.

** Access to the large queue is restricted. To request more nodes than are available in the normal queue, submit a consulting (help desk) ticket through the TACC User Portal. Include in your request reasonable evidence of your readiness to run under the conditions you're requesting. In most cases this should include your own strong or weak scaling results from Frontera.
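The per-job limits in Table 5 can be checked before submitting. The sketch below is a hypothetical helper, not a TACC tool: it hard-codes the node and duration limits from the table as of the date shown, so treat its answers as a convenience and rely on qlimits for current values. It accepts whole hours only.

```shell
# check_request QUEUE NODES HOURS
# Hypothetical helper; limits copied from Table 5 and possibly out of date.
check_request() {
    local queue=$1 nodes=$2 hours=$3 min=1 max maxhrs
    case "$queue" in
        flex)        max=128;  maxhrs=48 ;;
        development) max=40;   maxhrs=2  ;;
        normal)      max=512;  maxhrs=48 ;;
        large)       min=513;  max=2048; maxhrs=48 ;;
        *) echo "unknown queue: $queue"; return 2 ;;
    esac
    if [ "$nodes" -lt "$min" ] || [ "$nodes" -gt "$max" ]; then
        echo "$queue accepts $min-$max nodes per job, not $nodes"
        return 1
    fi
    if [ "$hours" -gt "$maxhrs" ]; then
        echo "$queue allows at most $maxhrs hours, not $hours"
        return 1
    fi
    echo "ok"
}

check_request normal 4 2                     # prints: ok
check_request development 64 1 || echo "adjust the request before submitting"
```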

Accessing the Compute Nodes

The login nodes are shared resources: at any given time, there are many users logged into each of these login nodes, each preparing to access the "back-end" compute nodes (Figure 2. Login and Compute Nodes). What you do on the login nodes affects other users directly because you are competing for the same resources: memory and processing power. This is the reason you should not run your applications on the login nodes or otherwise abuse them. Think of the login nodes as a prep area where you can manage files and compile code before accessing the compute nodes to perform research computations. See Good Citizenship for more information.

Figure 2. Login and Compute Nodes


You can use your command-line prompt, or the hostname command, to discern whether you are on a login node or a compute node. The default prompt, or any custom prompt containing \h, displays the short form of the hostname (e.g. c401-064). The hostname for a Frontera login node begins with the string login (e.g. login2.frontera.tacc.utexas.edu), while compute node hostnames begin with the character c (e.g. c401-064.frontera.tacc.utexas.edu).
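The hostname convention just described can be tested in a script. The following is a minimal sketch; node_kind is a hypothetical function name, and matching compute nodes as "c followed by a digit" is a narrowing assumption based on the example hostnames rather than a documented rule.

```shell
# node_kind HOSTNAME
# Hypothetical helper: classify a Frontera hostname by its prefix.
node_kind() {
    case "$1" in
        login*)  echo "login" ;;
        c[0-9]*) echo "compute" ;;
        *)       echo "unknown" ;;
    esac
}

node_kind "$(hostname)"                      # classify the current machine
node_kind login2.frontera.tacc.utexas.edu    # login
node_kind c401-064                           # compute
```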

While some workflows, tools, and applications hide the details, there are three basic ways to access the compute nodes:

  1. Submit a batch job using the sbatch command. This directs the scheduler to run the job unattended when there are resources available. Until your batch job begins it will wait in a queue. You do not need to remain connected while the job is waiting or executing. Note that the scheduler does not start jobs on a first come, first served basis; it juggles many variables to keep the machine busy while balancing the competing needs of all users. The best way to minimize wait time is to request only the resources you really need: the scheduler will have an easier time finding a slot for the two hours you need than for the 24 hours you unnecessarily request.

  2. Begin an interactive session using ssh to connect to a compute node on which you are already running a job. This is a good way to open a second window into a node so that you can monitor a job while it runs.

  3. Begin an interactive session using idev or srun. This will log you into a compute node and give you a command prompt there, where you can issue commands and run code as if you were doing so on your personal machine. An interactive session is a great way to develop, test, and debug code. Both the srun and idev commands submit a new batch job on your behalf, providing interactive access once the job starts. You will need to remain logged in until the interactive session begins.

Submitting Batch Jobs with sbatch

Use Slurm's sbatch command to submit a batch job to one of the Frontera queues:

login1$ sbatch myjobscript

Here myjobscript is the name of a text file containing #SBATCH directives and shell commands that describe the particulars of the job you are submitting. The details of your job script's contents depend on the type of job you intend to run.

In each job script:

  1. use #SBATCH directives to request computing resources (e.g. 10 nodes for 2 hrs);
  2. then, list shell commands to specify what work you're going to do once your job begins.

There are many possibilities: you might elect to launch a single application, or you might want to accomplish several steps in a workflow. You may even choose to launch more than one application at the same time. The details will vary, but your own job script will probably include at least one launch line that is a variation of one of the examples described here.

See the customizable job script examples.

Your job will run in the environment it inherits at submission time; this environment includes the modules you have loaded and the current working directory. In most cases you should run your application(s) after loading the same modules that you used to build them. You can of course use your job submission script to modify this environment by defining new environment variables; changing the values of existing environment variables; loading or unloading modules; changing directory; or specifying relative or absolute paths to files. Do not use the Slurm --export option to manage your job's environment: doing so can interfere with the way the system propagates the inherited environment.
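A minimal sketch of such a job script follows, written out with a here-document so it can be pasted at a prompt. The file name myjobscript follows the text above; the job name, the output file name, and the program ./myprogram are placeholders invented for illustration, and the resource numbers mirror the "10 nodes for 2 hrs" example. The launch line uses ibrun, TACC's MPI launcher, which this section does not otherwise cover.

```shell
cat > myjobscript <<'EOF'
#!/bin/bash
#SBATCH -J myjob               # job name (placeholder)
#SBATCH -p normal              # queue (partition)
#SBATCH -N 10                  # total nodes
#SBATCH --ntasks-per-node=56   # MPI tasks per node (assumes one per core)
#SBATCH -t 02:00:00            # wall clock limit, hh:mm:ss
#SBATCH -o myjob.o%j           # stdout file; %j expands to the job ID

# Shell commands begin here, after every #SBATCH directive.
module list                    # record the modules inherited at submission
pwd                            # record the inherited working directory
ibrun ./myprogram              # placeholder launch line
EOF

# On a Frontera login node you would then run:  sbatch myjobscript
```

Because the script inherits the submission environment, the module list and pwd lines simply record what the job started with; replace them with module load or cd commands if the job needs something different.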

The Common sbatch Options table below describes some of the most common sbatch command options. Slurm directives begin with #SBATCH; most have a short form (e.g. -N) and a long form (e.g. --nodes). You can pass options to sbatch using either the command line or job script; most users find that the job script is the easier approach. The first line of your job script must specify the interpreter that will parse non-Slurm commands; in most cases #!/bin/bash or #!/bin/csh is the right choice. Avoid #!/bin/sh (its startup behavior can lead to subtle problems on Frontera), and do not include comments or any other characters on this first line. All #SBATCH directives must precede all shell commands. Note also that certain #SBATCH options or combinations of options are mandatory, while others are not available on Frontera.

Table 6. Common sbatch Options

Option | Argument | Comments
--- | --- | ---
-p | queue_name | Submits to queue (partition) designated by queue_name
-J | job_name | Job name
-N | total_nodes | Required. Define the resources you need by specifying either: (1) "-N" and "-n"; or (2) "-N" and "--ntasks-per-node".
-n | total_tasks | This is total MPI tasks in this job. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N".
--ntasks-per-node or --tasks-per-node | tasks_per_node | This is MPI tasks per node. See "-N" above for a good way to use this option. When using this option in a non-MPI job, it is usually best to set --ntasks-per-node to 1.
-t | hh:mm:ss | Required. Wall clock time for job.
--mail-user= | email_address | Specify the email address to use for notifications.
--mail-type= | begin, end, fail, or all | Specify when user notifications are to be sent (one option per line).
-o | output_file | Direct job standard output to output_file (without the -e option, error output also goes to this file)
-e | error_file | Direct job error output to error_file
--dependency= | jobid | Specifies a dependency: this run will start only after the specified job (jobid) successfully finishes
-A | projectnumber | Charge job to the specified project/allocation number. This option is only necessary for logins associated with multiple projects.
-a or --array | N/A | Not available. Use the launcher module for parameter sweeps and other collections of related serial jobs.
--mem | N/A | Not available. If you attempt to use this option, the scheduler will not accept your job.
--export= | N/A | Avoid this option on Frontera. Using it is rarely necessary and can interfere with the way the system propagates your environment.

By default, Slurm writes all console output to a file named slurm-%j.out, where %j is the numerical job ID. To specify a different filename use the -o option. To save stdout (standard out) and stderr (standard error) to separate files, specify both -o and -e.
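Several of the job-script rules in this section (a clean interpreter line, directives before commands, options to avoid) can be checked mechanically before submitting. The sketch below is a hypothetical pre-submission check, not a Slurm or TACC tool; it looks only for the rules named in this section and reports the first problem it finds.

```shell
# check_jobscript FILE
# Hypothetical helper: prints "ok" or the first rule violation found.
check_jobscript() {
    awk '
        function fail(msg) { print "line " NR ": " msg; bad = 1; exit }
        NR == 1 {
            if ($0 !~ /^#!\//)     fail("first line must name an interpreter such as #!/bin/bash")
            if ($0 ~ /[ \t]/)      fail("no comments or other characters after the interpreter")
            if ($0 == "#!/bin/sh") fail("avoid #!/bin/sh on Frontera")
            next
        }
        /^#SBATCH/ {
            if (seen_cmd)          fail("#SBATCH directive appears after a shell command")
            if ($0 ~ /--(mem|array|export)/ || $2 == "-a")
                                   fail("option is unavailable or discouraged on Frontera")
            next
        }
        /^[ \t]*(#|$)/ { next }    # ordinary comments and blank lines
        { seen_cmd = 1 }           # anything else counts as a shell command
        END { if (!bad) print "ok"; exit bad }
    ' "$1"
}

# A tiny sample script to check (contents are illustrative only).
printf '%s\n' '#!/bin/bash' '#SBATCH -N 1' '#SBATCH -t 00:10:00' 'hostname' > good.job
check_jobscript good.job      # prints: ok
```

The function returns a non-zero status when it reports a problem, so it can guard a submission, for example check_jobscript myjobscript && sbatch myjobscript.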

Interactive Sessions with idev and srun

TACC's own idev utility is the best way to begin an interactive session on one or more compute nodes. idev submits a batch script requesting access to a compute node. Once the scheduler allocates a compute node, you are then automatically ssh'd to that node where you can begin any compute-intensive jobs.

To launch a thirty-minute session on a single node in the development queue, simply execute:

login1$ idev

You'll then see output that includes the following excerpts:

...
-----------------------------------------------------------------
        Welcome to the Frontera Supercomputer          
-----------------------------------------------------------------
...

-> After your idev job begins to run, a command prompt will appear,
-> and you can begin your interactive development session. 
-> We will report the job status every 4 seconds: (PD=pending, R=running).

->job status:  PD
->job status:  PD
...
c123-456$

The job status messages indicate that your interactive session is waiting in the queue. When your session begins, you'll see a command prompt on a compute node (in this case, the node with hostname c123-456). If this is the first time you launch idev, you may be prompted to choose a default project and a default number of tasks per node for future idev sessions.

For command-line options and other information, execute idev --help. It's easy to tailor your submission request (e.g. shorter or longer duration) using Slurm-like syntax:

login1$ idev -p normal -N 2 -n 8 -m 150 # normal queue, 2 nodes, 8 total tasks, 150 minutes

You can also launch an interactive session with Slurm's srun command, though there's no clear reason to prefer srun to idev. A typical launch line would look like this:

login1$ srun --pty -N 2 -n 8 -t 2:30:00 -p normal /bin/bash -l # same conditions as above

Consult the idev documentation for further details.
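Note that the two launch lines above express the same duration differently: idev's -m takes minutes, while the srun example gives -t in hours, minutes, and seconds. The small sketch below converts one into the other; minutes_to_hms is a hypothetical helper name.

```shell
# minutes_to_hms MINUTES
# Hypothetical helper: whole minutes to the hh:mm:ss form used with -t.
minutes_to_hms() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

minutes_to_hms 150   # 02:30:00, matching the srun example above
minutes_to_hms 30    # 00:30:00, the default idev session length
```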

Interactive Sessions using SSH

If you have a batch job or interactive session running on a compute node, you "own the node": you can connect via ssh to open a new interactive session on that node. This is an especially convenient way to monitor your applications' progress. One particularly helpful example: log in to a compute node that you own, execute top, then press the "1" key to see a display that allows you to monitor thread ("CPU") and memory use.

There are many ways to determine the nodes on which you are running a job, including feedback messages following your sbatch submission, the compute node command prompt in an idev session, and the squeue or showq utilities. The sequence of identifying your compute node then connecting to it would look like this:

login1$ squeue -u bjones
 JOBID       PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811     development idv46796   bjones  R       0:39      1 c448-004
login1$ ssh c448-004
...
c448-004$
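The lookup step can also be scripted. The sketch below pulls the node name out of squeue-style output for jobs in the running (R) state; running_nodes is a hypothetical helper, and it assumes the default column layout shown in the listing above, with the state in the fifth column and the node list last.

```shell
# running_nodes: read squeue-style output on stdin and print the
# NODELIST field of every running job. Hypothetical helper.
running_nodes() {
    awk 'NR > 1 && $5 == "R" { print $NF }'
}

# Sample text copied from the listing above. On Frontera you would pipe
# in the real output instead:  squeue -u "$USER" | running_nodes
running_nodes <<'EOF'
 JOBID       PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
858811     development idv46796   bjones  R       0:39      1 c448-004
EOF
```

For a multi-node job the last field is a compressed range rather than a single hostname, so it would need expanding (for example with scontrol show hostnames) before being passed to ssh.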

Slurm Environment Variables

Be sure to distinguish between internal Slurm replacement symbols (e.g. %j described above) and Linux environment variables defined by Slurm (e.g. SLURM_JOBID). Execute env | grep SLURM from within your job script to see the full list of Slurm environment variables and their values. You can use Slurm replacement symbols like %j only to construct a Slurm filename pattern; they are not meaningful to your Linux shell. Conversely, you can use Slurm environment variables in the shell portion of your job script but not in an #SBATCH directive. For example, the following directive will not work the way you might think:

#SBATCH -o myMPI.o${SLURM_JOB_ID}   # incorrect

Instead, use the following directive:

#SBATCH -o myMPI.o%j     # "%j" expands to your job's numerical job ID

Similarly, you cannot use paths like $WORK or $SCRATCH in an #SBATCH directive.
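By contrast, both kinds of variable work normally in the shell portion of a job script. The sketch below builds a per-job directory name from $SCRATCH and $SLURM_JOB_ID; run_dir is a hypothetical helper, and the fallback values exist only so the example also runs outside a Slurm job.

```shell
# run_dir: per-job directory name built from environment variables.
# Hypothetical helper; the ":-" fallbacks apply only outside a Slurm job.
run_dir() {
    printf '%s/run_%s\n' "${SCRATCH:-$PWD}" "${SLURM_JOB_ID:-nojob}"
}

# In the shell portion of a job script, after all #SBATCH directives:
echo "results will go to $(run_dir)"
```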

For more information on this and other matters related to Slurm job submission, see the Slurm online documentation; the man pages for both Slurm itself (man slurm) and its individual commands (e.g. man sbatch); as well as numerous other online resources.