Citizenship on Frontera

You share Frontera with hundreds of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it.
1. Do Not Run Jobs on the Login Nodes
When you connect to Frontera you share the login node with dozens of other users.
It is imperative that you do not run jobs on the login nodes. Doing so is the fastest route to account suspension.
You must avoid computationally intensive activity on login nodes. This means:
- Don't run research applications on the login nodes; this includes frameworks like MATLAB and R. If you need interactive access, please use idev or srun to schedule a compute node.
- Don't launch too many simultaneous processes: while it's fine to compile on a login node, a command like make -j 16 (which compiles on 16 cores) may impact other users.
- That script you wrote to check job status should probably do so once every few minutes rather than several times a second.
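For instance, a polling loop along these lines keeps the load negligible (the five-minute interval and the use of Slurm's squeue are just an illustration):

    # Check job status every five minutes, not several times a second.
    while true; do
        squeue -u "$USER"
        sleep 300              # five minutes between queries
    done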
Know when you're on a login node. You can use your Linux prompt, the hostname command, or other mechanisms to determine whether you're on a login node or a compute node. See Accessing the Compute Nodes for more information.
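For orientation, a sketch of what each looks like (the host names and idev options here are illustrative, not exact):

    $ hostname
    login1.frontera.tacc.utexas.edu        # a login node: no intensive work here

    $ idev -p development -N 1 -t 00:30:00 # request a compute node interactively
    ...
    $ hostname
    c123-456.frontera.tacc.utexas.edu      # a compute node: intensive work is fine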
2. Do Not Stress the Shared Lustre File Systems
This section focuses on ways to avoid causing problems on $SCRATCH and the other shared Lustre file systems. The File Systems section above gives a brief overview of these file systems. Configuring Your Account covers environment variables and aliases that help you navigate the file systems.
Run I/O intensive jobs in $SCRATCH rather than $WORK. If you stress $WORK, you affect every user on every TACC system.
Avoid opening and closing files repeatedly in tight loops. Every open/close operation on the file system requires interaction with the MetaData Service (MDS). The MDS acts as a gatekeeper for access to files on Lustre's parallel file system. Overloading the MDS will affect other users on the system. If possible, open files once at the beginning of your program/workflow, then close them at the end.
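In shell terms, the difference looks like this (compute_step is a hypothetical stand-in for whatever produces your output):

    # Stresses the MDS: each ">>" opens and closes results.txt.
    for i in $(seq 1 100000); do
        compute_step "$i" >> results.txt
    done

    # Friendlier: the redirection opens results.txt once for the whole loop.
    for i in $(seq 1 100000); do
        compute_step "$i"
    done > results.txt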
Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is probably fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size.
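One way to regroup existing files, assuming hypothetical names like frame_000001.dat:

    # Move frame_*.dat into subdirectories of at most 1000 files each,
    # keyed on the leading digits of the zero-padded frame number.
    for f in frame_*.dat; do
        n=${f#frame_}                  # e.g. 000001.dat
        mkdir -p "group_${n:0:3}"      # group_000, group_001, ...
        mv "$f" "group_${n:0:3}/"
    done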
Stripe the receiving directory before creating large files in it or transferring large files to it. See Striping Large Files for more information.
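For example, with the Lustre lfs utility (the stripe count of 8 is only an illustration; see Striping Large Files for recommended values):

    $ mkdir $SCRATCH/big_output
    $ lfs setstripe -c 8 $SCRATCH/big_output   # stripe new files across 8 OSTs
    $ lfs getstripe $SCRATCH/big_output        # verify the settings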
Don't run jobs in $HOME. The $HOME file system is for routine file management, not parallel jobs.
Don't get greedy. If you know or suspect your workflow is I/O intensive, don't submit a pile of simultaneous jobs. Writing restart/snapshot files can stress the file system; avoid doing so too frequently. Also, use a parallel I/O library like HDF5 or netCDF to generate a single restart file in parallel, rather than writing one file from each process.
Watch your file system quotas. If you're near your quota in $WORK and your job is repeatedly trying (and failing) to write to $WORK, you will stress the file system. If you're near your quota in $HOME, jobs run on any file system may fail, because all jobs write some data to the hidden $HOME/.slurm directory.
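You can inspect Lustre quotas with the lfs utility, roughly as follows:

    $ lfs quota -u $USER $WORK      # usage and limits on $WORK
    $ lfs quota -u $USER $SCRATCH   # usage and limits on $SCRATCH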
3. Limit File Transfers
To avoid stressing both internal and external networks, limit simultaneous and recursive file transfers.

Avoid too many simultaneous file transfers. You share the network bandwidth with other users; don't use more than your fair share. Two or three concurrent scp sessions are probably fine. Twenty is probably not.
Avoid recursive file transfers, especially those involving many small files. Create a tar archive before transfers. This is especially true when transferring files to or from Ranch.
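For example (the file names and destination are placeholders):

    $ tar cf results.tar results/            # one archive instead of many small files
    $ scp results.tar user@host:/dest/path   # a single transfer over the network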
4. Request Only the Resources You Need
When you submit a job to the scheduler, don't ask for more time than you really need. The scheduler will have an easier time finding a slot for the 2 hours you need than for the 24 hours you request. This means shorter queue wait times for you and everybody else.
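In a Slurm batch script, that means setting a realistic time limit; the values here are illustrative:

    # Request what the job actually needs...
    #SBATCH -t 02:00:00
    # ...rather than a padded maximum like 24:00:00.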
Test your submission scripts. Start small: make sure everything works on 2 nodes before you try 200. Work out submission bugs and kinks with 5-minute jobs that won't wait long in the queue and involve short, simple substitutes for your real workload: simple test problems; hello world codes; one-liners like ibrun hostname; or an ldd on your executable.
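A minimal test script in that spirit might look like this (queue name and node counts are illustrative):

    #!/bin/bash
    #SBATCH -J quicktest
    #SBATCH -p development   # short-test queue
    #SBATCH -N 2             # prove it works on 2 nodes before trying 200
    #SBATCH -n 112           # 56 cores per Frontera node
    #SBATCH -t 00:05:00      # 5-minute job: short queue wait
    ibrun hostname           # trivial stand-in for the real workload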
Respect memory limits and other system constraints. If your application needs more memory than is available, your job will fail and may leave nodes in unusable states. Monitor your application's needs. Execute module load remora followed by module help remora for more information on a particularly handy monitoring tool.
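Typical usage is to prefix your launch line inside a job script (my_app is a placeholder for your executable):

    module load remora
    remora ibrun ./my_app    # collects memory, I/O, and other usage data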
[Photo: TACC staff cabling Frontera]