Running jobs
Caviness uses the Slurm workload manager, a free and open-source resource management and job scheduling system. Users are expected to submit jobs with accurate resource requirements (node and core counts, memory size limits, time limits) so that Slurm can match your work against available compute resources. Some general rules of thumb with regard to job submission:
- Do not request more resources than a job needs.
- If a job will finish in 2 hours, do not request 48 hours; this will increase wait times for everyone.
- Requesting more memory than the job will actually use needlessly penalizes your workgroup's job priority and will probably delay the job's execution.
- Requesting a large number of nodes/cores is not always the best choice; there is no substitute for benchmarking your software to learn how well it scales on Caviness. (A minimal example of a right-sized job script appears after this list.)
- Use the job script templates that IT provides. The templates in /opt/templates/slurm are heavily documented and will save you time, since they are tested by IT and have predictable behavior.
- Test your job scripts. Ensure a new job script works on one or two nodes before trying it on 20. Work out any submission issues with short (5-minute) time limits and submit against the devel partition for optimum throughput (see the test-submission example after this list).
- Don't get greedy. If your jobs have very large per-core memory requirements, don't submit many of them at once: such jobs will likely leave many cores idle on their nodes, with too little memory left to run other users' jobs.
- Avoid heavy I/O in workgroup storage. Jobs that are I/O intensive should always be run on the /lustre/scratch file system; intensive I/O on your workgroup storage will impact everyone on the cluster. (A sketch of staging such a job onto Lustre scratch appears after this list.)
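As a concrete illustration of right-sizing a request, here is a minimal job script sketch. The job name, core count, memory, time limit, and program name are hypothetical placeholders; substitute values you have determined by benchmarking your own application, and start from the IT-provided templates in /opt/templates/slurm for production work.

```bash
#!/bin/bash
#
# Request only what the job needs: one node, four cores,
# 1 GiB of memory per core, and a time limit with a modest
# margin over the expected 2-hour run time.
#SBATCH --job-name=my_analysis
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1024M
#SBATCH --time=0-02:30:00

# "my_program" is a placeholder for your own executable.
./my_program input.dat
```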
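To work out submission problems before a full-scale run, the same script can be submitted with short limits overridden on the command line against the devel partition. The script name my_job.qs is a placeholder:

```bash
# Quick functional test: one node, 5-minute limit, devel partition.
sbatch --partition=devel --time=5 --nodes=1 my_job.qs
```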
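For I/O-intensive work, one common pattern is to stage files onto Lustre scratch, run there, and copy results back when finished. The sketch below assumes a per-user directory under /lustre/scratch and a placeholder program name; adjust the paths to your workgroup's conventions.

```bash
#!/bin/bash
#SBATCH --job-name=io_heavy
#SBATCH --ntasks=1
#SBATCH --time=0-01:00:00

# Stage input to Lustre scratch and run there so the heavy I/O
# never touches workgroup storage.  The directory layout under
# /lustre/scratch and the program name are illustrative only.
SCRATCH_DIR="/lustre/scratch/${USER}/${SLURM_JOB_ID}"
mkdir -p "$SCRATCH_DIR"
cp input.dat "$SCRATCH_DIR/"

cd "$SCRATCH_DIR"
"$SLURM_SUBMIT_DIR/my_io_heavy_program" input.dat > results.out

# Copy results back to the submission directory and clean up scratch.
cp results.out "$SLURM_SUBMIT_DIR/"
rm -rf "$SCRATCH_DIR"
```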