====== Getting work done:  what is job scheduling? ======

The idea of //work// on a computer system means entering data (input), running programs that make use of the data, and analyzing the data produced (output).  The term //programs// refers to two kinds of entities on Linux systems:

  * binary executable:  source code (Fortran, C/C++, etc.) is compiled and linked to produce low-level machine code that is executed directly on the computer's CPU(s)
  * script:  source code is read and //interpreted// on-the-fly into low-level machine code that is executed on the computer's CPU(s)

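Both routes can be illustrated with a minimal sketch; the file names (hello.c, hello.sh) are hypothetical, and a C compiler such as gcc is assumed to be available:

<code bash>
# Binary executable:  compile and link C source into machine code, then run it
gcc -o hello hello.c
./hello

# Script:  the bash interpreter translates the commands in hello.sh as it reads them
bash hello.sh
</code>
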
Getting work done on your laptop or desktop computer usually involves a graphical user interface in which your key presses, gestures, and taps or clicks are interpreted to execute programs and enter data.  A less intuitive -- but far more powerful -- command-line interface relies on your typing textual commands to accomplish the same tasks.

===== The command-line interface =====

The default command-line interface (CLI) on our HPC systems is the Bash shell.  The syntax and grammar of Bash encompass most of the typical constructs of computer programming languages, but Bash focuses heavily on executing other programs rather than on computation or data processing.  The programs Bash executes on your behalf implement the computation and data processing tasks most closely associated with your work.

Getting work done on our HPC systems requires knowledge of the Bash shell.  Your efficiency and productivity are, to an extent, directly proportional to your familiarity with it.  Many excellent tutorials exist online that introduce the Bash shell: see [[https://swcarpentry.github.io/shell-novice/|this Software Carpentry tutorial]], for example.

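For instance, a short interactive session might look like the following; the file paths are hypothetical:

<code bash>
pwd                      # print the current working directory
ls -l data/              # list the contents of a directory in long format
wc -l data/results.txt   # count the lines in a text file
</code>
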
===== Representation of work =====

If the work you do on a computer system consists of a series of Bash commands typed on a keyboard, then saving those commands in a file and telling Bash to read from that file (rather than the keyboard) gets the same job done.  Creating such a //Bash script// allows the work to be repeated at any time in the future simply by having a Bash shell read commands from that file.

The work you wish to get done on Caviness should be encapsulated in a Bash script.  In this way, a //job script// can be executed at some arbitrary time in the future.  Job scripts should require no interaction with a user, so that your work can complete even while you are not logged in to the cluster.

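A job script is, at its simplest, a file of Bash commands; the program names and file names in this sketch are hypothetical:

<code bash>
#!/bin/bash
# analysis.sh:  a self-contained, repeatable unit of work
cd "$HOME/project"
./preprocess --input raw_data.csv --output clean_data.csv
./analyze clean_data.csv > results.txt
</code>
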
===== Job scheduling =====

At any time, the hundreds of users of our HPC systems have more work prepared than there are resources in the system.  All of those job scripts are submitted to a piece of software whose job is:

  * storing and managing all of the job scripts
  * prioritizing all of the job scripts
  * matching the resources requested by each job to available resources
  * executing job scripts when and where resources are available
  * reacting to completion of the job

The Slurm //job scheduler// handles these tasks on the Caviness cluster.

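In practice, you hand your job script to Slurm and it runs the script on your behalf; sbatch and squeue are the standard Slurm commands for submitting and monitoring jobs (the script name here is hypothetical):

<code bash>
sbatch analysis.sh    # submit the job script; Slurm replies with the assigned job id
squeue -u $USER       # list your jobs that are still pending or running
</code>
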
==== What are resources? ====

On Caviness, the important resources you must consider for each job are:

  * Traditional CPU cores
  * System memory (RAM)
  * Coprocessors (NVIDIA GPUs)
  * Wall time((wall time = elapsed real time))

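Each of these maps onto an option in the job script header; a sketch with illustrative values follows (consult the cluster documentation for the exact options in use on Caviness):

<code bash>
#!/bin/bash
#SBATCH --ntasks=4        # CPU cores (one task per core here)
#SBATCH --mem=8G          # system memory (RAM)
#SBATCH --gres=gpu:1      # one GPU coprocessor
#SBATCH --time=02:00:00   # wall time limit of two hours
</code>
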
Though default values exist for each, you are encouraged to always state explicitly the amounts required by a job.  In general, requesting more resources than your job can effectively (or efficiently) use:
  - can delay the start of your job (e.g. it takes longer for 10 nodes to be free versus a single node)
  - may decrease your workgroup's relative job priority versus other workgroups (further delaying future jobs)

==== Queues and partitions ====

With other job schedulers, a //queue// is an ordered list of work to be performed.  There may be one or more queues, and each job is submitted to a specific queue.  Each queue has a set of hardware resources associated with it on which the queue can execute jobs.

Slurm starts from the other end and uses a //partition// to represent a set of hardware resources on which jobs can execute.  A single queue contains all jobs, and the partition selected for each job constrains which hardware resources can be used.
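
The partition is selected with another job script option; the partition name here is hypothetical:

<code bash>
#SBATCH --partition=standard   # run this job on the hardware in the "standard" partition
</code>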
  