====== Getting work done:  what is job scheduling? ======

The idea of //work// on a computer system means entering data (input), running programs that make use of the data, and analyzing the data produced (output).  The term //programs// refers to two kinds of entities on Linux systems:

  * binary executable:  source code (Fortran, C/C++, etc.) is compiled and linked to produce low-level machine code that is executed directly on the computer's CPU(s)
  * script:  source code is read and //interpreted// on-the-fly into low-level machine code that is executed on the computer's CPU(s)

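Both routes can be illustrated with a minimal sketch; the file names (hello.c, hello.sh) are hypothetical, and a C compiler such as gcc is assumed to be available:

<code bash>
# Binary executable:  compile and link C source into machine code, then run it
gcc -o hello hello.c
./hello

# Script:  the bash interpreter translates the commands in hello.sh as it reads them
bash hello.sh
</code>
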
Getting work done on your laptop or desktop computer usually involves a graphical user interface in which your key presses, gestures, and taps or clicks are interpreted to execute programs and enter data.  A less intuitive -- but far more powerful -- command-line interface relies on your typing textual commands to accomplish the same tasks.

===== The command-line interface =====

The default command-line interface (CLI) on our HPC systems is the Bash shell.  The syntax and grammar of Bash encompass most of the typical constructs of computer programming languages, but Bash focuses heavily on executing other programs rather than on computation or data processing.  The programs Bash executes on your behalf implement the computation and data processing tasks most closely associated with your work.

Getting work done on our HPC systems requires knowledge of the Bash shell.  Your efficiency and productivity are, to an extent, directly proportional to your familiarity with it.  Many excellent tutorials exist online that introduce the Bash shell: see [[https://swcarpentry.github.io/shell-novice/|this Software Carpentry tutorial]], for example.

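For instance, a short interactive session might look like the following; the file paths are hypothetical:

<code bash>
pwd                      # print the current working directory
ls -l data/              # list the contents of a directory in long format
wc -l data/results.txt   # count the lines in a text file
</code>
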
===== Representation of work =====

If the work you do on a computer system consists of a series of Bash commands typed on a keyboard, then saving those commands in a file and telling Bash to read from that file (rather than the keyboard) gets the same job done.  Creating such a //Bash script// allows the work to be repeated at any time in the future simply by having a Bash shell read commands from that file.

The work you wish to get done on Caviness should be encapsulated in a Bash script.  In this way, a //job script// can be executed at some arbitrary time in the future.  Job scripts should require no interaction with a user, so that your work can complete even while you are not logged in to the cluster.

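A job script is, at its simplest, a file of Bash commands; the program names and file names in this sketch are hypothetical:

<code bash>
#!/bin/bash
# analysis.sh:  a self-contained, repeatable unit of work
cd "$HOME/project"
./preprocess --input raw_data.csv --output clean_data.csv
./analyze clean_data.csv > results.txt
</code>
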
===== Job scheduling =====

At any time, the hundreds of users of our HPC systems have more work prepared than there are resources in the system.  All of those job scripts are submitted to a piece of software whose job is:

  * storing and managing all of the job scripts
  * prioritizing all of the job scripts
  * matching the resources requested by each job to available resources
  * executing job scripts when and where resources are available
  * reacting to completion of the job

The Slurm //job scheduler// handles these tasks on the Caviness cluster.

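In practice, you hand your job script to Slurm and it runs the script on your behalf; sbatch and squeue are the standard Slurm commands for submitting and monitoring jobs (the script name here is hypothetical):

<code bash>
sbatch analysis.sh    # submit the job script; Slurm replies with the assigned job id
squeue -u $USER       # list your jobs that are still pending or running
</code>
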
==== What are resources? ====

On Caviness, the important resources you must consider for each job are:

  * Traditional CPU cores
  * System memory (RAM)
  * Coprocessors (NVIDIA GPUs)
  * Wall time((wall time = elapsed real time))

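Each of these maps onto an option in the job script header; a sketch with illustrative values follows (consult the cluster documentation for the exact options in use on Caviness):

<code bash>
#!/bin/bash
#SBATCH --ntasks=4        # CPU cores (one task per core here)
#SBATCH --mem=8G          # system memory (RAM)
#SBATCH --gres=gpu:1      # one GPU coprocessor
#SBATCH --time=02:00:00   # wall time limit of two hours
</code>
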
Though default values exist for each, you are encouraged to always state explicitly the amounts required by a job.  In general, requesting more resources than your job can effectively (or efficiently) use:
  - can delay the start of your job (e.g. it takes longer for 10 nodes to be free versus a single node)
  - may decrease your workgroup's relative job priority versus other workgroups (further delaying future jobs)

==== Queues and partitions ====

With other job schedulers, a //queue// is an ordered list of work to be performed.  There may be one or more queues, and each job is submitted to a specific queue.  Each queue has a set of hardware resources associated with it on which the queue can execute jobs.

Slurm starts from the other end and uses a //partition// to represent a set of hardware resources on which jobs can execute.  A single queue contains all jobs, and the partition selected for each job constrains which hardware resources can be used.
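
The partition is selected with another job script option; the partition name here is hypothetical:

<code bash>
#SBATCH --partition=standard   # run this job on the hardware in the "standard" partition
</code>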
  