If you want to run a workload on a cluster, you are often required to submit it through a so-called queuing system. Within the HPC space there are various queuing systems. This article offers a generic overview of what this entails and pointers to the usage of the most commonly installed queuing systems.
What exactly is a queuing system?
A queuing system, also known as a job scheduler or batch system, manages the resources available for computations. Furthermore, it provides insight into the available and utilized CPU power and memory. Its main purpose is to let end users submit and manage their workload, and to control the impact of that workload on the cluster resources.
Every queuing system shares the same basic functionality:
- Monitor the status of the nodes (up, down, load average)
- Monitor all available resources (available CPU cores, memory on the nodes)
- Monitor the state of the jobs (queued, on hold, deleted, done)
- Control the jobs (freeze/hold the job, resume the job, delete the job)
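The job states and control actions listed above can be pictured as a small state machine. The following is a minimal sketch, not the behaviour of any particular queuing system: the names `JobState`, `TRANSITIONS` and `control` are hypothetical, the `RUNNING` state is an assumption added for illustration, and the allowed transitions are deliberately simplified.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    ON_HOLD = "on hold"
    RUNNING = "running"  # assumption: not in the list above, added for illustration
    DONE = "done"
    DELETED = "deleted"


# Allowed state changes per control action (hypothetical and simplified).
TRANSITIONS = {
    "hold": {JobState.QUEUED: JobState.ON_HOLD},
    "resume": {JobState.ON_HOLD: JobState.QUEUED},
    "delete": {
        JobState.QUEUED: JobState.DELETED,
        JobState.ON_HOLD: JobState.DELETED,
        JobState.RUNNING: JobState.DELETED,
    },
}


def control(state: JobState, action: str) -> JobState:
    """Return the new state after a control action, or raise ValueError."""
    try:
        return TRANSITIONS[action][state]
    except KeyError:
        raise ValueError(f"cannot {action} a job that is {state.value}") from None
```

In this sketch a finished job cannot be held or deleted, so `control(JobState.DONE, "hold")` raises a `ValueError`. Real queuing systems typically define more states and finer-grained rules.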
Some queuing systems offer advanced options, such as prioritizing jobs, providing statistical data and in-job checkpointing mechanisms to freeze a job.
Instead of you assigning cores and memory per node and doing all sorts of difficult tracking yourself, the queuing system manages this for you. Furthermore, it can clean up after a job has finished to ensure that the resources are cleanly available for future computations.
When a queuing system is installed and set up properly, users will find it very handy: they request certain resources, submit the job and let the queuing system handle everything else.
Whenever a job is submitted, the queuing system checks the resources requested by the job script. It assigns cores and memory to the job and sends the job to the nodes for computation. If the required number of cores or amount of memory is not yet available, it queues the job until these resources become available.
The queuing system keeps track of the status of the job and returns the resources to the available pool when the job has finished, whether it was deleted, crashed or completed successfully.
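The submit, queue and release cycle described above can be sketched in a few lines. This is an illustrative toy model under stated assumptions, not how any real scheduler is implemented: the `Job` and `Pool` classes are hypothetical, the cluster is treated as a single pool of cores and memory rather than individual nodes, and waiting jobs are started strictly in submission order.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cores: int
    memory_gb: int


@dataclass
class Pool:
    free_cores: int
    free_memory_gb: int
    running: list = field(default_factory=list)
    waiting: deque = field(default_factory=deque)

    def _fits(self, job: Job) -> bool:
        return job.cores <= self.free_cores and job.memory_gb <= self.free_memory_gb

    def _start(self, job: Job) -> None:
        # Assign the requested cores and memory to the job.
        self.free_cores -= job.cores
        self.free_memory_gb -= job.memory_gb
        self.running.append(job)

    def submit(self, job: Job) -> None:
        """Start the job if nothing is waiting and it fits, otherwise queue it."""
        if not self.waiting and self._fits(job):
            self._start(job)
        else:
            self.waiting.append(job)

    def finish(self, job: Job) -> None:
        """Return the job's resources to the pool, then start waiting jobs that fit."""
        self.running.remove(job)
        self.free_cores += job.cores
        self.free_memory_gb += job.memory_gb
        while self.waiting and self._fits(self.waiting[0]):
            self._start(self.waiting.popleft())
```

For example, with a pool of 8 cores, a 6-core job starts immediately, a following 4-core job is queued, and the second job starts as soon as `finish` is called for the first. Whether a job was deleted, crashed or completed makes no difference to `finish` here; the resources go back to the pool either way.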
Jobs can only be submitted wrapped in a script, a so-called job script. This script looks much like a shell script, and you can put the commands and variables needed for your job in it, e.g. to load applicable modules or set environment variables. You can also include directives for the queuing system, for example to request certain resources, control the output or set an email address.
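To illustrate how directives and ordinary commands live side by side in a job script, the sketch below embeds a made-up job script and extracts its directives. Everything specific in it is an assumption for illustration: the `#QUEUE` prefix, the option names, the application name and the paths are hypothetical, and each real queuing system defines its own directive prefix and options.

```python
# A made-up job script: directive lines are comments to the shell,
# but carry requests for the queuing system.
JOB_SCRIPT = """\
#!/bin/bash
#QUEUE --cores=4
#QUEUE --memory=8G
#QUEUE --output=result.log
#QUEUE --mail=user@example.com

module load my-application
export INPUT_DIR=/path/to/input
my-application "$INPUT_DIR"
"""

PREFIX = "#QUEUE"  # hypothetical; not the prefix of any specific queuing system


def read_directives(script: str) -> dict:
    """Collect '--key=value' directives from lines that start with PREFIX."""
    directives = {}
    for line in script.splitlines():
        if not line.startswith(PREFIX):
            continue  # shebang, blank lines and ordinary shell commands
        option = line[len(PREFIX):].strip()
        key, _, value = option.lstrip("-").partition("=")
        directives[key] = value
    return directives
```

Calling `read_directives(JOB_SCRIPT)` yields the requested cores, memory, output file and email address, while the `module load` and `export` lines are left for the shell to execute when the job runs.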
Please see the job-scripts articles for more information and available templates.