Rocky Slurm

From NIMBioS
__TOC__
 
= About =


Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

You can learn more information at the Slurm website.

= Commands =


Slurm has many commands.  Here are a few you'll use when submitting jobs:


{| class='wikitable'
|-
| '''srun''' || A blocking command that submits a job to the cluster and runs it in real time.
|-
| '''sbatch''' || A non-blocking command that queues a job to be run as resources allow.
|-
| '''squeue''' || Shows information about queued and running jobs.
|-
| '''scancel''' || Cancels a queued or running job.
|}
= Submitting Jobs =
The most common and preferred way to submit jobs is to write a shell script and submit it with <code>sbatch</code>.
'''myslurmjob.sh'''
<pre>
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=output.log
srun myprogram
</pre>
In the script above, the first line names the script interpreter, the <code>#SBATCH</code> lines are parameters passed to the <code>sbatch</code> command, and finally <code>srun</code> runs our program.  The parameters we set are simply a name for the job and the name of the log file that will receive the job's output.
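<code>sbatch</code> accepts many other directives beyond <code>--job-name</code> and <code>--output</code>.  A slightly fuller sketch of a job script (the time limit, task counts, and partition name below are example values — adjust them to your job and your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # name shown in squeue
#SBATCH --output=output.log       # file that receives the job's output
#SBATCH --ntasks=1                # number of tasks to launch
#SBATCH --cpus-per-task=1         # CPU cores per task
#SBATCH --time=00:10:00           # wall-clock limit (HH:MM:SS)
#SBATCH --partition=compute_all   # partition to run in (example name)

srun myprogram
```

This is a configuration sketch, not something runnable outside a Slurm cluster; jobs that exceed <code>--time</code> are terminated by the scheduler.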
Now that the script is written, we can queue it up using the <code>sbatch</code> command.
<pre>
sbatch myslurmjob.sh
</pre>
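When <code>sbatch</code> accepts the job, it prints a message like <code>Submitted batch job 2947</code>.  If a script needs the job id, it can pull it out of that message; a minimal sketch using a canned copy of the message (on a real cluster you would capture <code>sbatch</code>'s actual output, or use its <code>--parsable</code> flag, which prints just the id):

```shell
# Canned sbatch-style message (2947 is just an example id).
msg='Submitted batch job 2947'

# The id is the fourth whitespace-separated field.
jobid=$(printf '%s\n' "$msg" | awk '{print $4}')
echo "$jobid"    # prints 2947
```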
You can see your job in the queue using the <code>squeue</code> command.
<pre>
[test_user@rocky7 ~]$ squeue
            JOBID PARTITION      NAME    USER ST      TIME  NODES NODELIST(REASON)
              2947 compute_all    myjob test_use  R      0:05      1 moose1
</pre>
This will show you the job id assigned to your job as well as which nodes your job is running on.
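On a busy cluster the queue can be long, so it helps to filter <code>squeue</code>'s output.  A minimal sketch using <code>awk</code> on a canned copy of the output above (on a real cluster you would pipe <code>squeue</code> itself, or use its <code>--noheader</code> and <code>--format</code> options):

```shell
# Canned squeue output copied from the example above.
out='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2947 compute_all myjob test_use R 0:05 1 moose1'

# Skip the header row and print the job id of any job named "myjob".
printf '%s\n' "$out" | awk 'NR > 1 && $3 == "myjob" {print $1}'    # prints 2947
```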
You can cancel your job by using the <code>scancel</code> command with the job id.
<pre>
scancel 2947
</pre>
= Examples =
[[Rocky_Slurm_Basic_Multi | Basic Multiple Node/Tasks Examples]]

Latest revision as of 20:41, 25 August 2022
