Using the cluster
Queues and limits
You can submit jobs to any of the following SLURM partitions (otherwise known as queues); a minimal job-script header showing partition selection is sketched after the list:
- small
Jobs requiring up to 24 cores on the same node. Maximum run-time 30 days.
- large
Jobs requiring more than 24 cores, which may be distributed across multiple nodes. There is no upper limit on the number of nodes (other than how many there are), but remember that you are using a shared resource. Maximum run-time 30 days.
- gpu
Jobs requesting one or more of the cluster's GPU nodes. Maximum run-time 30 days. Note that you still need to request the GPUs with the --gres flag. The following line would request two GPUs on a node:
#SBATCH --gres=gpu:2
- debug (NOT IMPLEMENTED YET)
Small and short jobs, usually meant for tests or debugging. This partition is limited to one node and a maximum run-time of two hours.
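As an illustration of selecting a partition, here is a minimal job-script header requesting the gpu partition with one GPU (the job name, core count and time limit are placeholders; adjust them to your job):
#!/bin/bash
#SBATCH --job-name=my-test
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00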
Submitting and monitoring jobs
Calculations need to be submitted to the SLURM queues and will be scheduled to run on the requested compute nodes when those become available. A job script is submitted using the sbatch command:
sbatch <job-script>
Example job scripts for different programs can be found in the directory /usr/local/examples. Please copy one of those and adjust it to your requirements.
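For example, to start from the VASP example (the name my-job.sh is just an illustration):
cp /usr/local/examples/job-vasp my-job.sh
# edit my-job.sh to suit your calculation, then submit:
sbatch my-job.sh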
You can monitor submitted batch jobs with the squeue command. On its own, it will show all jobs currently in the queues. To show only your own jobs, use the -u flag:
squeue -u <user>
The output will be similar to the following:
  JOBID PARTITION     NAME USER ST       TIME NODES NODELIST(REASON)
1002462     large B6N3_opt akg5  R 3-14:41:13     1 compute067
1002556     large  SAKG_5E akg5  R 2-03:38:21     1 compute066
1002654     small  SKG_7aE akg5  R    4:48:38     1 compute066
The meaning of the different fields is as follows:
JOBID: Job number. You need this to find more information about the job or to cancel it (see below).
PARTITION: The partition (queue) the job was submitted to.
NAME: Job name as given with the -J flag in the sbatch command or the #SBATCH -J directive in the job script.
USER: The owner of the job.
ST: Status: R (running), PD (pending, i.e. waiting to start) or CG (completing, i.e. finishing).
TIME: Run-time so far.
NODES: Number of nodes requested.
NODELIST(REASON): Nodes in use (if running); reason for not running otherwise:
(Resources): Waiting for the requested nodes to become available.
(Priority): There are other jobs with higher priority.
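If you need more detail about a particular job than squeue shows, the standard SLURM tool scontrol will display it, given the job number:
scontrol show job <jobid>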
A batch job can be deleted with the command
scancel <jobid>
If the job is already running, it will be stopped.
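For example, to cancel the first job in the squeue output shown above:
scancel 1002462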
Example job
Here is a typical batch job:
#!/bin/bash
#SBATCH --job-name=vasp5
#SBATCH --partition=large
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --output=%x_%j.log
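# (%x expands to the job name, %j to the job ID)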
# Load required modules
module unload mpi
module load openmpi/5.0.5
module load ucx/1.16.0
export LD_LIBRARY_PATH=/software/GCCLIBS/lib/:$LD_LIBRARY_PATH
# Run vasp
srun --mpi=pmix_v3 /software/VASP/vasp.5.4.4.pl2/bin/vasp_gam
The job script consists of two sections: SLURM directives determining the requested resources and the job's appearance in the queue, and the commands to be executed once the job is running. This job will show up as "vasp5" in the squeue output. It requests one node with 64 cores, and will run in the large partition. The execution part sets up the path and environment for a specific compiler/MPI combination by loading the appropriate modules, and then runs the application in parallel using MPI. NOTE THAT THIS JOB IS INCOMPLETE. LOOK AT /usr/local/examples/job-vasp FOR RUNNING VASP!
Installing your own software
If you require software that needs to be installed in a location other than your home directory, or that may be of use to other users, you can request its installation in a publicly accessible location. Your fellow users will thank you for it. It is, however, possible and allowed to install software in your own directory. It is your responsibility to ensure that any software you install, and your use of it, is legal, ethical and serves the purposes of your research. This means in particular that licensed software may only be used within the conditions of its license. The method of installation will depend on the software in question, but the most commonly used approaches are CONDA and compilation from source.
CONDA
Many popular open source Python (and other) packages are available within the CONDA framework. Hypatia provides a location for installing your packages in the directory /gpfs01/software/conda/<user>. The command for setting up your initial environment is
install-conda
Accept all defaults. It will ask permission to add its initialisation to your .bashrc file. This is OK for most purposes and simplifies future use.
If you need multiple packages with incompatible dependencies, please familiarise yourself with conda environments.
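As a minimal sketch (the environment name and package list are purely illustrative), a dedicated environment is created, activated and populated like this:
conda create -n myenv python=3.11
conda activate myenv
conda install numpy scipy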
Compiling your own software
Software distributed as source code (Fortran, C or C++, and possibly parallelised using MPI) needs to be compiled before execution. Several compilers and MPI versions are installed. For most purposes, the GNU compilers, in combination with Open MPI, give the most efficient and reliable executables. The compilers are in your default search path. You set the environment for MPI by loading the appropriate modules. Here is an example:
# Set up environment for Open MPI with gnu compilers
module load openmpi/5.0.5
module load ucx/1.16.0
export UCX_WARN_UNUSED_ENV_VARS=n
# Set path for OpenBLAS, fftw3 and hdf5
export LD_LIBRARY_PATH=/software/GCCLIBS/lib/:$LD_LIBRARY_PATH
This also shows how to set the library path for OpenBLAS and other commonly used libraries.
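With those modules loaded, compiling an MPI Fortran program might look like the following sketch (the source file name and the linked libraries are placeholders for your own code's requirements):
mpif90 -O2 -o myprog myprog.f90 -L/software/GCCLIBS/lib -lopenblas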
A common problem with older Fortran MPI programs is a compilation failure because of "argument mismatch". If this happens, please add the flag "-fallow-argument-mismatch" to the Fortran flags in your makefile. Please ask if you need advice on this.
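For instance, in a makefile that uses the conventional FFLAGS variable (yours may name it differently), the flag can be appended like this:
FFLAGS += -fallow-argument-mismatch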