High Performance Computing

Overview of UMS ACG HPC systems:

The main High Performance Computing resource that the UMS Advanced Computing Group provides is a growing set of computers (nodes) and storage systems linked together with a high-speed, low-latency InfiniBand network. The general term used for this type of system is a "Compute Cluster," which is made up of several different types of nodes (described below).

Programs can be run on individual cores on a single node, or they can run across multiple nodes using hundreds of cores at the same time. The General Purpose GPU nodes have multiple GPUs in them, and the GPUs can be used individually by separate programs or together by a single program.

All of the compute nodes are managed by a central Resource Manager and Scheduler called SLURM. SLURM keeps track of all of the resources of the cluster and the jobs that are run on the cluster. In order to run a program on the cluster, a "job" needs to be submitted. A "job" is a request for resources (for instance, what type of nodes to use, how many nodes and CPUs, how much memory, and for how long these resources are needed), along with a command to run a program that will use those resources. The different types of nodes are grouped into SLURM entities called "partitions".
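
As a concrete illustration, a minimal job script might look like the sketch below. The partition name, resource amounts, and program are placeholders only, not actual values for our clusters; check the partition list and documentation for the real names.

    #!/bin/bash
    #SBATCH --job-name=example        # name shown in the queue
    #SBATCH --partition=<partition>   # placeholder: replace with an actual partition name
    #SBATCH --nodes=1                 # number of nodes requested
    #SBATCH --ntasks=1                # number of tasks (processes) to run
    #SBATCH --cpus-per-task=4         # CPU cores per task
    #SBATCH --mem=8G                  # memory per node
    #SBATCH --time=01:00:00           # wall-clock time limit (HH:MM:SS)

    # command(s) to run with the requested resources
    ./my_program input.dat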

Penobscot HPC system:

The Penobscot HPC system is our newest cluster setup. While traditional terminal-based SSH connections are still possible, Penobscot also runs Open OnDemand, which allows you to do everything from within a web browser. This makes it much easier to get started. This system also has our three newest GPU nodes with 9 new GPUs. We will continue to transition nodes from the Katahdin cluster to Penobscot. In general, both clusters use the same SLURM partition names to help make it easier to transition jobs.

The basic software is already installed, and the same "module" program is used to manage environments. We will be adding software regularly.

Katahdin Legacy Cluster:

We envision that there will be some cases where people will want to continue using the Katahdin environment for a while. The main reason for this is that there might be certain packages/modules that do not work on Penobscot. For instance, we have identified one package that requires Python 2.7, which is not supported on Penobscot. For this reason, we will be keeping a subset of nodes on Katahdin. Please contact us (acg@maine.edu) if you need help moving your workload over to Penobscot, or if you think you need to keep running on Katahdin for a while.

Types of nodes and their SLURM partitions:

There are a few types of nodes/partitions available:

CPU Nodes:

GPU Nodes: a total of 25 GPUs across 5 systems
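
For GPU work, the key additions to a basic job script are the GPU partition and a request for one or more GPUs. The lines below are a sketch only; the partition name is a placeholder, since the actual partition names are specific to our clusters.

    #SBATCH --partition=<gpu-partition>   # placeholder: replace with the actual GPU partition name
    #SBATCH --gres=gpu:1                  # request one GPU on the node

    nvidia-smi    # once the job starts, shows the GPU(s) allocated to it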

How to use the HPC resources:

Open OnDemand: 

Open OnDemand makes all cluster interaction possible through a web browser. Log in to the login node at https://login.acg.maine.edu, then manage files, submit jobs, and run interactive programs such as RStudio, Jupyter, MATLAB, and a terminal directly on the nodes (including GPU nodes). You can even bring up a graphical desktop to help with development and debugging. All of this is done from within a web browser, with no need to install software or set up tunnels.

SLURM Job Scheduler:

Programs can be run on the nodes by running a command to submit a "job script" to the SLURM scheduler. SLURM keeps track of all of the resources (e.g. CPU, GPU, memory) on all of the nodes, as well as all of the programs being run on the nodes. When new jobs are submitted, it is up to the SLURM scheduler to figure out which nodes should run the program and to start it once the requested resources become available.
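
In practice, a job script is submitted and monitored with a few standard SLURM commands; a typical session might look like the sketch below (the script name and job ID are placeholders).

    sbatch myjob.sh             # submit the job script; SLURM prints the assigned job ID
    squeue -u $USER             # list your own pending and running jobs
    scontrol show job <jobid>   # show detailed information about one job
    scancel <jobid>             # cancel a pending or running job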

Software:

A wide range of software has been compiled or installed on the HPC system, including scientific packages, libraries, and compilers; some packages even have multiple versions installed. To make it easy to manage which packages are in use, either interactively on the Login node or in a job submitted to SLURM, the software is managed by the Lmod Environment Module system. The "module" program can be run in a terminal to manage which software is currently active in the terminal/shell environment. Since compilers are provided, people can also compile and install programs and packages into their own account. Similarly, since the Anaconda Python system is provided as a module, people can create their own Anaconda virtual environments and install Python packages into them. This puts full control into each user's hands.
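
For example, a terminal session using the module system and a personal Anaconda environment might look like the sketch below. The module and environment names are placeholders; run "module avail" to see what is actually installed.

    module avail                              # list the software modules that are available
    module load anaconda                      # placeholder name: load the Anaconda module
    conda create -n myenv python=3.11 numpy   # create a personal virtual environment
    conda activate myenv                      # use the environment in the current shell
    module list                               # show which modules are currently loaded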

Containers:

Containers offer a way to install and run non-standard, complex, or hard-to-install programs on a wide range of hardware and Operating Systems. A container encapsulates everything that is needed for the program to run into a single file. That file can then be copied to a system that can run containers, even if that other system is running a completely different Operating System or Linux distribution.

The ACG HPC systems use Apptainer (formerly known as Singularity) to run containers. Apptainer is compatible with Docker, so a huge range of pre-packaged software is available for use. For instance, Nvidia provides a large catalog of Docker containers with a wide range of software that is optimized to run on their GPU systems. We have converted some of these containers into the Apptainer/Singularity format and made them available as a software module on our systems.
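
As an illustration, pulling a public Docker image and running it under Apptainer generally looks like the sketch below; the image and file names are examples only, not the pre-converted containers mentioned above.

    apptainer pull python_3.11.sif docker://python:3.11   # convert a Docker image into a .sif file
    apptainer exec python_3.11.sif python3 --version      # run a command inside the container
    apptainer exec --nv gpu_container.sif nvidia-smi      # hypothetical GPU image; --nv exposes the host GPUs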