ARCCA Hawk Supercomputer - Service Fully Restored

Incident Report for Cardiff University

Resolved

Hawk is now fully operational and full service has been restored.
Posted Oct 02, 2023 - 11:45 BST

Identified

The ARCCA Hawk supercomputer is now accepting jobs and the majority of the Hawk service is available.

However, the high memory partition is still unavailable. Remedial work continues on that partition and on a minority of nodes that still have issues.
Posted Sep 22, 2023 - 16:22 BST

Update

Together with colleagues and supplier partners, we are continuing to investigate the outage affecting a number of Hawk nodes since the weekend power failure.

We are endeavouring to bring the service back online as soon as we can. However, until we are sufficiently confident that this issue has been resolved, the compute partitions within the job scheduling system (Slurm) will remain offline to prevent job submission.

While this also impacts user access to the ARCCA OnDemand (web interface) service, please note that the login nodes are accessible should you need to access your data.

Thank you once again for your patience.
Posted Sep 19, 2023 - 14:37 BST

Investigating

We are experiencing an issue with the University's Hawk supercomputing service. At present, while service users can access files on Hawk, they are unable to run HPC jobs. We are urgently working to resolve the issue.
Posted Sep 19, 2023 - 09:33 BST
This incident affected: Research (ARCCA Processing and Analysis (Hawk)).