Update - ARCCA Hawk (University Supercomputer Service) - Maintenance Update
Please be advised that the planned maintenance to the ARCCA Hawk supercomputer (OCF Hawk software stack deployment - OCF Steel) is currently behind the advertised schedule, so the service will not be restarted by close of play on Friday, 27th January.
OCF encountered issues while setting up the base configuration. These delayed the subsequent installation tasks and set back the project's timetable. We are issuing this update following our regular contact with OCF, which includes daily end-of-day meetings.
We envisage the full service being restored in the three steps summarised below:
1. Pilot service on “core” partitions (Expected: close of play, Tuesday 31st January):
We anticipate that OCF will have the core partition nodes and Slurm (the workload manager) configured by close of play on Friday, 27th January, with a pilot service for users to commence by close of play on Tuesday, 31st January.
Note that this pilot service will:
a. Provide users with full access to their data.
b. Enable users to run jobs on the “core” compute partitions.
c. Enable any issues to be identified before the service moves to full production status.
d. Run alongside the Software Stack Acceptance tests to save time. Because those tests take precedence, all users need to be aware, and accept, that any job may be killed at short notice.
2. Researcher expansion systems (Expected: week commencing 30th January):
Researcher expansion systems will be configured throughout the week commencing 30th January. These include the LIGO and Sêr Cymru systems, as well as the additional dedicated research expansion Slurm partitions.
3. Production service (Expected: week commencing 6th February):
We are confident that all work will be completed by the end of next week, i.e. by close of play on Friday, 3rd February. The service will then re-enter full production on Monday, 6th February.
Should any further issues emerge in the days ahead, we will notify the user community through the usual channels.
In the meantime, we apologise for the inconvenience caused by this delay.
Christopher Dickson – Service Manager
Jan 27, 2023 - 16:41 GMT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 16, 2023 - 08:01 GMT
Hawk (University Supercomputer Service)
Planned Maintenance - Monday, 16th January 2023 to Friday, 27th January 2023
What is being done?
ARCCA and OCF, the new software supplier for Hawk, are undertaking planned maintenance on the Hawk supercomputer and its supporting infrastructure.
When is this being done?
Following discussions with OCF, and taking account of community feedback, we have agreed the timetable for installing the new software environment on Hawk.
This work will require a full outage beginning at 08:00 on Monday, 16th January and lasting until 17:00 on Friday, 27th January 2023.
Should the work be completed early, we will of course endeavour to bring Hawk back into service as soon as possible.
Who will be affected?
Users of Research Computing services provided by ARCCA or Supercomputing Wales (SCW).
How will this affect service users?
Hawk and its supporting infrastructure will be unavailable during the specified window.
Why is this being done?
This work fulfils a contractual obligation to upgrade the High-Performance Computing (HPC) software environment to OCF's supported deployment.