Office of Research Computing

Biannual ORC Maintenance

The Hopper cluster will undergo its biannual maintenance this January. Hopper will be unavailable from 6:00AM on Thursday, 01/16/2025 through the end of the day on Friday, 01/17/2025 for scheduled maintenance. We plan to upgrade the compute, network, and storage infrastructure of the cluster during this maintenance window.

Maintenance Schedule:  
– Start Time: Thursday, January 16, 6:00AM  
– End Time: Friday, January 17, End of Day  

Affected services:  

1. Login/Head and Compute Nodes: All login/head and compute nodes will be inaccessible. Users will not be able to log in or submit jobs during this time.  

2. Open On-Demand (OOD): The Open On-Demand (OOD) interface will be inaccessible. Users are advised to plan their activities accordingly and save any unsaved work.  

3. Storage and Related Services: Most storage services, including Samba and Globus, will have very limited and intermittent availability.   

4. Virtual Machines: Most virtual machines will have intermittent availability. Users with virtual machines should be prepared for potential interruptions in service.  

All SLURM partitions will be drained when the maintenance window starts. Any jobs started between now and the maintenance period must be timed to end before the maintenance window begins. When starting a job, make sure to set the time parameter in SLURM to less than 7 days and reduce it as the maintenance window gets closer.

We have provided a script named 'time-to-maintenance-window.py' which you can run to generate an appropriate --time parameter for your job scripts. For example:

$ time-to-maintenance-window.py
A maintenance window is scheduled to start on 2025-01-16 06:00:00.
Please use the following or a smaller time limit. Note that some partitions have time limits shorter than 2-16:00:00.

#SBATCH --time=2-16:00:00
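For reference, the sketch below shows the kind of calculation such a script performs. It is an illustration only, not the ORC-provided script, and assumes GNU date and the window start time shown above.

#!/bin/bash
# Illustration only: print a SLURM --time value that ends before the
# maintenance window starting 2025-01-16 06:00 (assumes GNU date).
target=$(date -d "2025-01-16 06:00" +%s)
now=$(date +%s)
secs=$(( target - now ))
if (( secs <= 0 )); then
  echo "The maintenance window has started; wait until it ends to submit jobs." >&2
  exit 1
fi
days=$(( secs / 86400 ))
rem=$(( secs % 86400 ))
printf "#SBATCH --time=%d-%02d:%02d:%02d\n" "$days" $(( rem / 3600 )) $(( rem % 3600 / 60 )) $(( rem % 60 ))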

 Please note that any jobs that are configured to run past the planned downtime dates will not start until after the maintenance is complete.   

The clusters should be back online by the end of 01/17/2025.    

  

If you have any questions or concerns, please email us at [email protected].   

Hopper and Argo Duo Multi-Factor Authentication Notice

To address compliance requirements and security threats, we are enabling Duo Multi-Factor Authentication (MFA) to minimize chances of unauthorized access to ORC resources and to protect your data.

This is a notice that the Hopper and Argo clusters will begin requiring Duo MFA starting Tuesday, 06/04/2024. Users will be prompted to accept a Duo MFA push notification to gain SSH access to the Hopper and Argo HPC clusters. Because users are already enrolled in GMU’s Duo instance, as required by other campus IT services, they should already have all the necessary tools to get Duo push notifications.

Affected services:

1. SSH to login/head nodes: All login/head nodes will require Duo MFA.

2. Other secure file transfer and synchronization tools that use SSH, such as rsync, scp, and sftp, will also be subject to Duo MFA.
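For example, each of the following commands will trigger a Duo push when connecting to Hopper (the NetID, file names, and paths shown are placeholders):

ssh [email protected]
scp results.tar.gz [email protected]:/scratch/NetID/
rsync -av data/ [email protected]:/scratch/NetID/data/
sftp [email protected]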

Scheduled Hopper and Argo Maintenance Downtime June 2 – 4, 2024 

Due to a scheduled data center power outage on Sunday, 06/02/2024, ORC will align its biannual maintenance window with the outage to minimize future disruptions. This is a reminder that the Hopper and Argo ORC clusters will be unavailable between 6:00AM on Sunday, 06/02/2024 and the end of the day on Tuesday, 06/04/2024 for scheduled maintenance. We plan to upgrade the compute, network, and storage infrastructure of the clusters during this maintenance window.

Maintenance Schedule: 

– Start Time: June 02, 6:00AM 
– End Time: June 04, End of Day 

Affected services: 

1. Login/Head and Compute Nodes: All login/head and compute nodes will be inaccessible. Users will not be able to log in or submit jobs during this time. 

2. Open On-Demand (OOD): The Open On-Demand (OOD) interface will be inaccessible. Users are advised to plan their activities accordingly and save any unsaved work. 

3. Storage and Related Services: Most storage services, including Samba and Globus, will have very limited and intermittent availability.  

4. Virtual Machines: Most virtual machines will have intermittent availability. Users with virtual machines should be prepared for potential interruptions in service. 

All SLURM partitions will be drained when the maintenance window starts. Any jobs started between now and the maintenance period must be timed to end before 6:00AM Sunday June 2.  When starting a job, make sure to set the time parameter in SLURM to less than 6 days and reduce it as the maintenance window gets closer, for example: 

#SBATCH --time=4-00:00:00   # Days-Hours:Mins:Secs; calculate backwards from 6:00AM on 6/2

Please note that any jobs that are configured to run past the planned downtime dates will not start.   
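A quick way to confirm that your queued and running jobs fit before the window is to compare each job's time limit and remaining time, for example:

squeue -u $USER -o "%.10i %.12P %.10l %.10L %.8T"

(The format codes list the job ID, partition, time limit, time left, and state.)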

The clusters should be back online by 06/04/2024.

If you have any questions or concerns, contact the ORC at [email protected].  

Winter Break Support – Important Information

Please be advised that George Mason University will be closed for Winter Break:

Monday, December 19, 2022 – Monday, January 2, 2023.

During the break, ORC resources such as the Hopper cluster are expected to be up and functioning as normal. ORC will respond to urgent catastrophic events; however, routine questions or tickets filed at [email protected] may not receive a response until after the break. Weekly regular activities such as the ORC New User tutorials will be suspended until university offices reopen on Tuesday, January 3, 2023.

Please continue to check the ORC Website for more updates and other upcoming events. 

The ORC wishes you a safe and happy holiday season.

ORC resources suffering from widespread network disruption

Network access to ORC resources is currently disrupted due to configuration changes made to integrate new network infrastructure hardware. The HOPPER and ARGO clusters, virtual host systems, and network data shares may be inaccessible, or only intermittently available, until the problem is resolved. Engineers are working with the equipment provider's support team to diagnose and resolve the issue. We understand that this is very disruptive and apologize for the inconvenience.

When it is available, further information regarding the estimated down time will be posted here and sent to the ARGO-USERS mailing list.

Update 11/9/2021 22:50. Engineers from Dell believe they have resolved the connectivity issues on the Dell hardware.  However, as of now the clusters remain unresponsive.  This may be due to storage server problems caused by the network outage or with campus networking.  We will be engaging with GMU IT support in the morning to perform additional analysis of the issue.  We hope to have all clusters and systems available before end-of-day Wednesday 11/10/2021.

Update 11/10/2021 12:30. All issues have been resolved and the HOPPER and ARGO clusters are available.

 

Announcing the Hopper Cluster (Hopper)

The ORC would like to invite you to use Hopper, its new high-performance computing cluster. Hopper is named in honor of the late Rear Admiral Grace Hopper, a computing pioneer and local resident. All new ORC cluster accounts will be created and activated on Hopper by default; however, existing Argo cluster account holders should send an email to [email protected] to request activation of their accounts on Hopper.

Hopper currently has a total of 70 compute nodes, each with 48 cores (Intel Cascade Lake) and 188 GB of available memory, plus one Nvidia DGX GPU node with 128 CPU cores (AMD EPYC/Milan), 1 TB of memory, and 8x A100 GPUs. Currently, 28 compute nodes and the GPU node are freely available to all users. The remaining nodes may also be used, but jobs running on them are subject to preemption by jobs run by the nodes' sponsors.

A large expansion of Hopper is planned for the Fall of 2021, which will add a substantial number of compute and GPU nodes, including very large memory nodes with up to 4 TB of memory. Users who require memory address spaces greater than 180 GB will need to continue to use the Argo cluster until the new large memory nodes become available in Hopper.

The Hopper cluster is configured in a similar but not identical fashion to Argo. The software modules are organized differently and there are differences in the partition names, defaults, and versions of software available. Please review the documentation linked below for more detailed information on the differences.  

You may log in to Hopper using "ssh <UserID>@hopper.orc.gmu.edu", where <UserID> is your GMU NetID; use your GMU campus password when prompted. Home, scratch, and project directories will be mounted in the same locations as on Argo. Let us know if there are any "groups" directories you need to access, or if there are specific software packages and versions you require that are not available. The partition/queue structure on Hopper is summarized in the table below:

Partition    | Time Limit (D-H:M) | Description                       | ARGO Equivalent
debug        | 0-01:00            | Intended for quick tests          |
interactive  | 0-12:00            | Interactive jobs (Open OnDemand)  |
normal       | 3-00:00            | Default partition                 | all-LoPri, all-HiPri, bigmem-HiPri, bigmem-LoPri, all-long, bigmem-long
contrib*     | 6-00:00            |                                   | CDS_q, COS_q, CS_q, EMH_q
gpuq         | 1-00:00            | GPU node access                   | gpuq

*NOTE: Being a contributor on Argo does not automatically grant access to the contrib partition on Hopper. All users may submit jobs to the contrib partition on Hopper; however, their jobs may be preempted and killed by a contributor's job at any time. We recommend that non-contributor users who submit to the contrib partition ensure their jobs use some form of checkpointing, as in the sketch below. Contact [email protected] if you need help implementing checkpointing in your jobs.
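As a starting point, the sketch below shows a contrib job script that asks SLURM to requeue the job if it is preempted and traps a warning signal so the job can write a checkpoint before it is stopped. The application name and checkpoint logic are placeholders; adapt them to your own workload.

#!/bin/bash
#SBATCH --partition=contrib
#SBATCH --time=2-00:00:00
#SBATCH --requeue                 # allow SLURM to requeue this job if it is preempted
#SBATCH --signal=B:USR1@120       # send SIGUSR1 to the batch script 2 minutes before the job is stopped

# Placeholder checkpoint handler: replace with your application's own checkpoint mechanism.
trap 'echo "Preemption warning received, writing checkpoint..."; touch checkpoint.flag' USR1

# Run the application in the background so the trap can fire, then wait for it to finish.
./my_application --resume checkpoint.flag &
wait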

Open OnDemand 

We would also like users to try our new Open OnDemand (OOD) server, which enables launching interactive apps, including RStudio, Jupyter Lab, MATLAB, and Mathematica, or a Linux graphical desktop, through a web interface. These interactive sessions can be used for up to 12 hours. From a web browser, log in to https://ondemand.orc.gmu.edu using your GMU username and password to access the OOD server. Please let us know of any problems you encounter, and any applications you would like to be able to use via Open OnDemand.

Documentation 

Please refer to the following links for current documentation on Hopper: 

If you have any questions about any aspect of the new Hopper cluster, please send an email to [email protected]. 

 

ARGO Scratch Filesystem Migration – Cluster Unavailable the Morning of 03/13/2021

There is a planned interruption to the availability of the Argo cluster.

On the morning of 03/13/2021, the scratch filesystem will be migrated to new hardware. The filesystem must be quiescent during the data transfer, so the entire ARGO cluster will be unavailable from 5:00AM for a few hours. All partitions are being drained. Jobs will start running again by the afternoon.

Maintenance Scheduled for ORC-Hosted Virtual Machines and Servers

A maintenance period has been scheduled for Tuesday, 1/19/2021 between 8:00AM and 11:00AM, during which all ORC VMs and hosted servers will be patched and rebooted. There will necessarily be short periods, generally no longer than 15 minutes, during which the systems will be unavailable. If this maintenance would disrupt your work, please let us know as soon as possible so that we can make the necessary adjustments.

The Argo Cluster will not be affected by this maintenance.

Christmas Break 2020 – Support

The University will break for the holidays on December 18th, 2020, and normal working hours will resume on January 4th, 2021. Account requests made after 12:00PM on December 18th will not be processed until January 4th. Staff will be monitoring email and help tickets and may be able to respond to urgent inquiries, but please be aware this will be on a best-effort basis. If you have concerns or questions, please send email to [email protected].