Frequently Asked Questions

Argo cluster FAQs.

What is the Argo Cluster?

The Argo Cluster is a high-performance computing cluster operated by the Office of Research Computing. It is located in the Aquia Data Center on the Fairfax Campus.

How many nodes are in the cluster?

The cluster currently comprises 58 nodes (54 compute, 2 storage, and 2 head nodes) with a total of 1,060 compute cores and over 4 TB of RAM. Of the 54 compute nodes, 2 are FAT nodes with 512 GB of memory each and 2 are GPU nodes with K80 GPU cards.

What are the hardware specifications of the cluster?

37 compute nodes are Dell C8220s, each with dual 8-core Intel Xeon E5-2670 CPUs and 64 GB of RAM.

13 compute nodes are Dell FC430s, each with dual 10- or 12-core Intel Xeon E5-2660 CPUs and 96 GB to 128 GB of RAM.

2 compute nodes are large-memory Dell PowerEdge R812s, each with four 16-core Opteron 6276 CPUs and 512 GB of RAM.

2 GPU compute nodes have K80 GPU cards and 128 GB of RAM each.

An FDR InfiniBand network provides the interconnect and access to 130 TB of highly available NFS storage.

How do I access the cluster?

You SSH into the head node using the hostname “argo.orc.gmu.edu”. The ARGO cluster has two head nodes, argo-1 and argo-2, and users are logged into one of them in round-robin fashion to balance load. Use your GMU NetID and password to log into the cluster.
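
For example, from a terminal on Linux or macOS (replace <GMU-NetID> with your own NetID):

$ ssh <GMU-NetID>@argo.orc.gmu.edu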

What are modules?

The Argo cluster uses a system called Environment Modules to manage applications. Modules make sure that your environment variables are set up for the software you want to use. When you log in, the two modules “SGE” and “GCC” are loaded by default. SGE/Univa is the grid scheduler that manages job submission, deletion, etc. The main commands are:

  • “module avail” shows all the available modules.
  • “module list” shows the modules that you have loaded at the moment.
  • “module load name” or “module add name” adds the module “name” to your environment.
  • “module unload name” or “module rm name” removes the module “name” from your environment.
  • “module show name” or “module display name” gives a description of the module and also shows what it will do to your environment.

Typing “module” gives you a list of the available commands and arguments.
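
For example, to see what is available and then load and unload the MATLAB module mentioned later in this FAQ:

$ module avail
$ module load matlab/R2013a
$ module list
$ module unload matlab/R2013a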

Can I run jobs on the head node?

You can use the head node to develop, compile, and test a sample of your job before submitting it to the queue. Users cannot run computationally intensive jobs on the head nodes; such jobs will be killed without notice.

All jobs have to be submitted on the head node via the Slurm scheduler, which will schedule them to run on the compute nodes.

Can I log into individual nodes to submit jobs?

Users should not log into individual nodes to run jobs. Users have to submit jobs to the scheduler on the head node. Compute intensive jobs running on nodes that are not under scheduler control (i.e. directly started on the nodes) will be killed without notice.

Users can log into nodes on which their jobs, previously submitted via the scheduler, are currently running. This ability to SSH into individual nodes is only for checking on the job(s) running on that node. Please note that if users use this access to start new jobs on nodes without going through the scheduler, their ability to SSH into nodes to check on jobs will be removed.
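
For example, you might first check which node your job is running on and then connect to it (the node name is a placeholder; use the name shown in the NODELIST column of the squeue output):

$ squeue -u <userID>
$ ssh <node-name>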

Do you have a quota for each user?

Currently there are no quotas, but we will be assigning quotas to users later.

What are the partition (queue) names?
Partition Name   Nodes in Partition                                   Restricted Access
all-HiPri*       Node001-Node039, Node041-Node049, Node051-Node054    no
all-LoPri        Node001-Node039, Node041-Node045, Node051-Node054    no
bigmem-LoPri     Node034, Node035                                     no
bigmem-HiPri     Node034, Node035                                     no
COS_q            Node028-Node035                                      yes
CS_q             Node007-Node024                                      yes
CDS_q            Node046-Node050                                      yes
HH_q             Node025-Node027                                      yes
STATS_q          Node036, Node037                                     yes

*all-HiPri is the default partition (queue).

all-HiPri and bigmem-HiPri both have a run time limit of 12 hours; jobs exceeding the time limit will be killed. all-LoPri and bigmem-LoPri both have a 5-day run time limit. The partitions bigmem-LoPri and bigmem-HiPri are intended for jobs that require a lot of memory. Access to the queues marked as “restricted access” is limited to members of research groups and departments that have funded nodes in the cluster.
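
For example, to submit a memory-intensive job to the bigmem-LoPri partition instead of the default:

$ sbatch --partition=bigmem-LoPri <script file name>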

How do I submit jobs?

The command for submitting a batch job is:

$ sbatch <script file name> (The default partition is all-HiPri)

If the command is successful you will see the following:

Submitted batch job <job id number>

You can also specify options (see the next question for the available options) on the command line instead of, or in addition to, putting them in the script:

$ sbatch [options] <script file name>

What are options one can use with sbatch?

Some of the common options that one might use with sbatch are:

  • -J <name-of-job> – Use <name-of-job> instead of the default job name, which is the script file name.
  • -i /path/to/dir/inputfilename – Use “inputfilename” as the input file for this job.
  • -o /path/to/dir/outputfilename – Use “outputfilename” as the output file for this job.
  • -e /path/to/dir/errorfilename – Use “errorfilename” as the file name for errors encountered in this job.
  • -n <number> – The number of tasks to run; also specifies the number of slots needed.
  • --mem=<MB> – The total memory needed for the job; use if more than the default is needed.
  • --mail-user=GMU-NetID@gmu.edu – Send mail to your GMU email account.
  • --mail-type=BEGIN,END – Send email at the start and at the end of the job.

Read the man pages (man sbatch) for more sbatch options.
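
For example, a submission that combines several of these options on the command line might look like the following (the job name and file names are only placeholders):

$ sbatch -J my_test -o /path/to/dir/outputfilename -e /path/to/dir/errorfilename -n 4 --mem=4096 --mail-user=<GMU-NetID>@gmu.edu --mail-type=BEGIN,END <script file name>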

Where can I find examples of job scripts?

An example script is in the /cm/shared/apps/slurm/current/examples directory. Experiment with “sleeper.sh” to get started.
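
For example, you could copy the sample script into your home directory and submit it:

$ cp /cm/shared/apps/slurm/current/examples/sleeper.sh .
$ sbatch sleeper.sh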

How do I check status of jobs?

$ squeue -u <userID> (lists your queued and running jobs)

$ squeue (lists the jobs of all users)

Job Status:

  • “PD” – Job is pending (queued and waiting to be scheduled).
  • “S” – Job is suspended.
  • “R” – Job is running.
  • “CA” – Job was cancelled.
  • “F” – Job failed.


See man pages (man squeue) for more details.

How do I find out information about completed jobs?

$ sacct -j <job id number> – Use this command after the job has completed.

The <job id number> is what is returned when you successfully submit your job via the sbatch command.
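
For example, to see a summary of a completed job (the field list here is just one possible selection):

$ sacct -j <job id number> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS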


How do I delete jobs?

Use the following command to delete submitted jobs:
$ scancel <job id number>

Where can I find more information about Slurm scheduler?

You can read the man pages for the various commands. The Slurm documentation is available at http://slurm.schedmd.com. Please note that the documentation on that website is for the latest release. We are running Slurm version 15.08.6, so the man pages on the cluster will give you the correct descriptions.

Can we request multiple cores/slots?

Use the option “--ntasks=<number>” either on the command line with sbatch or in the script file to request <number> cores.

NOTE: If your job is inherently multi-threaded (e.g., Java jobs), then you need to use this option to specify the number of cores you want for the job, as shown in the example below.
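
For example, a minimal sketch of a script for a multi-threaded Java job requesting 8 cores (the heap size is only illustrative):

#SBATCH --ntasks=8
java -Xmx4g -jar test.jar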

I am new to Linux, do you have any tutorials?

We don’t have a tutorial as yet, but here is a good one to get started: Linux Tutorial

I use Windows, how do I log into the cluster?

You need to download and install Secure Shell for Windows from the GMU IT Services website: IT Downloads.

Here is a website that explains how to install and use Secure Shell: Windows Secure Shell Example

NOTE: For GMU connections, the Authentication method is “Password”.

You can use the Secure FTP application to upload and download files.

Do you have sample scripts?

The script file contains all the options needed for a specific job. Lines beginning with “##” are comments, while lines beginning with “#SBATCH” contain submit options. The first line is always “#!”, which specifies the shell that interprets the script.

#!/bin/bash
#
## Specify Job name if you want
## the short form -J
#SBATCH --job-name=My_Test_Job
##
## Specify a different working directory
## Default working directory is the directory from which you submit your job
## Short form -D
#SBATCH --workdir=/path/to/directory/name
##
## Specify output file name
## If you want output and error to be written to different files
## You will need to provide output and error file names
## short form -o
#SBATCH --output=slurm-output-%N-%j.out
## %N is the name of the node on which it ran
## %j is the job-id
## NOTE: this format has to be changed for array jobs:
## filename-%A-%a.out - where %A is the job ID and %a is the array index
##
## Specify error output file name
## short form -e
#SBATCH --error=slurm-error-%N-%j.out
##
## Specify input file
## short form -i
## #SBATCH --input=/path/to/dir/inputfilename
##
## Send email to your GMU email account
#SBATCH --mail-user=GMU-NetID@gmu.edu
## Email notification for the following types
#SBATCH --mail-type=BEGIN,FAIL,TIME_LIMIT_80
## Some valid types are: NONE,BEGIN,END,FAIL,REQUEUE
##
## Select partition to run this job
## Default partition is all-HiPri - run time limit is 12 hours
## short form -p
#SBATCH --partition=all-LoPri
##
## Quality of Service; Priority
## Contributor's queue needs QoS to be specified for jobs to run
## Everyone is part of the normal QoS, so it does not have to be specified
#SBATCH --qos=normal
##
## Ask for Intel machine using Feature parameter
## short form -C
## Intel Nodes - Proc16, Proc20, Proc24
## AMD nodes - Proc64
#SBATCH --constraint="Proc24"
##
## Ask for 1 node and the number of slots per node
## The number of slots per node can be 16, 20 or 24
## short form -N
#SBATCH --nodes=1
##
## Now ask for the number of slots per node
#SBATCH --ntasks-per-node=16
##
## MPI jobs
## If you need to start a 64-slot job, you can instead ask for 4 nodes with 16 slots each:
## #SBATCH --nodes=4
## #SBATCH --ntasks-per-node=16
##
## How much memory job needs specified in MB
## Default Memory is 2048MB
## Memory is specified per CPU
#SBATCH --mem-per-cpu=4096
##
## Load the needed modules
module load <module name(s)>
## ... any other setup commands ...
## Start the job
java -Xmx<memneeded> -jar test.jar

For MPI, MATLAB, and R jobs, please go to the ORC Wiki.

Which are the HADOOP nodes?

There are 18 nodes dedicated to Hadoop: compute nodes 7 to 24. The primary name node for Hadoop is Argo-1 and the secondary is Argo-2. HDFS is configured on the local disks of these nodes, giving an aggregate of ~15 TB of HDFS space.

Users who want to use Hadoop need a directory under the “/user” directory in the HDFS space. If you don’t have a directory in the HDFS space, please send email to “argo-admin@vse.gmu.edu” so we can create one for you.

The HDFS file system is mounted on the head nodes.  Users whose directory has been created in HDFS space should be able to copy files to and from their home directories using the “hadoop fs” command.  Type “hadoop fs -help” to see a list of available commands.
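
For example, assuming your HDFS directory has already been created, you could copy a file into HDFS, list it, and copy a result back (the file names are only placeholders):

$ hadoop fs -put mydata.txt /user/<username>/
$ hadoop fs -ls /user/<username>
$ hadoop fs -get /user/<username>/results.txt .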

Is R installed on the Cluster?

Yes, R is installed on the ARGO cluster.

To use the optimized version of R, load the module “R/3.0.2”. This module will also load the “OpenBLAS/gcc/64/0.2.8” module.

To use Rmpi, you need to load the module “openmpi/gcc/64/1.6.5”.
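
For example, a minimal R job script might load the modules and then run an R script (the job name and myscript.R are only placeholders):

#!/bin/bash
#SBATCH --job-name=R_Test
module load R/3.0.2
module load openmpi/gcc/64/1.6.5
Rscript myscript.R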


Is Matlab installed on the cluster?

MATLAB is installed on the head node, and it should be used *only* for compiling MATLAB code into executables, not for running jobs. You need to load the module “matlab/R2013a”.
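
For example, assuming the MATLAB Compiler (mcc) is available, you could compile a program into a standalone executable on the head node (myprogram.m is a placeholder; see the ORC Wiki for the full workflow):

$ module load matlab/R2013a
$ mcc -m myprogram.m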

See ORC Wiki for more details.

I create my script files on Windows, is there anything I need to do before I use these files?

If you create script files on your Windows machine and then copy them over to ARGO cluster, make sure that you run the following command on those files before you use them:

$ dos2unix "/path/to/filename"

Windows-based editors use special characters to denote a line return or newline. The “dos2unix” command strips these special characters and converts the file to UNIX format.

How does one get an account on the ARGO cluster?

Faculty can get an account on the ARGO cluster. Students need to be sponsored by a GMU faculty member.

  • Faculty: to request an account, send email to argo-admin@vse.gmu.edu and specify your GMU NetID.
  • Students: please ask your sponsoring faculty member to request an account on your behalf, providing your name and GMU NetID to the email address given above.

How to run a Hadoop job?

You should use the “hadoop.q” queue and the Hadoop parallel environment to run your Hadoop job.

Here is a sample script to run a hadoop example (substitute your username, and names of directories):

#!/bin/bash
# Use the current directory to start the job
#$ -cwd
# Specify the Job class
#$ -jc hadoop
# Use the hadoop parallel environment
#$ -pe hadoop 4
# Use the hadoop queue
#$ -q hadoop.q
# Name of your job
#$ -N HadoopTest
# Join the output and error files into one file
#$ -j y
# Include these lines if you are rerunning your job and not creating new directories.
# Hadoop won't run if the output directories already exist, so list all the
# directories to be deleted with this command:
hadoop fs -rmr /user/<username>/directories-to-be-removed
# Now run the Hadoop example
hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar teragen 32400000 /user/<username>/<input-directory-name>
hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar terasort /user/<username>/<input-directory-name> /user/<username>/<output-directory-name>
hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar teravalidate /user/<username>/<output-directory-name> /user/<username>/<validate-directory-name>
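
Note that the “#$” directives above are Grid Engine options; assuming the Hadoop queue is still handled by the SGE/Univa scheduler mentioned earlier, a script like this would be submitted with:

$ qsub <script file name>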