  • To use Hyak, you need to have an account in a group with access
    • UW students (me) can get one by being part of the RCC RSO
  • To get onto klone, you connect with ssh
    • You’ll be prompted for your password and for 2-factor authentication
    • Do not run strenuous tasks on the login node; stick to text editing or file management
  • Once on the login node, you might need an SSH key to access certain jobs:
    • Use ssh-keygen -C klone -t rsa -b 2048 -f ~/.ssh/id_rsa -q -N "" to create the key
    • View the docs on how to authorize the keys
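A minimal sketch of the authorize step, assuming the usual OpenSSH convention of appending the public key to an authorized_keys file. The helper name authorize_key is made up for illustration, and the Hyak docs may describe a different procedure.

```shell
# authorize_key PUBKEY_FILE SSH_DIR   (hypothetical helper, not a Hyak tool)
# Appends the public key to SSH_DIR/authorized_keys (only once) and tightens
# permissions; OpenSSH may ignore the file if its permissions are too open.
authorize_key() {
    pubkey_file=$1
    ssh_dir=$2
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -F fixed string, -x whole-line match, -f read the pattern from the key file
    if ! grep -q -x -F -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}

# After the ssh-keygen command from the notes, the call would be:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh
```

Running it twice leaves a single copy of the key in the file.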
  • Port forwarding:
    • Use a command like: ssh -L PORT:HOSTNAME:PORT
    • To get the hostname, use the aptly named hostname command
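To make the PORT:HOSTNAME:PORT pattern concrete, here is a small sketch that only prints the command to run on your own machine. The port, the node name n3000, and the login address are all placeholder assumptions, and forward_cmd is a made-up helper.

```shell
# forward_cmd PORT NODE LOGIN   (hypothetical helper)
# Prints the ssh -L command; nothing is executed.
# The local PORT is forwarded to the same PORT on NODE, via LOGIN.
forward_cmd() {
    printf 'ssh -L %s:%s:%s %s\n' "$1" "$2" "$1" "$3"
}

# Placeholder values: port 8888, compute node n3000, a klone login address.
forward_cmd 8888 n3000 netid@klone.hyak.uw.edu
# prints: ssh -L 8888:n3000:8888 netid@klone.hyak.uw.edu
```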
  • It is also possible to do X11 forwarding, see docs for details
  • Each cluster has physically separate storage, which is mounted onto each of its compute nodes
  • 3-2-1 Backup Policy:
    • 3 copies of your data
    • 2 different types of storage media
    • 1 copy off-site
  • Storage mounted on klone or mox is referred to as gscratch
    • This is because of the storage directory path, /gscratch/foldername/filename
  • Each user has a 10 GB home directory
    • Some users have dedicated lab storage as well
  • There are two storage quotas, block and inode
    • Block is the typical limit on GB stored
    • Inode is a limit on the number of files
  • The hyakstorage command on klone quickly shows utilization
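For a rough feel of what the two quotas measure, the same numbers can be approximated with standard tools. This is only a sketch: block_kib and count_inodes are made-up helper names, and hyakstorage remains the real source of truth on klone.

```shell
# block_kib DIR: disk space used under DIR in KiB (what the block quota counts)
block_kib() {
    du -sk "$1" | awk '{ print $1 }'
}

# count_inodes DIR: number of files and directories under DIR, including DIR
# itself (a rough stand-in for what the inode quota counts)
count_inodes() {
    find "$1" | wc -l | tr -d ' '
}

# Example on a throwaway directory holding three empty files
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"
echo "blocks: $(block_kib "$demo_dir") KiB, inodes: $(count_inodes "$demo_dir")"
rm -rf "$demo_dir"
```

Many tiny files (a conda environment, for example) use few blocks but many inodes, which is why the two limits are tracked separately.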
  • /gscratch/scrubbed is free and kinda unlimited, but files are deleted after 21 days
    • As everything on scrubbed is public, it is important to set proper file permissions
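One way to set those permissions, sketched with plain chmod. The helper name lock_down and the directory in the usage comment are made up; the right layout under /gscratch/scrubbed is whatever the docs recommend.

```shell
# lock_down DIR   (hypothetical helper)
# Removes all group and other permissions from DIR and everything inside it,
# leaving the owner's permissions unchanged.
lock_down() {
    chmod -R go-rwx "$1"
}

# Hypothetical usage on klone (directory name is made up):
#   lock_down /gscratch/scrubbed/mydir
# New files created in the current shell can be kept private by default with:
#   umask 077
```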
  • LOLO is UW’s tape data archive solution, not necessary for me but neat
  • There are some common datasets under /gscratch/data
  • To share resources among everyone, a scheduler is needed to run user processes, or “jobs”
    • This is done using SLURM, the “Simple Linux Utility for Resource Management”
    • SLURM is widely used, so general online documentation will often suffice
  • SLURM has two important concepts:
    • Accounts:
      • These are what you submit jobs under; hyakalloc shows the ones available to you
      • An account’s resources are whatever its group provides
    • Partitions:
      • Each partition is a class of node: there is standard compute, as well as GPU and high-memory nodes
      • sinfo gives all the possible partitions
  • Job Types:
    • Interactive:
      • These are live sessions where you work on a compute node directly
    • Batch:
      • These are unattended, typically one-off jobs which can email you when completed
    • Recurring:
      • These are cron-like jobs which recur
  • SLURM flags:
    • Account: --account
      • Which account you are part of (RCC for me); you can find it using groups
    • Partition: --partition
      • What partition do you want to use? sinfo gives the possible ones
    • Nodes: --nodes
      • How many nodes do you want? (Typically one, esp for me)
    • Cores: --cpus-per-task
      • How many cores do you need?
    • Memory: --mem
      • How much memory do I need?
      • Given in format size[units], units are M, G, or T
    • Time: --time
      • How long do I need the job for?
      • Format is: hours:minutes:seconds, days-hours, or minutes
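To see how the three --time formats listed above are read, here is a small sketch that converts each to seconds. The helper slurm_time_to_seconds is made up for illustration, and it deliberately handles only those three formats, not every form SLURM accepts.

```shell
# slurm_time_to_seconds TIME   (hypothetical helper)
# Handles only: hours:minutes:seconds, days-hours, and plain minutes.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        /-/     { print $1 * 86400 + $2 * 3600; next }     # days-hours
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }   # hours:minutes:seconds
        NF == 1 { print $1 * 60 }                          # minutes
    '
}

slurm_time_to_seconds 2:30:00   # hours:minutes:seconds, prints 9000
slurm_time_to_seconds 1-12      # days-hours, prints 129600
slurm_time_to_seconds 90        # minutes, prints 5400
```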
  • To start an interactive job use: salloc
    • This will dump you in an interactive shell
    • Example single-node interactive job: salloc -A mylab -p compute -N 1 -c 4 --mem=10G --time=2:30:00
  • Multi-node interactive jobs are more involved, see docs if necessary
  • If the group has an interactive node, you can use -p <partition_name>-int
    • You can check if you have one using hyakalloc
  • To submit batch jobs (on mox), you need to call sbatch on a <script_name>.slurm file
    • See docs for a template
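Until the official template is at hand, a batch script might look roughly like the sketch below. It reuses the flags from the list above; the job name, account, partition, email address, and output file name are all placeholders, so check them against hyakalloc, sinfo, and the docs template.

```shell
#!/bin/bash
# All values below are placeholders, not real accounts or addresses.
#SBATCH --job-name=demo
#SBATCH --account=mylab
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G
#SBATCH --time=2:30:00
# Email when the job ends (hypothetical address).
#SBATCH --mail-type=END
#SBATCH --mail-user=netid@uw.edu
# %j expands to the job ID in the output file name.
#SBATCH --output=demo_%j.out

# The actual work goes here; SLURM sets SLURM_JOB_ID inside a running job.
job_label="Job ${SLURM_JOB_ID:-none}"
echo "$job_label started on $(uname -n)"
```

Saved as demo.slurm (a made-up file name), it would be submitted with sbatch demo.slurm.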
  • Utilities:
    • sinfo to view (mox) partitions
      • Add -p <group_name> to see group partitions
    • squeue to view information about jobs in the queue
    • scancel cancels jobs, either by job ID (scancel <job_id>) or by NetID (scancel -u <NetID>)
    • sstat shows status information about a job
    • sacct displays info about completed jobs
    • sreport creates reports about previous usage
  • You have on-demand access to your group’s resources
  • You can request resources from the checkpoint partition, ckpt
    • These requests are filled from the cluster’s idle resources
    • This can even have GPUs!
    • Checkpoint jobs are stopped and re-queued every 4 hours
    • They might be stopped without any notice
    • This means that jobs should be able to stop and resume on demand
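A minimal sketch of the stop-and-resume idea: progress is written to a state file after every step, so a re-queued job picks up where the last run ended. The helper run_steps and its arguments are made up for illustration, and the third argument only simulates being stopped early.

```shell
# run_steps STATE_FILE TOTAL [BUDGET]   (hypothetical helper)
# Runs up to BUDGET steps (default: all), resuming from the step in STATE_FILE.
run_steps() {
    state_file=$1
    total=$2
    budget=${3:-$total}
    # Resume from the last completed step, or start at 0.
    if [ -f "$state_file" ]; then
        step=$(cat "$state_file")
    else
        step=0
    fi
    done_now=0
    while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        # Write to a temp file, then rename, so a kill never leaves a half-written state.
        printf '%s\n' "$step" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file"
        done_now=$((done_now + 1))
    done
    printf 'completed %s of %s steps\n' "$step" "$total"
}

# Simulate a job that is stopped after 4 steps and then re-queued.
demo_state=$(mktemp -d)/progress
run_steps "$demo_state" 10 4   # prints: completed 4 of 10 steps
run_steps "$demo_state" 10     # prints: completed 10 of 10 steps
```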
  • For Jupyter Notebooks, select a random port number between 4096 and 16384
    • Set the flag --ip
    • Make another ssh session and port-forward in
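Put together, the Jupyter steps might look like the sketch below, which only prints the two commands rather than running them. The value 0.0.0.0 for --ip and the <login_host> placeholder are assumptions, and pick_port is a made-up helper.

```shell
# pick_port: random port between 4096 and 16384, inclusive (hypothetical helper)
pick_port() {
    awk 'BEGIN { srand(); print 4096 + int(rand() * (16384 - 4096 + 1)) }'
}

port=$(pick_port)
# Name of the current node, as the hostname command would report it.
node=$(hostname 2>/dev/null || uname -n)

echo "On the compute node: jupyter notebook --no-browser --ip=0.0.0.0 --port=$port"
echo "On your own machine: ssh -L $port:$node:$port <login_host>"
```

A random port keeps two users on the same node from colliding on a default like 8888.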
  • We can view the available modules for software using module avail
    • This cannot be done from a login node
  • module commands:
    • module avail
    • module list
    • module load <software>
    • module unload <software>
    • module purge (unload all software)
  • These modules are from “Lmod” and “Environment Modules”
  • Apptainer is the preferred container platform for klone
    • Each container is a single file, which avoids the inode problems common with conda!
  • To create an Apptainer container:
    1. Start an interactive session
    2. Load the apptainer module: module load apptainer
    3. Create a definition file: see documentation
    4. Build the container from the definition file
    5. Run a command inside the container: apptainer exec <container> <command>
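Steps 3 to 5 could be sketched as below. The definition file contents (an Ubuntu base image with python3) and the file names are illustrative assumptions, not taken from the Hyak docs; the build and exec commands are left as comments because they need the apptainer module on a compute node.

```shell
# Step 3: write a small definition file into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/demo.def" <<'EOF'
Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get update && apt-get install -y python3

%runscript
    exec python3 "$@"
EOF
echo "wrote $workdir/demo.def"

# Step 4 (on a compute node, after module load apptainer):
#   apptainer build "$workdir/demo.sif" "$workdir/demo.def"
# Step 5:
#   apptainer exec "$workdir/demo.sif" python3 --version
```

The resulting .sif is the single file that counts as one inode.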
  • In practice, we can typically use pre-built containers
  • Common container app stores:
    • Cloud Library
    • Docker Hub
    • Nvidia GPU Cloud (NGC)
  • Modules can also be loaded using apptainer
  • From what I understand, venv is okay to use instead of (mini)conda