services:computing:hpc [2021/09/06 12:13] (current) Piero Calucci
====== High Performance Computing: Ulysses v2 ======

The **Ulysses** cluster v2 is available for scientific computation to all SISSA users. If you have an active SISSA account, please write to [[helpdesk-hpc@sissa.it]] in order to have it enabled on Ulysses.
  
===== Access =====
  
  
SSH access to Ulysses v2 is provided via the login nodes at ''frontend1.hpc.sissa.it'' or ''frontend2.hpc.sissa.it'', reachable from the SISSA network or through the SISSA [[:vpn|VPN]]. More access options might be made available in due time.
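For example, a connection from inside the SISSA network could look like this (''yourusername'' is a placeholder for your SISSA account name):

<code>
$ ssh yourusername@frontend2.hpc.sissa.it
</code>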
  
===== Hardware and Software =====
  
Available compute nodes include:
The software tree is the same as on Linux workstations, with the same [[services:modules|Lmod modules]] system (the only exception being desktop-oriented software packages).
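For instance, the usual Lmod commands work exactly as on the workstations (the module name below is illustrative):

<code>
$ module avail            # list the available packages
$ module load gcc         # load a package into your environment
$ module list             # show currently loaded modules
</code>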
  
A small number of POWER9-based nodes are also available (2 sockets, 16 cores, 4 threads per core; 256GB RAM), each with 2 or 4 Tesla V100 GPUs. Please note that x86 code cannot run on POWER9. For an interactive shell on a POWER9 machine, type ''p9login'' on frontend[12].

===== Queue System =====
  
The queue system is now SLURM ([[https://slurm.schedmd.com/documentation.html]]).
  * **''long1''** and **''long2''**: max 8 nodes, max 48h
  * **''gpu1''** and **''gpu2''**: max 4 nodes, max 12h
  * **''power9''**: max 2 nodes, max 24h
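For instance, a job staying within the ''long1'' limits above could be submitted as follows (the script name and node count are placeholders):

<code>
$ sbatch -p long1 -N 4 --time=48:00:00 myscript.sh
</code>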
  
<note tip>Please note that hyperthreading is enabled on all nodes (it was disabled on old Ulysses). If you **do not** want to use hyperthreading, the ''%%--hint=nomultithread%%'' option to srun/sbatch will help.
  
  
</note>
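As a sketch, a job script that disables hyperthreading on a ''regular1'' node (20 physical cores / 40 threads) might look like this (the program name is a placeholder):

<code>
#!/bin/bash
#SBATCH -p regular1
#SBATCH -N 1
#SBATCH --ntasks-per-node=20
#SBATCH --hint=nomultithread

srun ./my_program
</code>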
  
Job scheduling is fair share-based, so the scheduling priority of your jobs depends on the waiting time in the queue AND on the amount of resources consumed by your other jobs. If you urgently need to start a **single** job ASAP (e.g. for debugging), you can use the ''fastlane'' QoS, which will give your job a substantial priority boost (to prevent abuse, only one job per user can use fastlane at a time, and you will "pay" for the boost with a lower priority for your subsequent jobs).

You //should// always use the ''%%--mem%%'' or ''%%--mem-per-cpu%%'' slurm options to specify the amount of memory needed by your job. This is especially important if your job doesn't use all available CPUs on a node (40 threads on IBM nodes, 64 on HP); failing to do so will negatively impact scheduling performance.
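For example, either of these directives would reserve memory for the job (the amounts are illustrative):

<code>
#SBATCH --mem=10G           # 10GB for the whole job, per node
#SBATCH --mem-per-cpu=2G    # 2GB for each allocated CPU
</code>

Note that the two options are mutually exclusive: use one or the other, not both.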
 + 
<note tip>
While you can submit a job using only ''%%#SBATCH --ntasks=...%%'', it is recommended that you explicitly request a number of nodes and tasks per node (usually, all the tasks that can fit in a given node) for best performance. Otherwise, your job can end up "spread" over more nodes than necessary, sharing resources with other unrelated jobs on each node. E.g. on ''regular1'', ''%%-N2 -n80%%'' will allocate all threads on 2 nodes, while ''%%-n80%%'' can spread them over as many as 40 different nodes.
</note>
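On the command line, the two cases above look like this (the script name is a placeholder):

<code>
$ sbatch -p regular1 -N 2 -n 80 job.sh   # 80 tasks packed on exactly 2 nodes
$ sbatch -p regular1 -n 80 job.sh        # may be spread over up to 40 nodes
</code>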

===== Simplest possible job =====

This is a single-core job with default time and memory limits (1 hour and 0.5GB):
 + 
<code>
$ cat myscript.sh
#!/bin/bash
#
#SBATCH -N1
#SBATCH -n1

echo "Hello, World!"

$ sbatch -p regular1 myscript.sh
Submitted batch job 730384

$ cat slurm-730384.out
Hello, World!
</code>
 + 
<note warning>Please note that MPI jobs are only supported if they allocate all available cores/threads on each node (i.e. 20c/40t on *1 partitions and 32c/64t on *2 partitions). In this context, //not supported// means that jobs using fewer cores/threads than available may or may not work, depending on how the cores //not// allocated to your job are used.</note>
===== Filesystem Usage and Backup Policy =====
 + 
''/home'' and ''/scratch'' are both general-purpose filesystems: they are based on the same hardware and provide the same performance level. When you first log in on Ulysses, ''/home/$USER'' comes pre-populated with a small number of files that provide reasonable configuration defaults. At the same time, ''/scratch/$USER'' is created for you, with write permission.
 + 
Default quotas are 200GB on ''/home'' and 5TB on ''/scratch''; short-term users (e.g. accounts created for workshops, summer schools and other events, which usually expire in a matter of weeks) are usually given smaller quotas in agreement with the event organizers. Upon special, motivated request a larger quota can be granted: please write to [[helpdesk-hpc@sissa.it]], Cc: your supervisor (if applicable), with your request; please note that storage is a limited resource, and not every request can be granted.
 + 
<note tip> **Frequently Asked Question: what does "quota" mean, exactly?**
  * Quota is the traditional Unix word for an administrative limit imposed upon the usage of a certain resource. In this context, a "5TB quota" means that you will not be allowed to write more than that in a given filesystem, even if free space is available. It does not mean that those 5TB are reserved for you and are always available. At this time, filesystem quotas are enforced only on total used space; it is possible that in the future quotas will be enabled on the number of files too.
  * A ''quota'' command is available and will give you a summary of your filesystem usage.
  * Please note that filesystem space is allocated in blocks, so the actual space usage can differ from the apparent file size. Currently the smallest possible block allocation is 4kB, so when computing quota usage file sizes are rounded up to the next 4kB multiple. "''ls -s ...''" will report the actual allocated space for your files instead of (or alongside, depending on the command line) their apparent size.
</note>
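The block rounding described above can be observed on any Linux machine, for example (the file name is arbitrary):

<code>
$ echo -n "x" > tiny.txt    # a file with an apparent size of 1 byte
$ ls -l tiny.txt            # the size column shows the apparent size: 1
$ ls -s tiny.txt            # the leading number shows the allocated blocks instead
</code>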
 + 
Daily backups are taken of ''/home'', while no backup is available for ''/scratch''. If you need to recover a deleted or damaged file from a backup set, please write to [[helpdesk-hpc@sissa.it]]. Daily backups are kept for one week, a weekly backup is kept for one month, and monthly backups are kept for one year.
 + 
Due to their inherent volatility, some directories can be excluded from the backup set. At this time, the list of excluded directories includes only one item, namely ''/home/$USER/.cache''.
 + 
===== Reporting Issues =====
  
When reporting issues with Ulysses, please keep to the following guidelines:
  
  * write to [[helpdesk-hpc@sissa.it]], not to personal email addresses: this way your enquiry will be seen by more than one person
  * please use a clear and descriptive subject for your message: "missing library libwhatever.so.12 from package whatever-libs" is OK, "missing software" is less useful, "Ulysses issues" is definitely not useful
  * please open one ticket for each issue; **do not** reply to old, closed tickets about unrelated issues
  * if you have issues with submitting and running jobs, please:
    * include the **job script** and the exact submission **command line** you are using
    * include the **job IDs** of all jobs involved
    * if any modules are loaded before job submission, or if your ''.bashrc'' or ''.bash_profile'' includes anything but the system defaults, please say so clearly
    * please state clearly whether you encountered the issue only once, sporadically, or whether it can be reproduced (and how)
    * if files in your home or scratch directories are involved and you want us to look at them, please include the complete path and explicit permission for us to look into them
  * if you are asking for the installation of new software packages, please explain why a system-wide installation is needed
    * if you "only" want to make a piece of software you already use available to others, we can make room for you in ''/opt/contrib'' and help with the creation of a suitable module
    * we cannot install proprietary software unless a suitable license is provided