SSH access to Ulysses v2 is provided via the login nodes at ''frontend1.hpc.sissa.it'' or ''frontend2.hpc.sissa.it'', from the SISSA network or from the SISSA [[:vpn|VPN]].
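For instance, a minimal sketch of a login session (the username is a placeholder for your SISSA account):

<code bash>
# Log in to one of the Ulysses v2 login nodes, from the SISSA network or VPN
ssh your_username@frontend1.hpc.sissa.it
</code>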
===== Hardware and Software =====
The software tree is the same as on Linux workstations, with the same [[services:modules|Lmod modules]] system (the only exception being desktop-oriented software packages).
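As a quick sketch of typical Lmod usage on the login nodes (the module names below are only examples; check ''module avail'' for what is actually installed):

<code bash>
module avail             # list the software tree available on Ulysses
module load gcc openmpi  # load a compiler and an MPI stack (example module names)
module list              # show what is currently loaded
</code>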
<del>A small number of POWER9-based nodes are also available (2 sockets, 16 cores, 4 threads per core; 256GB RAM) with 2 or 4 Tesla V100. Please note that you cannot run x86 code on POWER9. For an interactive shell on a P9 machine, please type ''p9login'' on frontend[12].</del>
===== Queue System =====
  * **''regular1''** (old nodes) and **''regular2''** (new nodes): max 16 nodes, max 12h
  * **''wide1''** and **''wide2''**: max 32 nodes, max 8h, max 2 concurrently running jobs per user
  * **''long1''** and **''long2''**: max 8 nodes, max 48h, max 6 concurrently running jobs per user
  * **''gpu1''** and **''gpu2''**: max 4 nodes, max 12h
  * <del>**''power9''**: max 4 nodes, max 24h</del>
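To double-check the limits actually configured on the cluster you can query Slurm directly; the line below is just an illustrative sketch (per-user limits such as the maximum number of concurrently running jobs are enforced through QoS settings and may not appear in this output):

<code bash>
# Show partition name, time limit, node count and availability
sinfo -p regular1,regular2,wide1,wide2,long1,long2,gpu1,gpu2 -o "%P %l %D %a"
</code>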
<note tip>Please note that hyperthreading is enabled on all nodes (it was disabled on old Ulysses). If you **do not** want to use hyperthreading, the ''%%--hint=nomultithread%%'' option to srun/sbatch will help.
Job scheduling is fair-share based, so the scheduling priority of your jobs depends on the waiting time in the queue AND on the amount of resources consumed by your other jobs. If you urgently need to start a **single** job ASAP (e.g. for debugging), you can use the ''fastlane'' QoS, which gives your job a substantial priority boost (to prevent abuse, only one job per user can use fastlane at a time, and you will "pay" for the priority boost with a lower priority for your subsequent jobs).
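As a minimal sketch of how the boost can be requested (''job.sh'' is a placeholder for your own batch script):

<code bash>
# Submit a single urgent job with the fastlane QoS.
# Only one fastlane job per user may run at a time, and using it lowers
# the priority of your subsequent jobs.
sbatch --qos=fastlane job.sh
</code>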
You //should// always use the ''%%--mem%%'' slurm option to specify the amount of memory needed by your job; ''%%--mem-per-cpu%%'' is also possible, but not recommended due to the scheduler configuration. This is especially important if your job doesn't use all available CPUs on a node (40 threads on IBM nodes, 64 on HP), since failing to do so will negatively impact scheduling performance.
<note warning>Please note that ''%%--mem=0%%'' (i.e. "all available memory") is **not** recommended, since the amount of memory actually available on each node may vary (e.g. in case of hardware failures).</note>
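As an illustrative sketch of an explicit memory request (partition, time, task count and memory value are placeholders to adapt to your own job):

<code bash>
#!/bin/bash
#SBATCH --partition=regular2
#SBATCH --time=02:00:00
#SBATCH --ntasks=8
#SBATCH --mem=16G             # total memory per node; preferred over --mem-per-cpu here
#SBATCH --hint=nomultithread  # optional: one task per physical core, no hyperthreads

srun ./my_program             # my_program is a placeholder executable
</code>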
<note tip>
===== Simplest possible job =====
This is a single-core job with default time and memory limits (1 hour and 0.5GB)
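A minimal sketch of such a script (the executable name is a placeholder; the default limits apply because no ''%%--time%%'' or ''%%--mem%%'' is requested):

<code bash>
#!/bin/bash
#SBATCH --ntasks=1       # one core; default time (1h) and memory (0.5GB) limits apply

./my_serial_program      # placeholder for your executable
</code>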
<note warning>Please note that MPI jobs are only supported if they allocate all available cores/threads on each node (i.e. 20c/40t on *1 partitions and 32c/64t on *2 partitions). In this context, //not supported// means that jobs using fewer cores/threads than available may or may not work, depending on how the cores //not// allocated to your job are used.</note>
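For example, a hedged sketch of a supported full-node MPI job on a *1 partition (executable name and resource values are placeholders):

<code bash>
#!/bin/bash
#SBATCH --partition=regular1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20   # all 20 physical cores of each *1 node
#SBATCH --hint=nomultithread   # one MPI rank per physical core
#SBATCH --time=06:00:00
#SBATCH --mem=32G              # adjust to your needs

srun ./my_mpi_program          # placeholder MPI executable
</code>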
==== Access to hardware-based performance counters ====

Access to hardware-based performance counters is disabled by default for security reasons. It can be enabled on request, and only for node-exclusive jobs (i.e. for allocations where a single job is allowed to run on each node): use ''sbatch -C hwperf --exclusive ...''
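As a sketch of what such a job might look like once access has been granted (the availability of the ''perf'' tool on the compute nodes, the partition and the executable name are assumptions):

<code bash>
#!/bin/bash
#SBATCH --constraint=hwperf   # long form of -C hwperf
#SBATCH --exclusive           # counters are only enabled for node-exclusive jobs
#SBATCH --partition=regular2
#SBATCH --time=01:00:00

# Collect a few basic hardware counters; 'perf' is assumed to be installed on the node.
srun perf stat -e cycles,instructions,cache-misses ./my_program
</code>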
===== Using conda env for PyTorch with CUDA support =====

If you want to use Python AI libraries, chances are they will be distributed through the conda packaging system. To learn how to use conda environments on the Ulysses GPU nodes, please refer to the [[services:computing:hpc:conda|HPC conda]] page.
===== Filesystem Usage and Backup Policy =====
Daily backups are taken of ''/home'', while no backup is available for ''/scratch''. If you need to recover some deleted or damaged file from a backup set, please write to [[helpdesk-hpc@sissa.it]]. Daily backups are kept for one week, a weekly backup is kept for one month, and monthly backups are kept for one year.
Due to their inherent volatility, some directories can be excluded from the backup set. At this time, the list of excluded directories includes only ''/home/$USER/.cache'' and ''/home/$USER/.singularity/cache''.
===== Job E-Mail =====