Differences
This shows you the differences between two versions of the page.
services:computing:hpc [2021/05/27 09:03] calucci [Simplest possible job] warning about MPI jobs |
services:computing:hpc [2022/03/08 15:01] calucci summary reports |
||
---|---|---|---|
Line 7: | Line 7: | ||
- | Ulysses v2 can be accessed via the login nodes at ''frontend1.hpc.sissa.it'' or ''frontend2.hpc.sissa.it'' from SISSA network or from SISSA [[:vpn|VPN]]. More access options might be made available in due time. | + | SSH access to Ulysses v2 is provided via the login nodes at ''frontend1.hpc.sissa.it'' or ''frontend2.hpc.sissa.it'' from the SISSA network or from the SISSA [[:vpn|VPN]]. More access options might be made available in due time. |
===== Hardware and Software ===== | ===== Hardware and Software ===== | ||
Line 22: | Line 22: | ||
The software tree is the same as the one you have on Linux workstations, with the same [[services:modules|Lmod modules]] system (with the sole exception of desktop-oriented software packages). | The software tree is the same as the one you have on Linux workstations, with the same [[services:modules|Lmod modules]] system (with the sole exception of desktop-oriented software packages). | ||
+ | |||
+ | A small number of POWER9-based nodes are also available (2 sockets, 16 cores, 4 threads per core; 256 GB RAM) with 2 or 4 Tesla V100 GPUs. Please note that you cannot run x86 code on POWER9. For an interactive shell on a P9 machine, please type ''p9login'' on frontend[12]. | ||
===== Queue System ===== | ===== Queue System ===== | ||
Line 34: | Line 36: | ||
* **''long1''** and **''long2''**: max 8 nodes, max 48h | * **''long1''** and **''long2''**: max 8 nodes, max 48h | ||
* **''gpu1''** and **''gpu2''**: max 4 nodes, max 12h | * **''gpu1''** and **''gpu2''**: max 4 nodes, max 12h | ||
+ | * **''power9''**: max 2 nodes, max 24h | ||
<note tip>Please note that hyperthreading is enabled on all nodes (it was disabled on the old Ulysses). If you **do not** want to use hyperthreading, the ''%%--hint=nomultithread%%'' option to srun/sbatch will help. | <note tip>Please note that hyperthreading is enabled on all nodes (it was disabled on the old Ulysses). If you **do not** want to use hyperthreading, the ''%%--hint=nomultithread%%'' option to srun/sbatch will help. | ||
Line 68: | Line 71: | ||
</code> | </code> | ||
- | <note warning>Please note that MPI jobs are only supported if they allocate all available core/threads on each node (so 20c/40t on *1 partitions and 32c/64t on *2 partition. In this context, //not supported// means that jobs using fewer cores/threads than available may or may not work, depending on how cores //not// allocated to your job are used.</note> | + | <note warning>Please note that MPI jobs are only supported if they allocate all available cores/threads on each node (so 20c/40t on *1 partitions and 32c/64t on *2 partitions). In this context, //not supported// means that jobs using fewer cores/threads than available may or may not work, depending on how cores //not// allocated to your job are used.</note> |
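As a rough illustration of the note above, the following sketch writes a batch script that requests every hardware thread on two nodes of a *1 partition (40 threads per node, as stated in the note). The file name ''mpi_job.sh'', the executable ''./my_mpi_program'', the time limit and the choice of ''long1'' are assumptions made for the example only; other ways of allocating full nodes may be equally valid on Ulysses.

```shell
# Values taken from the note above: *1 partitions expose 20 cores / 40 threads per node.
NODES=2
THREADS_PER_NODE=40
TOTAL_TASKS=$(( NODES * THREADS_PER_NODE ))

# Write the batch script; mpi_job.sh is a hypothetical file name.
cat > mpi_job.sh <<EOF
#!/bin/bash
#SBATCH --partition=long1
#SBATCH --nodes=${NODES}
#SBATCH --ntasks-per-node=${THREADS_PER_NODE}
#SBATCH --time=01:00:00

# ./my_mpi_program is a hypothetical MPI executable
srun ./my_mpi_program
EOF

echo "Requesting ${TOTAL_TASKS} MPI tasks in total"

# Submit from a login node (left commented out on purpose):
# sbatch mpi_job.sh
```

On a *2 partition the same sketch would use 64 threads per node instead of 40.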
===== Filesystem Usage and Backup Policy ===== | ===== Filesystem Usage and Backup Policy ===== | ||
Line 84: | Line 87: | ||
Due to their inherent volatility, some directories can be excluded from the backup set. At this time, the list of excluded directories includes only one item, namely ''/home/$USER/.cache'' | Due to their inherent volatility, some directories can be excluded from the backup set. At this time, the list of excluded directories includes only one item, namely ''/home/$USER/.cache'' | ||
+ | |||
+ | ===== Job E-Mail ===== | ||
+ | You can enable e-mail notifications at various stages of each job's life with the ''--mail-type=TYPE'' option, where ''TYPE'' can be a comma-separated list such as ''BEGIN,END,FAIL'' (more details are available in ''man sbatch''). By default, notifications are sent to your SISSA e-mail address, but you can select a different address with ''--mail-user''. The **end-job** notification includes a summary of consumed resources (CPU time and memory) as absolute values and as a percentage of requested resources. Please note that memory usage is sampled at 30-second intervals, so if your job is terminated by an out-of-memory condition arising from a very large failed allocation, the reported value can be grossly underestimated. | ||
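A minimal sketch of how these options might appear in a batch script is shown here. The file name ''mail_job.sh'', the job name, the resource requests and the address ''your.name@example.org'' are placeholders chosen for illustration; only ''--mail-type'' and ''--mail-user'' come from the text above.

```shell
# Write a minimal batch script; mail_job.sh and the address are hypothetical.
cat > mail_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mail-demo
#SBATCH --partition=long1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.name@example.org

echo "Running on $(hostname)"
EOF

# Submit from a login node (left commented out on purpose):
# sbatch mail_job.sh
```

If ''--mail-user'' is omitted, the notification goes to the default recipient described above.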
+ | ==== Energy Accounting ==== | ||
+ | An experimental energy accounting system has been enabled on Ulysses, and energy usage estimates are reported in the end-job notification. This is intended as a very rough estimate of the energy impact of your job, and is **not** accurate enough to be used for proper cost/energy/environmental accounting. Known limitations of the energy accounting system in use include: | ||
+ | * very small values are completely unreliable (and are not included at all in the end-job notification, so for very short or "mostly idle" jobs you will find no value at all) | ||
+ | * only CPU and memory energy usage are considered, while energy consumed by other devices (network cards, disk controllers, service processors, power supplies) is not accounted for; energy used "outside" the compute nodes is not considered either (this includes network devices, external storage, UPS, HVAC), so even for a CPU-intensive job the "real" energy consumption can easily be twice as much as reported | ||
+ | * on the other hand, //if your job doesn't use all available cores on each allocated node//, energy consumption can be overestimated | ||
+ | |||
+ | ===== Periodic Summary Reports from Slurm ===== | ||
+ | |||
+ | You can enable the generation of periodic reports on your cluster usage that will be delivered to your email address on a daily, weekly and/or monthly basis. | ||
+ | |||
+ | Each summary report includes the number of jobs that completed their lifecycle during the selected interval, along with the total amount of CPU*hours consumed and an estimate of total energy consumption; the number of jobs in each partition; and the final states of completed jobs (usually one of ''COMPLETED'', ''TIMEOUT'', ''CANCELLED'', ''FAILED'' or ''OUT_OF_MEMORY''). Optionally, a detailed listing of all jobs can be included as an attachment (this will be a zipped CSV file that can be further processed with your software of choice, but it is also human-readable). | ||
+ | |||
+ | To enable the reports with the default options (no daily report; weekly report with job details and monthly report delivered to your_username@sissa.it), just create an empty ''.slurm_report'' file in your home directory on Ulysses: | ||
+ | <code> | ||
+ | touch $HOME/.slurm_report | ||
+ | </code> | ||
+ | |||
+ | If you need to tune some parameters (e.g. enable daily reports, enable/disable job details, change mail delivery address), please copy the default configuration file to your home directory | ||
+ | <code> | ||
+ | cp /usr/local/etc/slurm_report.ini $HOME/.slurm_report | ||
+ | </code> | ||
+ | and edit the local copy. If your account has no "@sissa.it" email, it is recommended that you edit the ''mailto='' line. | ||
+ | |||
+ | ==== How to read the detailed report ==== | ||
+ | |||
+ | The detailed report, if requested, is attached as a Zip-compressed CSV file. You should be able to open / decompress it on any modern computing platform, and the CSV file is both human- and machine-readable. Timestamps are in ISO 8601 format with an implicit local time zone (''YYYY-MM-DDThh:mm:ss''), e.g. ''2022-03-04T09:30:00'' is "half past nine in the morning of March 4th, 2022". Four timestamps are provided for each job: **submit** (when the job was created with sbatch or similar commands), **eligible** (when the job becomes runnable, i.e. there are no conflicting conditions, like dependency on other jobs or exceeded user limits), **start** and **end** (when the job actually begins and ends execution). | ||
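Because the timestamps share one fixed format, the time a job spent waiting in the queue can be estimated as the difference between its **start** and **eligible** values. The sketch below assumes GNU ''date'' (as found on most Linux systems) and uses two made-up timestamps; it does not assume anything about the column layout of the CSV file.

```shell
# Hypothetical values copied by hand from the "eligible" and "start" fields of one job.
eligible="2022-03-04T09:30:00"
start="2022-03-04T09:45:30"

# GNU date parses ISO 8601 timestamps; with no zone given it assumes local time,
# which matches the implicit local time zone used in the report.
eligible_s=$(date -d "$eligible" +%s)
start_s=$(date -d "$start" +%s)

# Queue wait time in seconds.
wait_s=$(( start_s - eligible_s ))
echo "Job waited ${wait_s} seconds in the queue"
```

The same subtraction applied to **end** and **start** gives the wall-clock run time of the job.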
===== Reporting Issues ===== | ===== Reporting Issues ===== | ||
Line 89: | Line 121: | ||
When reporting issues with Ulysses, please keep to the following guidelines: | When reporting issues with Ulysses, please keep to the following guidelines: | ||
- | * write to [[helpdesk-hpc@sissa.it]], not to personal email addresses: this way your enquiry will be seen by more than one person | + | * write to [[helpdesk-hpc@sissa.it]], not to personal email addresses: this way your request will be seen by more than one person |
* please use a clear and descriptive subject for your message: "missing library libwhatever.so.12 from package whatever-libs" is OK, "missing software" is less useful, "Ulysses issues" is definitely not useful | * please use a clear and descriptive subject for your message: "missing library libwhatever.so.12 from package whatever-libs" is OK, "missing software" is less useful, "Ulysses issues" is definitely not useful | ||
* please open one ticket for each issue; **do not** reply to old, closed tickets for unrelated issues | * please open one ticket for each issue; **do not** reply to old, closed tickets for unrelated issues |