HPE Performance Cluster Manager 1.11.0 Release Notes
==============================================================================

Copyright (c) 2018-2024 Hewlett Packard Enterprise Development LP.
All rights reserved.

Notices
------------------------------------------------------------------------------
The information contained herein is subject to change without notice. The only
warranties for Hewlett Packard Enterprise products and services are set forth
in the express warranty statements accompanying such products and services.
Nothing herein should be construed as constituting an additional warranty.
Hewlett Packard Enterprise shall not be liable for technical or editorial
errors or omissions contained herein.

Confidential computer software. Valid license from Hewlett Packard Enterprise
required for possession, use, or copying. Consistent with FAR 12.211 and
12.212, Commercial Computer Software, Computer Software Documentation, and
Technical Data for Commercial Items are licensed to the U.S. Government under
vendor's standard commercial license.

Links to third-party websites take you outside the Hewlett Packard Enterprise
website. Hewlett Packard Enterprise has no control over and is not responsible
for information outside the Hewlett Packard Enterprise website.

Acknowledgments
------------------------------------------------------------------------------
Microsoft(R) and Windows(R) are either registered trademarks or trademarks of
Microsoft Corporation in the United States and/or other countries.

Java(R) and Oracle(R) are registered trademarks of Oracle and/or its
affiliates.

Linux(R) is the registered trademark of Linus Torvalds in the U.S. and other
countries.

Red Hat(R) and RPM(R) are trademarks of Red Hat, Inc. in the United States and
other countries.

ARM(R) is a registered trademark of ARM Limited in the United States and other
countries.

SUSE(R) is a registered trademark of SUSE LLC in the United States and other
countries.

Ubuntu(R) is a registered trademark of Canonical Ltd.

Altair(R) and Altair PBS Professional(R) are registered trademarks of Altair
Engineering, Inc.
****************************************************************************** Contents ****************************************************************************** 1.0 Overview 2.0 Getting Started 2.1 Distribution Media and Software Documentation 2.2 Operating System Support 2.3 Hardware Requirements 2.4 Software Licensing Information 2.5 Electronic Software Delivery 2.6 Warranty 2.7 HPE Software Support 3.0 New Features and Improvements 3.1 New Operating System Support 3.2 Cluster Health Check Additions and Enhancements 3.3 Cluster Discovery, Networking and Configuration Improvements 3.4 General Additions and Enhancements 3.5 Monitoring Improvements 3.6 Scalable Unit (SU) Leader Nodes Improvements 3.7 Quorum High Availability Setup Improvements 3.8 Power Consumption Management Improvements 3.9 Documentation Updates 3.10 Deprecated Features 3.11 Future Deprecations 4.0 Known Issues and Workarounds 4.1 Upgrade 4.2 Installation 4.3 Networking 4.4 High Availability 4.5 SU-Leader 4.6 System Config and Discovery 4.7 Monitoring 4.8 Command Line Interface (CLI) 4.9 Graphical User Interface (GUI) 4.10 Diags and Firmware 4.11 Miscellaneous 4.12 Ubuntu 5.0 Feedback 6.0 Appendix 6.1 Notes on Using Unsupported or Unmanaged Network Switches with HPCM 6.2 Supported Power Distribution Units (PDUs) 6.3 HPE Power and Cooling Infrastructure Monitor (PCIM) Supported Devices 6.4 HPCM Update Repository Guide 6.5 List of CASTs Addressed in HPCM 1.11.0 6.6 List of Issues Addressed in HPCM 1.11.0 ****************************************************************************** 1.0 Overview ****************************************************************************** HPE Performance Cluster Manager delivers an integrated system management solution for Linux(R)-based high performance computing (HPC) clusters. HPE Performance Cluster Manager provides complete provisioning, management, and monitoring for clusters scaling to 100,000 nodes. The software enables fast system setup from bare-metal, comprehensive hardware monitoring and management, image management, software updates and power management. HPE Performance Cluster Manager reduces the time and resources spent administering HPC systems - lowering total cost of ownership, increasing productivity and providing a better return on hardware investments. Initial system setup involves installation of software including the Linux operating system on the administrative node, discovery of hardware components for the cluster nodes and operating system provisioning for all the compute and service nodes in the cluster. HPE Performance Cluster Manager can quickly provision a cluster with thousands of nodes from bare metal - typically within an hour. In addition, new cluster nodes being added to the existing cluster are automatically discovered and configured without requiring system shutdown. Hardware management is comprehensive and secure. HPE Performance Cluster Manager collects telemetry from the cluster nodes and stores them in a secure repository. System administrator tasks on the administrative nodes are kept secure from end-user access. When issues are detected, alerts are sent to the attention of the system administrator via the console (GUI, CLI) and by email. The system administrator can setup automatic reactions to specific alerts such as power capping when a specific temperature is reached in the data center. Additional analyses of the hardware metrics can be done by visualizing the metrics at a specific point in time or over a historical period. 

The installed software, including the BIOS on the cluster nodes, can be
compared and flagged for any inconsistencies with versions or missing items.
Integrated firmware flashing supports flashing of BIOS, BMC/iLO, CMC, network
adapters and switches.

The HPE Performance Cluster Manager image management system supports a secure
software image repository that stores software in multiple formats including
RPM, ISO, remote repository and gold image. Software stored in the image
repository can be multiple versions of the Linux operating system or other
software such as middleware and other applications. Each software image has
version control accountability built-in to track changes such as the software
image version, who made the change and the date of the last change. Any
software image in the repository can be installed on-demand on a cluster node
or set of cluster nodes and restored to the original software environment as
required.

For power management, HPE Performance Cluster Manager offers tools for
accurate measurement and prediction of power usage for better capacity
planning. The step-by-step, topology- and protocol-aware Power On/Off feature
enables controlled start and shutdown of the clustered system. For example,
the power-on order is rack, chassis, cluster node, and the power-off order is
cluster node, chassis, rack. Power telemetry is collected in watts for rack
AC, bulk DC, cluster nodes and liquid cooling infrastructure. The metrics can
be saved for analysis and historical reference. In addition, for the HPE SGI
8600 system, HPE Performance Cluster Manager supports advanced power
management features for power capping and power resource management for jobs
with the Altair PBS Professional Power Awareness feature. HPE Apollo systems
require Apollo Platform Manager (purchased separately) for power capping and
rack management.

HPE Performance Cluster Manager provides a comprehensive cluster management
environment providing resiliency, security, operational efficiency and scale
for HPE Apollo, HPE Cray, HPE Cray EX, HPE Cray XD, SGI and HPE ProLiant high
performance computing clusters.

******************************************************************************
2.0 Getting Started
******************************************************************************

2.1 Distribution Media and Software Documentation
------------------------------------------------------------------------------
The HPE Performance Cluster Manager software and documentation are available
as an electronic software download and on physical media. Order the HPE
Performance Cluster Manager Media SKU (Q9V62A) to obtain the physical DVD
media.

To download the software, visit the "My HPE Software Center" site at:

  http://www.hpe.com/downloads/software

If you have not associated your Support Agreement ID (SAID) with your HPE
Account, you may need to enter your SAID in order to search for the HPE
Performance Cluster Manager software. Customers may download the software and
corresponding documentation from the specified URL provided at time of
delivery.

Individual product media files or ISO files are described in the "ISO File
Descriptions" section on the "Release Notes" tab on the HPCM 1.11 product
release page on the HPE Support Center.
Additional documentation can be downloaded from www.hpe.com/software/hpcm Patches for the HPE Performance Cluster Manager product are published to HPE's Software Delivery Repository at the following location: https://update1.linux.hpe.com/repo/hpcm/ To subscribe to patches, follow the instructions on the project page: https://downloads.linux.hpe.com/SDR/project/hpcm/ In order to see updates on the Software Delivery Repository, you will need to have an HPE Account linked with any applicable HPE service agreements. See the section "6.4 HPCM Update Repository Guide" in the Appendix for more details. 2.2 Operating System Support ------------------------------------------------------------------------------ HPE Performance Cluster Manager supports the SUSE Linux Enterprise Server (SLES), Red Hat Enterprise Linux (RHEL), HPE Cray Operating System (COS), Rocky Linux, and Tri-Lab Operating System Stack (TOSS), and CentOS Linux releases noted below. HPCM can manage clusters in which all nodes run the same operating system release, a multi-distro cluster in which compute nodes run a different operating system release than the system management nodes (i.e., admin and leader), or a multi-distro cluster in which compute nodes run a variety of different operating system releases. Review the following details to see which specific operating system releases are tested and supported on the various node types and architectures: - x86_64 o admin and leader: RHEL/Rocky8.9, SLES15SP5 o compute/service : RHEL/Rocky8.8, RHEL/Rocky8.9, RHEL/Rocky9.2, RHEL/Rocky9.3 SLES15SP4, SLES15SP5, COS 2.4, COS 2.5, COS 23.11 TOSS 4.6, TOSS 4.7 Ubuntu 22.04.3 - aarch64 o compute/service : RHEL8.9, RHEL9.3 SLES15SP5, COS 23.11 TOSS 4.7 Additional Notes: [1] Aarch64 support was re-introduced in the HPCM 1.10 release. Refer to system documentation for supported operating systems. [2] HPE performed the majority of testing and validation with Mellanox OFED versions 23.10-1.1.9.0 (on latest chips) and 4.9-4.0.8.0 (on legacy chips), OPA 10.11.0.1.2 on supported distros. [3] HPE tested and validated HPCM 1.11 with HPE Cray Programming Environment 24.03 and Slingshot 2.2. [4] CentOS 8.x is no longer supported as the community has moved to the CentOS stream model. HPE suggests using Rocky Linux 8.x as an alternative to the previous CentOS 8.x releases. [5] HPCM 1.8 was the last release to support SLES12 and RHEL/CentOS7 releases on compute nodes. 2.3 Hardware Requirements ------------------------------------------------------------------------------ HPE Performance Cluster Manager software is supported on the following Gen9, Gen10, Gen10+ and Gen11 platforms: - SGI 8600 - HPE Apollo 2000, 4000, 6000, 6500 and 9000 systems - HPE Apollo 20 (including CLX-AP) and 40 systems - HPE ProLiant DL 325 / 345 / 360 / 380 / 385 / 580 servers - HPE Apollo 70 system - HPE Apollo 80 system - HPE Apollo 35 server - HPE Cray XD2000, XD6500 systems o XD220v, XD224, XD225v, XD295v o XD665, XD670 - HPE Cray EX Supercomputers o HPE Cray EX235a, EX235n, EX255a, EX420, EX425, EX4252 o HPE Cray EX2500 (chassis, compute blades, switch chassis, and CDU) o HPE Cray EX3000 (chassis, compute blades, switch chassis, and CDU) o HPE Cray EX4000 (chassis, compute blades, switch chassis, and CDU) - Superdome Flex Family 2.4 Software Licensing Information ------------------------------------------------------------------------------ For the Software to be valid on an HPE cluster, each server in the HPE cluster must have a valid HPE Performance Cluster Manager license. 
Subject to the terms and conditions of this Agreement and the payment of any applicable license fee, HPE grants a non-exclusive, non-transferable license to use (as defined below), in object code form, one copy of the Software on one device (server or node) at a time for internal business purposes, unless otherwise indicated above or in applicable Transaction Document(s). "Use" means to install, store, load, execute and display the Software in accordance with the Specifications. Use of the Software is subject to these license terms and to the other restrictions specified by Hewlett Packard Enterprise in any other tangible or electronic documentation delivered or otherwise made available with or at the time of purchase of the Software, including license terms, warranty statements, Specifications, and "readme" or other informational files included in the Software itself. Such restrictions are hereby incorporated in this Agreement by reference. Some Software may require license keys or contain other technical protection measures. HPE reserves the right to monitor compliance with Use restrictions remotely or otherwise. Hewlett Packard Enterprise may make a license management program available which records and reports license usage information, If so supplied, customer agrees to install and run such license management program beginning no later than one hundred and eighty (180) days from the date it is made available and continuing for the period that the Software is Used. Other terms of the HPE Software License are provided on the license agreement that is delivered with the HPE Performance Cluster Manager software. 2.5 Electronic Software Delivery ------------------------------------------------------------------------------ Electronic software is available. Hewlett Packard Enterprise recommends purchasing electronic products over physical products when available for faster delivery and the convenience of not having to manage confidential paper licenses. 2.6 Warranty ------------------------------------------------------------------------------ Hewlett Packard Enterprise will replace defective delivery media for a period of 90 days from the date of purchase. This warranty applies to all HPCM products found on the delivery media. 2.7 HPE Software Support ------------------------------------------------------------------------------ HPE Services leverages our breadth and depth of technical expertise and innovation to help accelerate digital transformation with Advisory, Professional, and Operational Services. There is a full range of services to complement HPE Performance Cluster Manager software from advisory and design, benchmarking and tuning services, factory pre-installation, configuration, and acceptance as well as training and operational services. Advisory Services includes design, strategy, road map, and other services to help enable the digital transformation journey, tuned to IT and business needs. Advisory Services helps customers on their journey to Hybrid IT, Big Data, and the Intelligent Edge. Professional Services helps integrate the new solution with project management, installation and startup, relocation services, and more. In addition, Factory Express installs the software in the factory when building the system. HPE Education Services helps train staff using and managing the software and other technology. We help mitigate risk to the business, so there is no interruption when new technology is being integrated into the existing IT environment. 
Operational Services: - HPE Flexible Capacity is a new consumption model to manage on-demand capacity, combining the agility and economics of public cloud with the security and performance of on-premises IT. - HPE Datacenter Care offers a tailored operational support solution built on core deliverables. It includes hardware and software support, a team of experts to help personalize deliverables and share best practices, as well as optional building blocks to address specific IT and business needs. HPE Datacenter Care for Hyperscale gives customers access to the Hyperscale Center of Excellence with technical experts who understand how to manage IT at scale including the software. - HPE Proactive Care is an integrated set of hardware and software support including an enhanced call experience with start to finish case management helping resolve incidents quickly and keeping IT reliable and stable. - HPE Foundation Care helps when there is a hardware or software problem offering several response levels dependent on IT and business requirements. HPE Software Support offers a number of additional software support services, many of which are provided to our customers at no additional charge. HPE Performance Cluster Manager Software Technical Support and Update Service ----------------------------------------------------------------------------- Software products include three years of 24 x 7 HPE Software Technical Support and Update Service. This service provides access to Hewlett Packard Enterprise technical resources for assistance in resolving software implementation or operations problems. The service also provides access to software updates and reference manuals in electronic form. - To download product update releases: My HPE Software Center: www.hpe.com/downloads/software - To learn more about accessing support materials: HPE Support Center: www.hpe.com/support/AccessToSupportMaterials - To subscribe to eNewsletters and alerts: HPE Email Preference Center: www.hpe.com/support/e-updates IMPORTANT: Access to some online resources requires product entitlement. You must have an HPE Account setup with relevant support agreement IDs and product entitlements. Your HPE Account is the new identity and access management infrastructure service for HPE's customers and partners; it replaces the HPE Passport. Registration for Software and Technical Support and Update Services ------------------------------------------------------------------------------ If you received a license entitlement certificate, registration for this service will take place following online redemption of the license certificate/key. How to Use Your Software Technical Support and Update Service ------------------------------------------------------------------------------ Once registered, you will receive a service contract in the mail containing the Customer Service phone number and your Service Agreement Identifier (SAID). You will need your SAID when calling for technical support. Using your SAID, you can also go to the HPE Support Center web page to view your contract online. 

Sign Up for Product Alerts
------------------------------------------------------------------------------
To set up product alerts, follow the steps outlined below:

1) Log in with your HPE Account to the HPE Support Center (support.hpe.com)
2) Hover the mouse over the menu icon (3 horizontal lines) next to
   Support Center
3) Hover the mouse over Products and select "Sign up for Product Alerts"
4) On the page titled "Get connected with updates from HPE", enter the
   required information and search for "HPE Performance Cluster Manager" in
   Step 1 of the "Products" section
5) Select both "HPE Performance Cluster Manager" and "HPE Performance Cluster
   Manager Licenses" in Step 3 of the "Products" section, and click the
   "Add selected products" button
6) Click on the large "Subscribe" button

Contacting HPE regarding HPE Performance Cluster Manager
------------------------------------------------------------------------------
Hewlett Packard Enterprise addresses cluster manager questions at the
asset-solution level or at the serial-number level, as follows:

- Hewlett Packard Enterprise provides solution-level services to HPE products
  that are designated with a Base System Code. Examples include the following:

  o HPE Cray EX systems
  o HPE Cray XD 6500 systems
  o Other configurations that include HPE Slingshot networking

  Use the asset solution serial number to open cluster manager service
  requests. This is the single serial number typically used to open technical
  service cases for any software, hardware, interconnect (networking), or
  cooling question.

- Serial-number services originate at the individual server serial number
  level. Typically, this is the serial number of the cluster admin node.

Join the Conversation
------------------------------------------------------------------------------
The HPE Community forum is a community-based, user-supported tool for Hewlett
Packard Enterprise customers to participate in discussions with the customer
community about Hewlett Packard Enterprise products (community.hpe.com).
Websites ------------------------------------------------------------------------------ +------------------------------------------------------------------------+ | Website | Link | |---------------------------+--------------------------------------------| | HPE Performance Cluster | www.hpe.com/software/hpcm | | Manager | | |---------------------------+--------------------------------------------| | My HPE Software Center | www.hpe.com/downloads/software | |---------------------------+--------------------------------------------| | Hewlett Packard | www.hpe.com/support/hpesc | | Enterprise Support Center | | |---------------------------+--------------------------------------------| | Contact Hewlett Packard | www.hpe.com/assistance | | Enterprise Worldwide | | |---------------------------+--------------------------------------------| | HPE Services | www.hpe.com/services | |---------------------------+--------------------------------------------| | Subscription | www.hpe.com/support/e-updates | | Service/Support Alerts | | |---------------------------+--------------------------------------------| | HPE Performance Cluster | downloads.linux.hpe.com/SDR/project/hpcm/ | | Manager SDR Information | | +------------------------------------------------------------------------+ ****************************************************************************** 3.0 Features and Improvements ****************************************************************************** The following sections highlight some of the features of the HPE Performance Cluster Manager product. Due to differences in platform hardware and firmware, some HPCM features are not available on every supported platform. Please note the exceptions noted in the following table: +-------------------------------------------------------------------------------------+ | Platform | Image Mgmt & | Monitoring | Mgmt Network | BIOS | BMC/iLO | | | Provisioning | | Switch Mgmt | Flashing | Flashing | |--------------------+--------------+------------+--------------+----------+----------| | SGI 8600 | Yes | Yes | Yes | Yes [1] | Yes | |--------------------+--------------+------------+--------------+----------+----------| | Proliant DL325/345 | Yes | Yes | Yes | Yes | Yes [2] | | 360/380/385/580 | | | | | | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 2000 Nodes | Yes | Yes | Yes | Yes | Yes [2] | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 4000 Nodes | Yes | Yes | Yes | Yes | Yes [2] | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 6000 Nodes | Yes | Yes | Yes | Yes | Yes [2] | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 6500 Nodes | Yes | Yes | Yes | Yes | Yes [2] | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 9000 Nodes | Yes | Yes | Yes | Yes | Yes | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 20 (kl20) | Yes | Yes | Yes | Yes | Yes | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 20 (CLX-AP) | Yes | Yes | Yes | Yes | Yes | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 20 (kl20) | Yes | Yes | Yes | Yes | Yes | |--------------------+--------------+------------+--------------+----------+----------| | Apollo 40 (sx40) | Yes | Yes | Yes | Yes | 
Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Apollo 40 (pc40)   | Yes          | Yes        | Yes          | Yes      | Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Apollo 35          | Yes          | Yes        | Yes          | No       | Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX2500        | N/A          | Yes        | Yes          | Yes (cC) | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX3000        | N/A          | Yes        | Yes          | Yes (cC) | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX4000        | N/A          | Yes        | Yes          | Yes (cC) | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX235a        | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX235n        | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX255a        | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX420         | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX425         | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray EX4252        | Yes          | Yes        | Yes          | Yes      | Yes (nC) |
|--------------------+--------------+------------+--------------+----------+----------|
| Superdome Flex 280 | Yes          | Yes        | Yes          | No       | No       |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray XD224         | Yes          | No [4]     | Yes          | No [4]   | No [4]   |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray XD6500/X665   | Yes          | Yes        | Yes          | Yes      | Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray XD2000/X225v  | Yes          | Yes        | Yes          | Yes      | Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray XD2000/X295v  | Yes          | Yes        | Yes          | Yes      | Yes      |
|--------------------+--------------+------------+--------------+----------+----------|
| Cray XD2000/X220v  | Yes          | Yes        | Yes          | Yes      | Yes      |
+-------------------------------------------------------------------------------------+

[1] On the SGI 8600, HCAs can also be flashed.
[2] The Service Pack for ProLiant (SPP) may be used to flash the firmware on
    the HPE ProLiant DL and Apollo systems.
[3] Power related information has been moved from this table into the HPE
    Performance Cluster Manager Power Consumption Guide.
[4] Monitoring and firmware flashing of the XD224 will be addressed in a patch
    to HPCM 1.11 when the final platform firmware is available.

The HPE Performance Cluster Manager 1.11 release includes several improvements
in the following areas:

- Improved support for the HPE Cray XD6500 XD670 platforms (HPCM-2957)
  o Added power and BIOS setting support (HPCM-5555)
  o Added firmware flashing support (HPCM-5442)
  o Added monitoring support (HPCM-5556)

- Partial support* for HPE Cray XD224 platform (HPCM-5151)

  * Able to discover, PXE boot, provision and do basic power management.

3.1 New Operating System Support
------------------------------------------------------------------------------
Operating system support has been updated to include the following:

  - Red Hat Enterprise Linux 8.9 (admin/leader/compute/service nodes)
  - Rocky Linux 8.9 (admin/leader/compute/service nodes)
  - Red Hat Enterprise Linux 9.3 (compute/service nodes)
  - Rocky Linux 9.3 (compute/service nodes)
  - HPE Cray Operating System 23.11 (compute/service nodes)
  - Tri-Lab Operating System Stack (TOSS) 4.7
  - Ubuntu 22.04.3 (compute/service nodes)

For the complete list of supported operating systems, see the notes in section
"2.2 Operating System Support" above.

HPCM support for Ubuntu on any given compute platform is contingent on the
platform itself being supported on Ubuntu (see the HPE Servers Support & OS
Certification Matrix for Ubuntu for details):

  https://techlibrary.hpe.com/us/en/enterprise/servers/supportmatrix/ubuntu.aspx

3.2 Cluster Health Check and HW Triage Toolkit Additions and Enhancements
------------------------------------------------------------------------------
This release includes several changes that improve platform support in both
the diagnostics and the hardware triage toolkit (HTT), as well as fixes to
cluster health check, including the following:

- Adds support for the EX235n, EX254n, EX255a, EX425, and EX4252 platforms to
  HTT (HPCM-5987)
- Includes a statically compiled gpu_sizzle in the diags (HPCM-5495)
- AMD GPU diagnostics are now based on ROCm 6.0 (HPCM-6189)
- Adds several new and/or updated GPU-based diagnostics (HPCM-2427, HPCM-2428,
  HPCM-5112, HPCM-5141)

3.3 Cluster Discovery, Networking and Configuration Improvements
------------------------------------------------------------------------------
This release includes several changes related to discovery and system
configuration designed to aid in system setup, including the following:

- New Hardware Support (HPCM-2957, HPCM-5151)

  This release improves support for the HPE Cray XD6500 XD670 platform and
  introduces basic support for the XD224 platform.

- Adds support for LUKS2 security on diskful nodes (HPCM-5672)

  HPCM 1.11 introduces support for LUKS2 security on any diskful x86_64 nodes
  equipped with a Trusted Platform Module 2.0 (TPM2) device. For more
  information, see the section "Enabling and managing security on a disk
  enabled with Linux unified key setup 2 (LUKS2)" in the HPCM Administration
  Guide, the section "Cluster definition file example - Entries for service
  nodes that enable Linux Unified Key Setup 2 (LUKS2) security" in the
  Installation Guide, and section "4.4.5 Enabling LUKS2 Security on Q-HA
  Physical Nodes" of these release notes.

  You do not need to enable LUKS2 security on the admin node in order to
  enable LUKS2 security on other nodes. (A brief sketch for confirming that
  target nodes expose a TPM2 device appears at the end of this section.)

- Validates qualified updated firmware revisions for network switches:
  o Aruba switches (HPCM-5564)
  o HPE FlexFabric/FlexNetwork switches (HPCM-5565)

  See section 6.1 in the Appendix for details.

- Improves network setup in configure-cluster (HPCM-4638)

  The configure-cluster tool now prompts users during the initial interface
  setup menu to set the management IPs on the head and head-bmc networks, and
  on the admin node's interfaces.
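
Before enabling LUKS2 security on a group of diskful nodes, it can be useful
to confirm that each target node actually exposes a TPM2 device to the
operating system. The following is a minimal sketch using standard Linux
interfaces rather than an HPCM-specific procedure; it assumes the nodes are
already booted and reachable, and the pdsh group name "compute" is only an
example:

  # pdsh -g compute 'ls -l /dev/tpm0 /dev/tpmrm0 2>/dev/null'

The /dev/tpmrm0 device is created by the kernel's TPM resource manager, which
is only provided for TPM 2.0 hardware, so nodes that list /dev/tpmrm0 should
be suitable candidates for LUKS2 with TPM2 sealing. Nodes that report neither
device most likely have the TPM disabled in the BIOS/firmware settings.
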
3.4 General Additions and Enhancements
------------------------------------------------------------------------------
This release introduces several new enhancements to improve performance and
ease of use, including:

- Instructions in Upgrade Guide to Prevent EPEL Conflicts (HPCM-5089)

  EPEL is an optional repository that provides several packages that conflict
  with or provide newer versions of packages in HPCM 1.10. The HPCM Upgrade
  Guide now includes instructions on how to version lock those packages to
  prevent conflicts.

- Adds support for iSCSI provisioning (HPCM-5608, HPCM-5803)

  iSCSI has been added as a rootfs option in addition to disk, tmpfs, and NFS.
  Typically, iSCSI provisioning methods are higher-performing than NFS
  provisioning methods because iSCSI provisioning methods connect to the
  rootfs file system at the block level, while NFS adds another layer to the
  file system.

- Cluster manager command line interface improvements, including:
  o sudo user name captured in cm.log (HPCM-1680)
  o 'cm image show' now includes image size information (HPCM-4379)
  o adds chassis type to 'cm controller' command (HPCM-4759)

- Adds support for setting console rw/ro permissions (HPCM-3590, HPCM-5428)

  This change allows the HPCM admin to set, unset, and show which users are
  allowed either read-write or read-only access, using built-in Conserver
  capabilities. Permissions may be set globally, so that all consoles have the
  same permission lists, or each node can be set individually. If a node is
  unset, it goes back to the global setting (if set) or the initial defaults
  (if not set).

- Makes critical services more available during clone-slot (HPCM-5599)

  Critical services (e.g., database, power, config management, etc.) are now
  left running rather than turned off during the clone-slot operation,
  ensuring that those services remain available. At the end of the clone-slot
  operation, services are turned off for a very short period of time while a
  secondary in-place sync is performed to quickly update the destination slot
  with any missing data.

3.5 Monitoring Improvements
------------------------------------------------------------------------------
This release provides various improvements to the monitoring infrastructure
and tooling, including:

- Unified Alerting Platform Part 2 (HPCM-5221)

  HPCM 1.11 further improves the new alerting infrastructure introduced in
  HPCM 1.10. It provides alerting for cooling distribution unit (CDU)
  telemetry, CDU and cabinet leak events, AMD GPUs, Slingshot switch status
  heartbeat, email notifications for alerts, and more. Refer to the section
  "Monitoring alerts with unified alerting" in the HPCM System Monitoring
  Guide for more details.

- Provides advanced option to destroy/rebuild all topics in kafka (HPCM-425)

  See 'cm monitoring advanced kafka wipe -h' for more details about use of
  this new option.

- Adds a new datadir health check for kafka (HPCM-4602)

  This release includes a new check for kafka that checks for mismatched topic
  ids (meaning the folders are from different cluster instances), missing
  replica folders, or extra folders.

- Records HPCM-specific logs into opensearch (HPCM-4430)

  This release records logs from /opt/clmgr/log and /var/log/log.ctdb from the
  admin, and /var/log/glusterfs from su-leaders into opensearch.

- Adds collection of PCIM 'metric_cooldev_pdu' to pdu-collect (HPCM-3213)

- Enhanced slingshot health reporting, including, but not limited to:
  o report ports with BER and tx/rx pause (HPCM-2500)
  o report ports with UCW and llr_replay errors (HPCM-4095)
  o enhanced error handling in slingshot health reporting (HPCM-4080)
  o report MultiBit Errors (MBE) (HPCM-5418)
  o report which ports are unconfigured (HPCM-4096)

- Adds automation of cluster view configuration in dashboards (HPCM-4268)

  A new procedure detects the hardware type of the system and then automates
  the configuration in the cluster view panel, as well as attempting to
  automate the Grafana dashboards.

- Adds new 'cm monitoring {slurm,pbs}' command (HPCM-5546)

  See the following sections in the HPCM System Monitoring Guide for more
  information:
  o Monitoring Altair PBS Professional operations
  o Monitoring SLURM operations

- Adds new rackmap utility (HPCM-4291)

  The rackmap utility provides users with the ability to display power status
  information, temperature readings, HPE Slingshot information, and other
  cluster node telemetry data in a two-dimensional rack map display directly
  from the cluster manager command line interface. Refer to the section
  "Visualizing telemetry and status information with the rackmap tool" in the
  HPCM System Monitoring Guide for more details.

- Adds a new /opt/clmgr/tools/monitoring.sh script to collect data pertinent
  to analyzing monitoring issues (HPCM-5663)

As always, refer to the HPE Performance Cluster Manager System Monitoring
Guide for information on configuring monitoring tools and services in HPCM.

3.6 Scalable Unit Leader Nodes Improvements
------------------------------------------------------------------------------
This release contains several related changes that improve the performance of
Scalable Unit leader nodes, including:

- Insecure NFS Disabled by Default (HPCM-5932)

  HPCM 1.11 now blocks non-root users from accessing the gluster NFS server
  present on systems using SU leaders. To effect this change on a system which
  has already been deployed, run the following commands from one of the SU
  leaders and then reboot the leader:

  # gluster volume set cm_shared nfs.ports-insecure off
  # gluster volume set cm_logs nfs.ports-insecure off
  # gluster volume set ctdb nfs.ports-insecure off
  # gluster volume set cm_obj_sharded nfs.ports-insecure off

  The next time the leader reboots, the gluster NFS server will not accept
  non-privileged ports. HPE notes that although the gluster CLI will state
  that nfs.ports-insecure is off by default, HPE has found that it must be set
  to off explicitly for gluster NFS to have the correct behavior. The behavior
  change takes effect the next time gluster NFS is restarted.

3.7 Quorum High Availability Setup Improvements
------------------------------------------------------------------------------
This release contains changes that improve the robustness of Quorum-HA setups,
including:

- Improved ability to handle heavy connections without fencing (HPCM-5632)

  Under certain conditions when the admin virtual machine is under heavy
  connection loads, the connections managed by the firewall could get
  exhausted, which would cause operations on the physical node to fault, in
  turn causing the admin virtual machine to be fenced. This change allows HPCM
  to better handle this situation in both Quorum-HA and SAC-HA configurations.

3.8 Power Consumption Management Improvements
------------------------------------------------------------------------------
HPCM 1.11 introduces the following technical previews:

- System Power Capping (HPCM-5802)

  HPCM 1.11 introduces system-level power capping as a technical preview for
  the HPE Cray EX systems only. For more information on installing and running
  system power capping, see the HPCM Power Consumption Management Guide.

- Added cpwrcli and mpwrcli interfaces (HPCM-5424, HPCM-5590)

  The cpwrcli command and the corresponding cm power REST API enable users to
  power on and power off nodes, chassis, and other components. Likewise, the
  mpwrcli command allows you to set power limits for certain controllers and
  nodes. Both commands are introduced as technical previews in HPCM 1.11. For
  more information, see the CM Power Service REST API Documentation on the
  cluster manager home page and the HPCM Power Consumption Management Guide.

3.9 Documentation Updates
------------------------------------------------------------------------------
The following documentation was updated for the HPCM 1.11 release:

- HPE Performance Cluster Manager 1.11 Release Notes
- HPE Performance Cluster Manager Getting Started Guide (007-6500-016)
- HPE Performance Cluster Manager Installation Quick Start (P35632-009)
- HPE Performance Cluster Manager Installation Guide for Clusters with
  Scalable Unit (SU) Leaders (P36611-008)
- HPE Performance Cluster Manager Installation Guide for Clusters without
  Leader Nodes (P36610-008)
- HPE Performance Cluster Manager Installation Guide for Clusters with ICE
  Leader Nodes (P36609-008)
- HPE Performance Cluster Manager Command Reference (P36705-008)
- HPE Performance Cluster Manager Administration Guide (007-6499-016)
- HPE Performance Cluster Manager System Monitoring Guide (S-0120-005)
- HPE Performance Cluster Manager Power Consumption Management Guide
  (007-6498-016)
- HPE Performance Cluster Manager Upgrade Guide (S-9926-004)

HPE provides direct links to specific versions of the HPCM manuals:

- Getting Started Guide:
  https://www.hpe.com/support/hpcm-gsg-016
- Installation Quick Start:
  https://www.hpe.com/support/hpcm-inst-qs-009
- Install With SU Leader Nodes:
  https://www.hpe.com/support/hpcm-inst-su-leaders-008
- Install Without Leader Nodes:
  https://www.hpe.com/support/hpcm-inst-no-leaders-008
- Install With ICE Leader Nodes:
  https://www.hpe.com/support/hpcm-inst-ice-leaders-008
- Upgrade Guide:
  https://www.hpe.com/support/hpcm-upgrade-004
- Command Reference:
  https://www.hpe.com/support/hpcm-cr-008
- Administration Guide:
  https://www.hpe.com/support/hpcm-admin-016
- System Monitoring Guide:
  https://www.hpe.com/support/hpcm-monitor-005
- Power Consumption Management Guide:
  https://www.hpe.com/support/hpcm-power-016

The latest versions of documentation are always available on the HPE Support
Center. The Document List for HPE Performance Cluster Manager can be found
online:

  https://support.hpe.com/hpesc/public/docDisplay?docId=a00050433en_us

If a new revision of a manual is not ready at release, a placeholder document
with a link to the online version of the manual will be provided in the
clmgr-docs package included on the product ISO.

3.9.1 Manual Changes of Interest
------------------------------------------------------------------------------
The following information has been moved out of the HPCM PDF manuals:

- Cluster Manager Ports

  The section on Cluster Manager Ports has been removed from the HPCM
  Administration Guide and moved into a separate document which is available
  in the /docs directory of the product ISO and which gets installed to the
  following location:

  /opt/clmgr/doc/HPCM_Port_Info.pdf

- Singularity Examples

  The examples covering installation of Singularity containers have been
  removed from the HPCM Administration Guide and moved into a separate
  document which gets installed to the following location:

  /opt/clmgr/doc/HPCM_Singularity_Examples.pdf

3.9.2 Release Note Update 01 Changes
------------------------------------------------------------------------------
Update 01 of the HPCM 1.11 release notes updated the following sections:

- 2.2 Operating System Support

  Note [3] mistakenly referenced CPE 23.03; the correct version is 24.03.

- Known Problems and Workarounds section updates:
  o 4.7.8 GPU native monitoring failing on EX254n platform
  o 4.7.9 Native monitoring GPU_AMD_temp display is 0 on EX255a platform
  o 4.11.9 Unable to create RHEL9.x images in Q-HA admin virtual machine

3.10 Deprecated Features
------------------------------------------------------------------------------
The following features have been deprecated in the HPCM 1.11 release:

- su-leader-setup --add-leaders option

  HPE is deprecating the --add-leaders option used to add new groups of SU
  leaders to an existing system. See the section "4.5.2 Growing SU-Leaders No
  Longer Recommended" below for more information.

3.11 Future Deprecations
------------------------------------------------------------------------------
The following section describes features that should be avoided when possible
because HPE plans to deprecate them in the future. HPE announces deprecations
in advance so that users have time to plan in ways that minimize the impact of
specific changes.

- Writeable NFS Options

  HPE plans to deprecate the writable NFS options: xfs file per node and
  directory tree per node. These options were originally designed for use on
  the SGI ICE and HPE SGI 8600 platforms.

******************************************************************************
4.0 Known Issues and Workarounds
******************************************************************************
NOTE: Failure to reboot the admin node after upgrading or installing HPCM
patches may result in a non-functioning cluster. HPE recommends that users
reboot the admin node after upgrading or updating HPCM software on the admin
node to ensure that all relevant services are restarted.

4.1 Upgrade
------------------------------------------------------------------------------

4.1.1 Preparing a System for Upgrade
------------------------------------------------------------------------------
There are certain steps you can take before upgrading from an earlier version
of HPCM 1.x that will provide for a smoother upgrade experience. Many of these
steps are already outlined in the HPCM Installation Guide in the section
entitled "Upgrading from an HPE Performance Cluster Manager 1.x release". The
following are additional steps that may not be noted in the guide yet.

NOTE: HPE tested upgrade scenarios from HPCM 1.10 to HPCM 1.11 only. HPE only
tests upgrades from the most recent N-1 release to the latest release N.
To upgrade from N-2 (e.g., HPCM 1.9) to HPCM 1.11, follow the upgrade guide
for the HPCM 1.10 release, and then follow the upgrade guide for HPCM 1.11.

4.1.2 Problems Creating Images after Operating System Upgrades
------------------------------------------------------------------------------
When creating images with a new HPCM version based on a newly supported
operating system, HPE recommends that you create initial node images without
any operating system updates. If an operating system updates repo is already
selected, unselect it and proceed with initial image creation. Once you have
confirmed that your image has been created, you can re-select the operating
system updates repo and apply updates to the image.

An operating system updates repo may sometimes contain updates that have not
been tested by HPE. By using the original operating system release without
updates, image creation will be closer to what was validated at the time of
the release, and it provides more visibility into which operating system
packages are being updated.

4.1.3 Remove Mellanox OFED before upgrading running su-leaders
------------------------------------------------------------------------------
IM#1001790278

Attempting to upgrade an su-leader node which has the Mellanox OFED bits
installed will lead to errors due to conflicts between operating system OFED
packages and those provided by Mellanox. As such, the Mellanox OFED packages
must be removed before the upgrade or refresh, and then re-installed after the
upgrade is complete.

To remove the Mellanox OFED packages, use the following command:

  leader1:~ # /usr/sbin/ofed_uninstall.sh --force

When the upgrade of the su-leader is complete, reinstall the Mellanox OFED
packages.

4.1.4 ldmsd@.service Error Messages during Upgrade
------------------------------------------------------------------------------
HPCM-2718

When upgrading to a new HPCM version on the admin node, during the
installation of the cray-ldms package, the following failure may be reported:

  admin: Failed to try-restart ldmsd@.service: Unit name ldmsd@.service is missing the instance name.
  admin: See system logs and 'systemctl status ldmsd@.service' for details.

This message is harmless and can be ignored.

4.1.5 samba-ad-dc-libs deprecated in sles15sp5, but not obsoleted by SUSE
------------------------------------------------------------------------------
HPCM-5085

SUSE removed the samba-ad-dc-libs package, but did not set up any rules to
obsolete the package. This package may be installed on some HPCM systems. The
HPCM Upgrade Guide contains specific instructions on removing the package from
images and running systems as part of the HPCM 1.9 to HPCM 1.10 upgrade
process.

4.1.6 Package Upgrade Failures
------------------------------------------------------------------------------
HPCM-6115

During testing of upgrades from HPCM 1.10 + RHEL 8.8 to HPCM 1.11 + RHEL 8.9,
HPE noticed scriptlet failures reported by packages provided by the operating
system vendor, such as the following:

  admin: Running scriptlet: systemd-239-78.el8.x86_64    1419/1419
  admin: Couldn't write '10000' to 'kernel/perf_event_max_sample_rate': Invalid argument
  admin: warning: %transfiletriggerin(systemd-239-78.el8.x86_64) scriptlet failed

HPE did not see any failures that impacted functional operation of the system.
Concerned users may review any scriptlet failures reported during installation
or upgrade.
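
One way to review such scriptlet messages after the fact is to search the
package manager logs on the admin node. This is only a generic sketch and not
an HPCM-specific procedure; log file locations vary by distribution, and the
paths below are the usual defaults for dnf on RHEL-based systems and zypper on
SLES-based systems:

  # grep -iE "scriptlet|non-zero exit" /var/log/dnf.rpm.log /var/log/dnf.log
  # grep -iE "scriptlet|non-zero exit" /var/log/zypper.log

Any failures found this way can then be matched against the packages reported
during the upgrade to confirm that they came from vendor packages rather than
from HPCM components.
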
4.1.7 Version Locks When Upgrading Admin/Leader Nodes and Images
------------------------------------------------------------------------------
HPCM-6054

The HPCM Upgrade Guide contains specific information on version locking of
packages that may exist in more than one software repository. HPE recommends
following the instructions in that guide to prevent overwriting required
packages with others from other software repositories that might be configured
on the system, such as EPEL or SLE Updates for COS customers.

4.1.8 SLES15SP5 QU1 c-c fails on brltty and libbrlapi versions too old
------------------------------------------------------------------------------
HPCM-5885

When upgrading the physical admin nodes in a quorum HA configuration, the
virtual admin node, or a physical admin node in a non-HA configuration with
the SLES15SP5 QU1 iso, there may be package dependency errors during the
upgrade causing the upgrade to error out before completing. These errors may
look like:

  Resolving package dependencies...
  2 Problems:
  Problem: nothing provides 'group(brlapi)' needed by the to be installed libbrlapi0_8-6.4-150400.4.3.3.x86_64
  Problem: nothing provides 'system-user-brltty = 6.4-150400.4.3.3' needed by the to be installed brltty-6.4-150400.4.3.3

To resolve this error, remove the conflicting packages using the zypper
command on all of the physical admin nodes and virtual admin nodes as well.
These packages are not used by the Cluster Manager and are safe to remove. To
remove the packages, run:

  # zypper remove brltty libbrlapi0_8

4.2 Installation
------------------------------------------------------------------------------

4.2.1 All-at-once Kickstart Alternate Install Method Broken
------------------------------------------------------------------------------
IM#1001767617, 1001810593

The alternate installation method using all-at-once kickstart is currently not
working as expected. Customers are advised not to use this method until HPE
provides a fix for the issue.

4.2.2 Caution about creating bootable USB drives
------------------------------------------------------------------------------
There are instructions in the HPE Performance Cluster Manager Installation
Guide that describe how to create a bootable USB drive to install the HPCM
product on an admin node. HPE recommends using a 32GB (or larger) USB drive;
smaller USB drives will run out of space, and the process to create a bootable
USB drive may turn the media into a read-only device.

4.2.3 cray-rasdaemon package not installed by default
------------------------------------------------------------------------------
IM#1001779638

The cray-rasdaemon package is not installed by default by HPCM, but the
package is available within the product for use on any Cray EX system.
cray-rasdaemon is a RAS (Reliability, Availability and Serviceability) logging
tool. It currently records memory errors, using the EDAC tracing events. EDAC
is a set of Linux kernel drivers that handle detection of ECC errors from
memory controllers for most chipsets on i386 and x86_64 architectures; EDAC
drivers for other architectures, such as arm, also exist. This userspace
component consists of an init script which makes sure EDAC drivers and DIMM
labels are loaded at system startup, as well as a utility for reporting
current error counts from the EDAC sysfs files.
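
If you choose to install cray-rasdaemon on Cray EX nodes, the typical workflow
follows the upstream rasdaemon tooling. The commands below are a minimal
sketch based on that upstream tooling and assume cray-rasdaemon ships the
standard rasdaemon service and ras-mc-ctl utility; the exact service and
command names provided by the package may differ:

  # systemctl enable --now rasdaemon
  # ras-mc-ctl --status
  # ras-mc-ctl --summary
  # ras-mc-ctl --errors

The --summary and --errors options report the memory error counts that
rasdaemon has recorded from the EDAC tracing events described above.
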
4.2.4 Image created from sles15sp5 + updates - boot from disk only fails on
      some platforms; falls to dracut
------------------------------------------------------------------------------
HPCM-5162

An issue has been observed on nodes that are set to boot from disk only or to
use the disk bootloader: if the image assigned to the node was created using
the SLES15 SP5 distro together with the SLES15 SP5 distro updates, the node
will fail to boot and drop to a dracut shell prompt because it is unable to
find any disk devices. This issue has not been observed, however, when the
image was originally created without the SLES15 SP5 distro updates and then
updated using the 'cm image zypper' or 'cm image update' commands with the
distro updates repository selected or included in the repo group.

To avoid this issue, HPE recommends creating the node image without the SLES15
SP5 distro updates repository selected or included in the repo group. The
image can then be updated with SLES15 SP5 updates using the 'cm image zypper'
or 'cm image update' commands. The node assigned to the image can also be
updated using the 'cm node zypper' or 'cm node update' commands with the
distro updates repository selected or included in a repo group.

4.2.5 Gluster Package Conflicts on Layered HPCM Installation
------------------------------------------------------------------------------
HPCM-390

When performing a layered installation (i.e., Linux distro installed first,
followed by the HPCM install later), ensure that there are no gluster related
packages installed BEFORE attempting the HPCM installation to avoid package
conflicts. The HPCM product provides its own tested gluster packages.

4.2.6 configure-cluster admin failure due to system-user-qemu user conflict
------------------------------------------------------------------------------
HPCM-5358

When performing a layered installation (i.e., Linux distro installed first,
followed by the HPCM install later) using the standalone-install.sh script,
there is a chance that the configure-cluster step will fail due to a UID
conflict. Both SLES15 SP5 (system-user-qemu) and RHEL8.8 (qemu-kvm-common)
will attempt to create a qemu user with the UID 107. In some cases, other
packages which create users may have already used the UID 107, which causes a
failure like the following:

  /usr/sbin/groupadd -r -g 107 qemu
  /usr/sbin/useradd -r -c qemu user -d / -g qemu -u 107 qemu -s /usr/sbin/nologin
  useradd: UID 107 is not unique
  error: %prein(system-user-qemu-20170617-150400.22.33.noarch) scriptlet failed, exit status 1
  error: system-user-qemu-20170617-150400.22.33.noarch: install failed

To work around the failure, install the appropriate qemu packages on the admin
node BEFORE attempting to run the standalone-install.sh script.

4.2.7 Q-HA SLES15SP5 QU1 qemu-generated adminvm.xml files
------------------------------------------------------------------------------
HPCM-6157

When installing or upgrading physical admin nodes in a quorum HA configuration
using the SLES15SP5 QU1 install iso, there may be a qemu-generated adminvm.xml
file created on one or more of the physical admin nodes that was not generated
by the quorum HA setup tooling. This generated xml file should not be used for
managing the functional admin VM and should be removed.
To remove the file, run the following command: # pdsh -g gluster rm -f /etc/libvirt/qemu/adminvm.xml 4.2.8 DL365 Gen11 Servers need additional kernel parameter with RHEL8.9 ------------------------------------------------------------------------------ HPCM-5913 HPE internal testing has revealed an issue that causes DL365 Gen11 Servers to panic during boot when running RHEL 8.9. The problem does not impact older or newer versions of RHEL8. To workaround the issue, HPE recommends adding an additional kernel parameter. The procedure to set a kernel parameter is different for nodes that are provisioned versus physical admin nodes. - DL365 as Leader, Service or Compute Node 1) Set the kernel flag with the 'cm node set' command. For example: # cm node set -n {NODE/S} --kernel-extra-params nox2apic 2) Follow the instructions to provision the node like normal. - DL365 as Admin Node (Quorum-HA or SAC-HA Physical Nodes) 1) Boot the admin node installer from boot media like normal 2) When presented with a menu system for installation, follow on-screen instructions until asked to provide the kernel list. 3) When prompted for "Additional parameters (like console=, etc)," enter the following additional kernel parameters: --- nox2apic * It's important to use three (3) dashes to make the kernel parameter persistent across reboots. 4) Follow on-screen instructions to complete the installation. 4.2.9 Switch Default ICE diags in rpmlists to xe diags for admin and default ------------------------------------------------------------------------------ HPCM-2319 HPCM provides field and performance diags packages for SGI 8600 (ice) systems and for other systems (xe). Starting with HPCM 1.8, the ice diags are no longer the default diags packages selected for installation on the admin, service and non-ice compute nodes. As such, for SGI 8600 customers only, when upgrading to HPCM 1.8, HPCM will report file conflicts between the field_diags_licensed_xe and field_diags_licensed_ice packages, as well as conflicts between the perf_diags_licensed_xe and perf_diags_licensed_ice packages while running the refresh-node or refresh-image commands with the default rpmlist. To work around this issue, replace the field_diags_licensed_xe and perf_diags_licensed_xe packages with the field_diags_licensed_ice and perf_diags_licensed_ice packages in the generated rpmlists before attempting to run the refresh-node or refresh-image commands during the HPCM upgrade. 4.3 Networking ------------------------------------------------------------------------------ 4.3.1 ibN ifcfg files not automatically generated on the admin node ------------------------------------------------------------------------------ HPCM-844 The admin node is not automatically assigned IP addresses for the InfiniBand interfaces. These following instructions are the commands needed to assign addresses and create the ifcfg-ibX files if needed. Adding IP address for ib0 interface: # cm node nic add -n admin -N ib0 -w ib0 --compute-next-ip --interface-name admin-ib0 if ib1 is required (optional): # cm node nic add -n admin -N ib1 -w ib1 --compute-next-ip --interface-name admin-ib1 Write the config to the database: # cm node update config --sync -n admin Print the admin ibX IP address values to use in building the ifcfg-ib[0,1] interface files. 
# cadmin --show-ips --node admin
IP Address Information for node: admin
  ifname       ip            Network
  admin        172.23.0.1    head
  admin-bmc    172.24.0.1    head-bmc
  admin-ib0    10.148.0.1    ib0
  admin-ib1    10.149.0.1    ib1

Use the admin-ib0 IP address from the above output for the IPADDR field in
the ifcfg-ib0 file. Repeat if necessary with the admin-ib1 IP address for the
ifcfg-ib1 file.

SLES15 ifcfg file location:
  /etc/sysconfig/network/ifcfg-ib0
  /etc/sysconfig/network/ifcfg-ib1

# SLES15 ib0
STARTMODE='onboot'
BOOTPROTO='static'
IPADDR='10.148.0.1'
NETMASK='255.255.0.0'
WIRELESS='no'
LINK_REQUIRED='no'

# SLES15 ib1
STARTMODE='onboot'
BOOTPROTO='static'
IPADDR='10.149.0.1'
NETMASK='255.255.0.0'
WIRELESS='no'
LINK_REQUIRED='no'

RHEL8x ifcfg file location:
  /etc/sysconfig/network-scripts/ifcfg-ib0
  /etc/sysconfig/network-scripts/ifcfg-ib1

# RHEL8x ib0
DEVICE=ib0
TYPE=InfiniBand
BOOTPROTO=static
PREFIX=16
IPADDR=10.148.0.1
ONBOOT=yes

# RHEL8x ib1
DEVICE=ib1
TYPE=InfiniBand
BOOTPROTO=static
PREFIX=16
IPADDR=10.149.0.1
ONBOOT=yes

4.3.2 configure-cluster: "Unable to start master OpenSM on host.." Errors
------------------------------------------------------------------------------
IM#1001810442
When attempting to administer the InfiniBand fabric in configure-cluster ->
Configure Infiniband Fabric, failures may be reported due to the missing
opensm package. This happens because opensm is no longer a dependency of the
opensource-opensm-multifabric package. To work around the issue, HPE
recommends that system administrators install MLNX OFED on the nodes or image
in order to administer the InfiniBand fabric. Alternatively, administrators
may install the opensm package provided in the operating system repository.

4.3.3 Enabling predictable net names on nodes with disk-bootloader enabled
------------------------------------------------------------------------------
IM#1001811844
The net.ifnames parameter determines whether predictable net names are
enabled, and it is set by a conf.d script. The parameter is normally supplied
to nodes through tftpboot configuration files, but for nodes that boot
directly to disk or have the disk bootloader enabled, that conf.d script
needs to be run before rebooting.

To determine which nodes have disk-bootloader enabled, run the following
command:

  # cm node show -n "*" --disk-bootloader

To determine which nodes boot off disk directly: for nodes booted with EFI,
run 'efibootmgr' and observe the "BootCurrent" device; for nodes booted with
legacy BIOS, observe the boot order from the BMC or BIOS menu.

After upgrading to HPCM 1.8 (or later) on a compute or leader node that boots
to disk or has the disk bootloader enabled, run the following command, where
NODE is the name of the node:

  # ssh NODE /etc/opt/sgi/conf.d/80-ondisk-kernel-parameters

Or, if all nodes in a node group boot to disk (e.g. su-leaders), use pdsh
instead:

  # pdsh -g su-leader /etc/opt/sgi/conf.d/80-ondisk-kernel-parameters

4.3.4 Setting backup-dns-server fails to start named service
------------------------------------------------------------------------------
IM#1001787548
When configuring a compute/service/login node as a backup DNS server, the new
DNS server will appear in /etc/resolv.conf for all other service nodes, but
the operation does not currently start the named service. The workaround is
to manually start and enable the named service on the backup-dns server after
setup is complete.
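For example, a minimal sketch of this workaround, assuming a systemd-based
node and that the backup DNS server is a hypothetical node named 'service1':

  # ssh service1 systemctl enable --now named
  # ssh service1 systemctl status named

The 'enable --now' form both enables the named service at boot and starts it
immediately.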
4.3.5 Mellanox OFED module compatibility issues
------------------------------------------------------------------------------
HPCM-3909
HPE has identified several possible issues with MLNX OFED modules not loading
with either the GA kernel or specific kernel updates. For instance, with MLNX
OFED 23.07-x, the modules do not load with the RHEL 8.x base kernels but do
load with kernel updates. For SLES, the situation is reversed. HPE recommends
reading through the MLNX OFED release notes for specific details regarding
MLNX OFED module and operating system kernel compatibility. In some cases,
the 'mlnxofedinstall' command may help:

  # ./mlnxofedinstall --add-kernel-support --kmp

4.4 High Availability
------------------------------------------------------------------------------

4.4.1 SAC-HA Requires both High Availability and ResilientStorage for Updates
------------------------------------------------------------------------------
IM#1001757666
The current SAC-HA solution for HPCM customers requires the High Availability
add-on as well as the Resilient Storage add-on product. The SAC-HA solution
requires the dlm package, which only ships as part of the Resilient Storage
add-on. Since the dlm package ships on the RHEL 8.x ISO image, the dependency
can be satisfied at initial install. However, it is possible to encounter a
case where updates to the High Availability packages require an update to the
dlm package, which is only available through the Resilient Storage channel
and requires a valid subscription to access. The SAC-HA solution for HPCM
customers has been tested with the packages provided on the RHEL 8.x media as
of the HPCM 1.6 release.

4.4.2 HA-RLC: installation on leaders requires ha_net_ip assignment
------------------------------------------------------------------------------
HPCM-1617
HA-RLC installation on leaders will fail on SLES15 SPx unless the ha_net_ip
variable is assigned the correct value. To work around the issue, define
ha_net_ip=192.168.161.1 for r1lead1 and ha_net_ip=192.168.161.2 for r1lead2
in the cluster configuration file, and then rediscover the leaders.

4.4.3 quorum-ha physical nodes not updating slot chooser grub after clone-slot
------------------------------------------------------------------------------
HPCM-1705
HPCM supports cloning slots on Quorum-HA physical nodes. However, this only
lets system administrators work with the partitions related to the operating
system and HPCM. To allot the maximum amount of space to the virtual machine,
the disk or LUN used for the admin node virtual machine image should not be
split into pieces.

It is important to note that the gluster volume metadata is on the root
filesystem, and the gluster disks only have one partition with a large
filesystem by default. Gluster only supports rolling its version forward, not
rolling it back. Therefore, if the slot is cloned and the newly cloned slot
is updated with new packages that also include a gluster version change, the
older/original slot may be incompatible with the gluster bricks mounted at
/data on the physical nodes. If maintenance work may include a gluster
version change/update, HPE recommends backing up the admin virtual image in
some other way to support rollback.

In addition, HPCM 1.10 does not automatically update the label information
reported by the 'cadmin --show-root-labels' command. This means that after
cloning, the above command may report 'slot 2: (no install found)'. This
issue is purely cosmetic.
HPE recommends reaching out to the support organization for a procedure to
update the label reported by the slot chooser.

4.4.4 Q-HA SLES15 SP5: unlabeled data drive tends to remain /dev/sda
------------------------------------------------------------------------------
HPCM-4948
When installing Quorum-HA on SLES15 SP5, HPE recommends clearing the data
drive with the wipefs tool to help ensure that the disk containing the OS is
recognized as /dev/sda and the data disk as /dev/sdb.

4.4.5 Enabling LUKS2 Security on Q-HA Physical Nodes
------------------------------------------------------------------------------
HPCM-5801
Starting with HPCM 1.11, customers may use luks2 encryption on the physical
nodes making up the quorum-ha solution. Note that the gluster area is not
encrypted at this time. luks2 encryption may also be used in the admin
virtual machine, although the SWTPM (software TPM) data is stored in a
non-encrypted shared gluster filesystem (the same one that houses the virtual
machine admin node itself) on the physical nodes. HPE may investigate TPM
encryption in a future release.

HPE strongly recommends saving any luks2 password for the physical nodes and
the virtual machine in case the TPM is not able to unseal. Testing has shown
that the SWTPM may not save the enrollment data until the virtual machine is
shut down. If following standard procedures, the virtual machine will reboot
when the installation of the admin node is complete.

NOTE: Changing non-encrypted root filesystems into encrypted root filesystems
is not supported.

The following steps outline how to enable luks2 encryption on the physical
admin nodes in the Quorum HA solution so that any physical node can start the
admin VM and the VM can unlock the root volume using the SWTPM.

1) Add a UUID to adminvm.xml so that it does not regenerate each time:

     <name>adminvm</name>
     <uuid>361cedac-6aca-405b-bfe8-382cb46b39c9</uuid>
     <memory>158059488</memory>
     ..

2) Add a TPM section with persistence

3) Run the following commands to create the swtpm directory in shared storage
   and link it to /var/lib/libvirt/swtpm:

     # mkdir /adminvm/swtpm
     # chmod 711 /adminvm/swtpm
     # pdsh -g gluster rm -rf /var/lib/libvirt/swtpm/
     # pdsh -g gluster ln -s /adminvm/swtpm /var/lib/libvirt/swtpm

4.5 SU-Leader
------------------------------------------------------------------------------

4.5.1 conserver reports errors on su-leaders if no nodes are assigned
------------------------------------------------------------------------------
IM#1001790841
The conserver service reports the following error if no consoles are found,
which is the case when there are no nodes assigned to the su-leader:

  Node leader1 reported error: Job for conserver.service failed because the
  control process exited with error code.

This error message will not appear once nodes are assigned to the leader,
which is the normal case.

4.5.2 Growing SU-Leaders No Longer Recommended
------------------------------------------------------------------------------
IM#1001811296, HPCM-1673, HPCM-1600, HPCM-6128
The original design of the SU Leader system includes support to add
additional sets of 3 leaders at a later time. However, growing the gluster
volumes, while technically supported by gluster, has proven difficult to do
correctly and repeatably in automation without human intervention. As such,
HPE no longer recommends attempting to grow SU-Leaders.
When there is a need to increase the number of SU leaders in use on site, HPE
recommends the following approach:

1) Back up system logs and console logs (optional)
2) De-couple the admin node from the su leaders (disable-su-leader)
3) Discover additional leaders
4) Run the su-leader-setup command to configure the SU leaders (see the
   procedure in the HPCM Installation Guide for details)

Contact HPE support for more information and help with this procedure. The
section "Adding scalable unit (SU) leader nodes" will be revised in the HPCM
Installation Guide in the future.

4.5.3 su-leader-collection package fails to install
------------------------------------------------------------------------------
IM#1001726164, IM#1001743636
The su-leader-collection package has a dependency on the ctdb package. The
ctdb package, in turn, expects a specific version of the samba packages. If
these two packages get out of sync, for instance by installing samba updates
without the corresponding ctdb updates, the su-leader-collection package will
fail to install.

Should the system end up in this state, there are two possible workarounds.
Option 1 is to make the high availability repository available so that an
updated ctdb package can be pulled in to match the samba updates already
installed on the system or in an image. Option 2 is to downgrade the samba
packages to match the version designed to work with the ctdb package
available in current software repositories.

This problem may also present itself when installing the su-leader-collection
package on a system where a RHEL operating system updates repository is
available. If the high availability repository is not also available, the
samba-client-libs and samba-common packages can get out of sync and the
installation of the su-leader-collection package will report dependency
errors. The workarounds are the same as those listed above.

One way to prevent a system from encountering these issues is to lock
specific packages. For instance, by locking the samba base packages, it will
be much more difficult for the samba and ctdb packages to get out of sync.
When both the samba and the corresponding ctdb packages are available,
package locks may be removed in order to complete the update. Refer to the
Linux operating system documentation for more details about how to
lock/unlock packages.

4.5.4 NVME gluster disks must use /dev/disk/by-path in su-leader-nodes.lst
------------------------------------------------------------------------------
HPCM-1440
The su-leader-setup documentation recommends using /dev/disk/by-path devices
in the list of gluster disk devices, although /dev/sdX names have worked.
However, with NVME drives, using /dev/nvmeX names in su-leader-nodes.lst will
produce a series of errors. The tool will inform the user that the device is
not using /dev/disk/by-path device names and will produce a series of ugly
"basename" bash errors. These errors are easily avoided by using the
documented /dev/disk/by-path device names that HPE recommends.
Incorrect device names:

  leader1,172.24.255.1,172.23.255.1,/dev/nvme0n1
  leader2,172.24.255.2,172.23.255.2,/dev/nvme0n1
  leader3,172.24.255.3,172.23.255.3,/dev/nvme0n1

Correct device names:

  leader1,172.24.255.1,172.23.255.1,/dev/disk/by-path/pci-0000:44:00.0-nvme-1
  leader2,172.24.255.2,172.23.255.2,/dev/disk/by-path/pci-0000:44:00.0-nvme-1
  leader3,172.24.255.3,172.23.255.3,/dev/disk/by-path/pci-0000:44:00.0-nvme-1

4.5.5 Warning messages on su-leader reboot
------------------------------------------------------------------------------
HPCM-3259
After rebooting an su-leader, the following warning message may be displayed:

  Warning Mount /opt/clmgr/shared_storage, on leaderX, has EXTRA
  fuse/glusterfs process, count: 2

A small number of duplicate mounts are normal and not a cause for concern. In
a future release, HPE is investigating reduced use of bind mounts, which will
make it easier to properly manage mounts in the HA monitoring scripts.

4.5.6 Rebooting more than 3 leaders causes glusterd and ctdb issues
------------------------------------------------------------------------------
HPCM-2900
HPE has observed that when rebooting more than three (3) leaders at a time,
even when following quorum rules, glusterd can get stuck, causing a failure
in all brick processes, so mounts get stuck and ctdb status will remain
DISCONNECTED. HPE strongly recommends rebooting no more than three (3)
leaders at a time, while observing quorum rules. HPE is investigating the
issue.

4.5.7 command failed gluster-and-ctdb-health-check --quiet during upgrade
------------------------------------------------------------------------------
HPCM-5097
HPCM 1.10 introduced a script to check the health of gluster and ctdb. HPE
recommends running this script and fixing any problems BEFORE upgrading the
SU leader nodes. The script is /opt/clmgr/bin/gluster-and-ctdb-health-check.
Since this script was introduced in HPCM 1.10, it will not exist on SU leader
nodes that have not yet been upgraded. A step has been added to the upgrade
guide to manually copy the script to the SU leaders before upgrading them, as
follows:

  # pdcp -g leader /opt/clmgr/bin/gluster-and-ctdb-health-check /opt/clmgr/bin/gluster-and-ctdb-health-check

4.6 System Config and Discovery
------------------------------------------------------------------------------

4.6.1 Card type, bmc username, password and baud rate required in config files
------------------------------------------------------------------------------
IM#1001727110,1001767084,1001765634
Starting in HPCM 1.4, you must provide additional information in any config
files you plan to use for discovery. In the past, if the BMC user name,
password, or baud rate was not specified in the config file, the BMC would be
probed for the values. This probing took time and did not scale without
limit. The requirement now is that you always specify all four of the
following values:

  - card_type
  - bmc_username
  - bmc_password
  - baud_rate

The card_type values are currently case sensitive, so for iLO systems, use
'card_type=iLO', and for systems with non-iLO BMCs, use 'card_type=IPMI'.
Failure to use the correct case will result in broken consoles. HPCM talks to
iLOs and non-iLO BMCs using different APIs, so failure to provide a card_type
value, or providing an incorrect card_type value, typically manifests as a
failure of conserver to work or as node discovery taking an abnormal amount
of time due to failed ping attempts.

If you have already discovered the nodes, you can use the following commands
to set the values.
Replace the username and password with the correct values:

  # cm node set --bmc-username USER --bmc-password PASSWORD -n NODES
  # cm node set --baudrate RATE -n NODES

The 'cm node set' command does not yet support setting the card type, but
that may be done with the legacy cmu_mod_node command as follows:

  # cmu_mod_node --mgt-card CARDTYPE --hostname NODES

4.6.2 Adding ICE Compute Nodes to Discover Config File Fails
------------------------------------------------------------------------------
IM#1001809637
Attempting to discover ICE compute nodes via the discover config file will
cause the nodes to appear in the HPCM database, but the hostnames will fail
to be recognized. To work around the issue, ICE compute nodes should not be
listed in the discover config file.

4.6.3 Netboot Files Not Cleaned Up on Node Deletion from Database
------------------------------------------------------------------------------
IM#1001787083
When nodes are added to the database with the 'cm node add' or 'discover'
commands, pxe config files are created for the newly added nodes in
/opt/clmgr/tftpboot/grub2/cm. These files are also created when invoking
'cadmin --set-dhcp-bootfile' (e.g., switching between grub2 and ipxe-direct).
When a node is removed, the netboot file for the deleted node is not removed.
If nodes are added and the netboot environment is not refreshed, this can
cause issues. The current workaround is to manually remove netboot files for
deleted nodes (see the example after 4.6.6 below).

4.6.4 Fastdiscover fails to add Cray EX blades to DB
------------------------------------------------------------------------------
IM#1001810367
On Cray EX systems, cmcinventory first adds the controllers to the
fastdiscover.conf file and then goes through the entire process of
discovering them. Once they are in the database and reachable, HPCM then
scans for node MAC addresses within the controllers and adds them to the
fastdiscover.conf file. If a system administrator attempts to use a
fastdiscover.conf file which is already populated with NodeControllers and
Nodes to add back nodes, the command will display the following error:

  (): Attempt to add controllers failed: Controller name 'x9000c1r7b0'
  already used at /opt/sgi/lib/NewNodes.pm line 614.
  If failure was due to previously existing nodes consider option
  --skip-existing-nodes

To work around the issue, use the '--skip-existing-nodes' option as
instructed in the error message.

4.6.5 Blade fails to image with https, never moves to next interface
------------------------------------------------------------------------------
HPCM-1450, HPCM-3884
In some cases, when attempting to image blades with https, the blades will
hang on the https interface and fail to fall back to the next available
interface. To work around this problem, either change the boot order so that
http(s) is not listed first, or console into the node and select the pxe
IPv4 option.

4.6.6 Unknown kernel command line parameters in dmesg log
------------------------------------------------------------------------------
HPCM-3781
The following message appears in the dmesg log:

  [ 0.000000] Unknown kernel command line parameters
  "BOOT_IMAGE=/vmlinuz-5.14.21-150400.24.41-default boot=LABEL=sgiboot
  biosdevname=0", will be passed to user space.

The reason for the message is that the Linux kernel is now a little more
explicit about which command line parameters it did not process. It is known
that the BOOT_IMAGE parameter will be called out, but it can be safely
ignored. It is an informational message.
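The following is a minimal sketch of the manual netboot file cleanup
described in 4.6.3. It assumes the deleted node was named node003 (a
hypothetical name) and that its leftover files under
/opt/clmgr/tftpboot/grub2/cm can be matched by node name; on some systems the
files are keyed by MAC address instead, in which case substitute the node's
MAC in the format used by the file names. Review the matches before removing
anything:

  # cd /opt/clmgr/tftpboot/grub2/cm
  # ls | grep -i node003
  # ls | grep -i node003 | xargs rm -f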
4.6.7 EX420 consoles hang at serial drv
------------------------------------------------------------------------------
HPCM-5148
When using COS or RHEL, if node console output stops with the following
error:

  "Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled"

use the following command to set extra kernel parameters:

  # cm node set --kernel-extra-params PARAMS -n NODENAME

where PARAMS for COS and RHEL are as follows:

  COS : 8250.nr_uarts=1 8250.use_polling=1 8250.skip_non_ioport=1
  RHEL: 8250.nr_uarts=5

4.7 Monitoring
------------------------------------------------------------------------------

4.7.1 het_trap_processor ERROR mqtt client connect failure
------------------------------------------------------------------------------
IM#1001740008
On a newly installed admin node, het_trap_processor reports a connection
failure to mqtt. This error can occur whenever kafka and related monitoring
services are not yet running. To avoid the error, use the cm monitoring
commands to turn on the kafka/elk/alerta-based monitoring services.

  ==> het_trap_processor.log <==
  2020-02-18T19:34:00.643Z root ERROR mqtt client connect failure
  127.0.0.1:1883 error(111, Connection refused)
  2020-02-18T19:34:00.643Z root INFO Starting HET Processor

4.7.2 Grafana hpeclusterview_panel plugin is unsigned
------------------------------------------------------------------------------
HPCM-1215
The root_url for Grafana on HPCM systems is https://localhost:3000/grafana.
The Grafana community, which owns the plugin signing process, requires that
plugins be signed with root_url=https://localhost:3000. The
hpeclusterview_panel plugin can be signed with
root_url=https://localhost:3000/grafana, but when loaded, the panel will show
that the plugin has an invalid signature. HPE has opened a case with Grafana
to allow more flexibility in the root_url value used by private plugins and
will continue to monitor. Until then, HPE recommends that customers either
(1) set allow_loading_unsigned_plugins to true in the Grafana configuration
or (2) do not load the hpeclusterview_panel plugin.

4.7.3 'cm monitoring ldms' Commands Only Work on Admin Node
------------------------------------------------------------------------------
HPCM-2491
Some 'cm monitoring' commands (e.g., 'cm monitoring kafka status') will print
an error when run on the leader nodes, but 'cm monitoring ldms' currently
does not. The 'cm monitoring ldms' commands should only be run on the admin
nodes. HPE may provide a more appropriate error message in a future release.

4.7.4 Distributed Data Not Available on Grafana Dashboards
------------------------------------------------------------------------------
HPCM-2854
Anytime a change is made that expands the number of instances of a service
that will be monitored, such as when distributing kafka or ELK (e.g.,
kafka-dist-setup, elk-dist-setup), Service Infrastructure Monitoring (SIM)
must be disabled, re-enabled, and then restarted. This is also the case for
adding a new leader, adding a new switch, and so on. To re-enable SIM after
distributing kafka or ELK, run the following commands:

  # cm sim disable
  # cm sim enable
  # cm sim start

4.7.5 cm-postgresql-14 Service Fails to Start
------------------------------------------------------------------------------
HPCM-4012
Opensearch has a memory limit of 30G. On systems with less memory (roughly
60G or less in total), the cm-postgresql-14 service may fail to start with an
error message similar to the following:

  529 -- Unit cm-postgresql-14.service has begun starting up.
  530 Mar 07 09:15:45 system postmaster[522170]: 2023-03-07 09:15:45.893 CST
      [522170] FATAL: could not map anonymous shared memory: Cannot allocate
      memory
  531 Mar 07 09:15:45 system postmaster[522170]: 2023-03-07 09:15:45.893 CST
      [522170] HINT: This error usually means that PostgreSQL's request for a
      shared memory segment exceeded available memory, swap space, or huge
      pages. To reduce the request size (currently 17436033024 bytes), reduce
      PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or
      max_connections.

To work around the issue, reduce the default memory requirement of opensearch
by changing the "-Xms30g" and "-Xmx30g" values in /etc/opensearch/jvm.options
(see the example following 4.7.9 below).

4.7.6 Default directory formatting for log rotation changed
------------------------------------------------------------------------------
HPCM-4880
The default log rotation directory formatting for /var/log/HOSTS and
/var/log/consoles changed in HPCM 1.10. Previously, when a file was rotated,
logrotate would put the file under /var/log/HOSTS/old/YYYY-MM or
/var/log/consoles/old/YYYY-MM, where "YYYY-MM" signified the year and month
when the file was rotated. Starting in HPCM 1.10, the format is now per day,
YYYY-MM-DD. So any day a log is rotated, it will be put under its own dated
directory.

The configuration for /var/log/HOSTS is in the file
/etc/sysconfig/cm-logrotate-parallel-hosts on the admin node, and the
configuration for /var/log/consoles is in
/etc/sysconfig/cm-logrotate-parallel-consoles on the admin node. To modify
the format of the old directory where rotated logs are moved, modify the
directory-specific configuration file, change the "rotatedirname" variable
under the preremove section to the desired format, and comment out the
previous configuration of the "rotatedirname" variable. These configuration
files contain additional examples of different formats to use. For example,
to switch to a format containing both the hostname and the date, set
rotatedirname in the configuration file to:

  rotatedirname="$(date +%Y-%m 2>/dev/null)/$(basename ${1%%-$(date +%Y%m%d 2>/dev/null)*})"

Any adjustments to these configuration files will persist across upgrades.
For more information, see the logrotate man page.

4.7.7 confluent-schema-registry.service shows failed state
------------------------------------------------------------------------------
HPCM-5965
If the SIM dashboard reports that the confluent-schema-registry.service is in
a failed state on an SU leader, run the following command on that su-leader
to mask the confluent-schema-registry.service:

  # systemctl mask confluent-schema-registry.service

4.7.8 GPU native monitoring failing on EX254n platform
------------------------------------------------------------------------------
HPCM-6283
GPU native monitoring is currently failing on the EX254n platform. HPE is
investigating the problem and working on plans to address the issue.

4.7.9 Native monitoring GPU_AMD_temp display is 0 on EX255a platform
------------------------------------------------------------------------------
HPCM-6291
The native monitoring system only displays zero values for "GPU_AMD_temp" on
the EX255a platform. HPE is investigating the problem and working on plans to
address the issue.
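The following is a minimal sketch of the jvm.options change described in
4.7.5. The 16g value is only an example; size the heap to fit the memory
actually available on the system. The 'opensearch' service name is assumed to
be the standard unit name:

  # grep -E '^-Xm[sx]' /etc/opensearch/jvm.options
  -Xms30g
  -Xmx30g

Edit those two lines, for example to:

  -Xms16g
  -Xmx16g

Then restart the opensearch service (for example, 'systemctl restart
opensearch') so the new heap size takes effect.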
4.8 Command Line Interface (CLI)
------------------------------------------------------------------------------

4.8.1 Node aliases used only by the cluster manager CLI
------------------------------------------------------------------------------
HPCM-449
HPCM provides the capability to set aliases for nodes. These node aliases can
be used by the cluster manager CLI. Setting an alias does not add aliases for
network names. Future releases may set up /etc/hosts and/or DNS based on node
aliases on a specific network for that node. Site administrators are free to
add specific entries to /etc/hosts and distribute that file across the
cluster.

4.8.2 Console failure due to delayed credential updates
------------------------------------------------------------------------------
HPCM-4017
When adding nodes to the system, it is best to add the nodes with the
controller credentials (BMC, iLO, etc.) at the time the nodes are added. This
ensures the database instantly has the credentials necessary for the power
and console services. For example, if you are using a cluster definition file
to add a set of nodes, you should include bmc_username and bmc_password if
you have them.

When you do not have the credentials, a service tries to guess common
controller usernames and passwords until a match is found. This is done for
any node that lacks credentials when it is added. This process takes time, so
if you add nodes and do not specify the controller credentials, certain
services may not start with the right information. This can be observed in
the console service: you may find that the console service starts with
missing credentials, which leads to error messages. If you hit that
situation, you can update the configuration later, after the correct
credentials have been guessed (if possible), like this:

  # cm node update config --sync conserver -n admin

If your system has SU leaders, also run:

  # cm node update config -t role su-leader --sync conserver

4.9 Graphical User Interface (GUI)
------------------------------------------------------------------------------

4.9.1 Image Management in the GUI
------------------------------------------------------------------------------
The Image Management section of the GUI has been deprecated. HPE plans to
make updates in a future version. For now, HPE recommends that users create
images using the CLI.

4.9.2 History View of node information is not displayed between time ranges
------------------------------------------------------------------------------
HPCM-4747
Note that when using the History View in the GUI, the time entered in the
dialog box is evaluated with the timezone of the server, not with the time of
the laptop/desktop running the GUI.
4.10 Diags and Firmware ------------------------------------------------------------------------------ 4.10.1 Omni-Path firmware flash operation fails ------------------------------------------------------------------------------ IM#1001790304 When attempting to flash firmware on the Omni-Path cards, you may see a failure like the following: service1:~ # hfi1_eprom -d all -u /usr/share/opa/bios_images/HfiPcieGen3_1.9.2.0.0.efi Updating driver file with "/usr/share/opa/bios_images/HfiPcieGen3_1.9.2.0.0.efi" Using device: /sys/bus/pci/devices/0000:12:00.0/resource0 Unable to mmap /sys/bus/pci/devices/0000:12:00.0/resource0, Invalid argument This is a known issue and the workaround is documented online: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00060145en_us 4.10.2 HPE Apollo 9000 Tray Power-Down and Failure to Power-Up ------------------------------------------------------------------------------ HPCM-2766 HPE has observed that occasionally one or more trays will power down all the nodes in the tray and then will not allow the nodes to be powered back up. This issue is fixed with an update to the HPE Apollo Chassis Management Controller firmware (CMC_20221121s.bin). 4.11 Miscellaneous ------------------------------------------------------------------------------ 4.11.1 AIOps version information ------------------------------------------------------------------------------ IM#1001787597 The AIOps feature does not currently include a release file in /etc for simple identification. When providing feedback on the AIOps feature, please provide both the HPCM version (e.g., HPCM 1.11) and the version number of the cm-aiops RPM (e.g., 1.11-729). 4.11.2 PBSPro: nodes are in offline state after OS provisioning ------------------------------------------------------------------------------ IM#1001790902 While testing the HPCM PBSPro connector, HPE noticed that nodes would always show 'offline' after PBS OS provisioning, even though PBS services are running on the compute nodes. The problem is that the pcm_provision alarm default is set too low (30). To workaround the problem, set the hook pcm_provision alarm to 1200. HPE is working with Altair to add this to the PBS connector documentation. 4.11.3 SSH Key Content in Dump Does Not Match Expected ------------------------------------------------------------------------------ HPCM-665 SSH key content in /opt/sgi/secrets/root-ssh/sgi_kdump/ may be different from the content saved to /var/crash/sgi_kdump/.ssh/. The difference can cause confusion when debugging, but it does not cause a kdump failure. HPE continues to investigate the issue. 4.11.4 PCIM Documentation ------------------------------------------------------------------------------ HPCM-1438 The documentation provided in the PCIM package at /opt/cmu/pcim/docs/Apollo_System_Manager.pdf calls out the Apollo System Management tool. Updates to the documentation are planned for a future release. 4.11.5 kdump fails/panics on init ------------------------------------------------------------------------------ HPCM-3746 Driver issues can sometimes cause kdumps to fail. After initiating a dump with 'echo c > /proc/sysrq-trigger', all nodes may eventually panic/stop during init. The error message most often displayed will be similar to the following: BUG: unable to handle page fault for address: ffffffffffffffc8 Additional drivers may be needed in order for kdump to work on specific hardware configurations. 
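The following is a minimal sketch of adding an extra driver to the kdump
initramfs on a RHEL-style system, using the 'dracut_args' directive in
/etc/kdump.conf; the megaraid_sas driver name is only an illustration and
must be replaced with the driver actually required by the hardware:

  # echo 'dracut_args --add-drivers "megaraid_sas"' >> /etc/kdump.conf
  # systemctl restart kdump

Restarting the kdump service rebuilds the kdump initramfs so that it includes
the additional driver. On SLES-based nodes, the equivalent settings live in
/etc/sysconfig/kdump; consult the distribution's kdump documentation for the
exact variables.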
4.11.6 cpasswd command reports python deprecated warning
------------------------------------------------------------------------------
HPCM-6177
On some systems, the cpasswd command may display the following warning:

  # cpasswd -N node002
  /opt/sgi/sbin/cpasswd:21: DeprecationWarning: 'crypt' is deprecated and
  slated for removal in Python 3.13
    import crypt, getopt, getpass, logging, os, random, re, string, sys, traceback
  Enter new password:
  Enter new password (again):

The warning does not impact functionality and may be ignored. HPE will look
into addressing the warning in a future release.

4.11.7 iscsi errors when an image has not been activated
------------------------------------------------------------------------------
HPCM-6116
When using the iscsi diskless function for compute nodes, messages like the
following may appear in /var/log/messages on an admin or leader node:

  2024-02-20T09:35:44.390138-06:00 leader3 kernel: [1202340.564016][T463167]
  Unable to locate Target IQN:
  iqn.1995-03.com.hpe.cm:windom-sles15sp5-ssdkms.squashfs in Storage Node
  2024-02-20T09:35:44.390150-06:00 leader3 kernel: [1202340.575256][T463167]
  iSCSI Login negotiation failed.

These messages are generated when a node is attempting to boot an image that
has not been activated. To track down the node that may be causing this,
check for nodes assigned to the image. For example, the following command:

  # cm node show --image | grep windom-sles15sp5-ssdkms

will list the nodes with the windom-sles15sp5-ssdkms image. Find one that is
attempting to boot and check the console log for messages such as:

  ro-root-tmpfs-overlay: ISCSI - LOGIN: iscsiadm --mode node --portal
  172.23.255.241 --targetname
  iqn.1995-03.com.hpe.cm:windom-sles15sp5-ssdkms.squashfs --login
  Logging in to [iface: default, target:
  iqn.1995-03.com.hpe.cm:windom-sles15sp5-ssdkms.squashfs, portal:
  172.23.255.241,3260]
  iscsiadm: Could not login to [iface: default, target:
  iqn.1995-03.com.hpe.cm:windom-sles15sp5-ssdkms.squashfs, portal:
  172.23.255.241,3260].
  iscsiadm: initiator reported error (8 - connection timed out)
  iscsiadm: Could not log into all portals
  ro-root-tmpfs-overlay: iscsiadm login error.

In the example below, the first 'ls' command confirms that the image exists
on the admin node, while the failure of the second 'ls' command shows that no
squashfs exists for it.

  # ls -ld /opt/clmgr/image/images/windom-sles15sp5-ssdkms ; ls /opt/clmgr/image/image_objects/windom-sles15sp5-ssdkms.squashfs
  drwxr-xr-x 22 root root 276 Feb 20 09:51 /opt/clmgr/image/images/windom-sles15sp5-ssdkms
  ls: cannot access '/opt/clmgr/image/image_objects/windom-sles15sp5-ssdkms.squashfs': No such file or directory

To fix the issue, use the 'cm image activate' command to activate the image.

4.11.8 systemd-sysv-generator messages generated in console
------------------------------------------------------------------------------
HPCM-6253
In some distributions, users may see messages similar to the following on the
console and in console logs:

  SysV service 'X' lacks a native systemd unit file. Automatically generating
  a unit file for compatibility. Please update package to include a native
  systemd unit file, in order to make it more safe and robust.

These messages are safe to ignore. HPE will investigate these messages for
possible resolution in a future release.
4.11.9 Unable to create RHEL9.x images in Q-HA admin virtual machine
------------------------------------------------------------------------------
HPCM-6314
HPE has confirmed a customer report regarding an inability to create RHEL 9.x
images on the Quorum-HA admin virtual machine. This failure is caused by
missing CPU features in the /adminvm/adminvm.xml file on the physical admin
nodes. To work around the problem, additional features must be added to the
/adminvm/adminvm.xml configuration file on the Q-HA physical admin nodes. The
cpu mode section needs to be updated to include the last four (4) feature
lines (denoted with "+" in the example below):

     <model>qemu64</model>
  +  <feature policy='require' name='...'/>
  +  <feature policy='require' name='...'/>
  +  <feature policy='require' name='...'/>
  +  <feature policy='require' name='...'/>

The virt HA resource will need to be stopped and started again on the Q-HA
node, using either pcs or crm, for the change to take effect. Doing that will
reboot/restart the admin node.

4.12 Ubuntu
------------------------------------------------------------------------------
HPCM 1.11 provides improved support for Ubuntu Server on the x86_64
architecture only. Please note the following limitations:

- Upgrading HPCM 1.9 Ubuntu images is not supported
  Because Ubuntu was supported on the HPCM 1.9 release as a technical
  preview, upgrading existing Ubuntu HPCM 1.9 images is not officially
  supported. HPE recommends creating new HPCM 1.11 Ubuntu images for
  service/compute nodes. See the "Upgrading Ubuntu compute nodes" section in
  the HPCM 1.11 Upgrade Guide for more information on upgrading HPCM 1.10
  Ubuntu compute nodes.

- Supported on service/compute nodes only
  Ubuntu is only supported on service/compute nodes, not on management nodes
  such as admin nodes and su leaders. HPCM support is limited to long-term
  support (LTS) releases.

- Requires Ubuntu support for the platform
  HPCM support for Ubuntu on any given compute platform is contingent on the
  platform itself being supported on Ubuntu (see the HPE Servers Support & OS
  Certification Matrix for Ubuntu for details):
  https://techlibrary.hpe.com/us/en/enterprise/servers/supportmatrix/ubuntu.aspx

- Other HPE software products may not support Ubuntu
  HPE software products such as HPE Slingshot, HPE Cray Programming
  Environment and HPE Message Passing Interface (MPI) may not provide support
  on Ubuntu releases.

- Some other vendors may also support Ubuntu
  Other vendors of note that provide Ubuntu support include NVIDIA (drivers,
  HPC SDK, CUDA), Mellanox Infiniband, Intel and AMD.

- Ubuntu is an internet-based distribution
  While Red Hat Enterprise Linux and SUSE Linux Enterprise Server provide
  installation ISOs containing many packages, the Ubuntu Server ISO image has
  very few packages, most of which are pre-installed in a squashfs image.
  Ubuntu expects packages to be retrievable, either from the internet or via
  local mirrors. HPCM has attempted to make a basic compute node installable
  without internet access by including the required Ubuntu packages on the
  HPCM repository setup ISO (e.g., cm-1.10.0*.iso) itself. Customers wishing
  to further customize Ubuntu compute images will need to either mirror
  Ubuntu repositories on the admin node or provide internet access to remote
  repositories.

4.12.1 No Support for AIOps
------------------------------------------------------------------------------
The AIOps features are not supported with Ubuntu.

4.12.2 named package not installed by default
------------------------------------------------------------------------------
HPCM-3020, HPCM-3182
Ubuntu does not install named by default, so it is not included in the
default HPCM Ubuntu compute node images.
On compute nodes, named support is only used when secondary/backup DNS
support is needed and administrators do not wish to use leader nodes for this
task.

4.12.3 nscd package not installed by default
------------------------------------------------------------------------------
HPCM-3018
The nscd package is provided in the Universe (community-supported)
repository, so it is not included in HPCM Ubuntu images by default. If
desired, it can be added to images.

4.12.4 FIPS support for Compute Nodes running Ubuntu
------------------------------------------------------------------------------
HPCM-2398
To enable FIPS on Ubuntu, refer to the available Ubuntu documentation:

  ubuntu.com/security/certifications/docs/fips-enablement

Please note that switching the system to the FIPS-certified packages cannot
be easily undone. HPE recommends experimenting on a test system before trying
this on production systems.

4.12.5 HPCM packages for Ubuntu are not signed
------------------------------------------------------------------------------
HPCM-3312
Unlike the HPCM packages built for RPM-based distributions, which are all
signed with HPE digital keys, the HPCM packages built for use on Ubuntu are
not signed. As with the Ubuntu release itself, the repository release file is
signed. The HPE digital keys are installed by the hpe-build-key package.

4.12.6 cm image 'rpmlist' option not supported
------------------------------------------------------------------------------
HPCM-2085, HPCM-3706, HPCM-3179, HPCM-3945
The 'rpmlist' option to the 'cm image' command is not supported with Ubuntu
images at this time. Attempts to use the command with Ubuntu images will fail
as follows:

  system:~ # cm image rpmlist -i ubuntu2204-x86_64 -W /root/ubuntu_rpmlist_version.txt --rpm-version
  error: ubuntu2204-x86_64 is an Ubuntu target which is not rpm based.

HPE may address this in a future release.

4.12.7 NFS image provisioning with overlay on compute node fails
------------------------------------------------------------------------------
HPCM-5039
Any nodes assigned an Ubuntu-based OS image and an expanded overlay writable
type NFS rootfs (either tmpfs-overlay or nfs-overlay) will fail to boot and
fall to a miniroot shell. HPE recommends using an 'overmount' NFS writable
type, or using an image object by 'activating' the image.

The nodes can be set to use nfs overmount with the following command, where
NODES is the node expression matching the nodes to set to the new writable
type:

  # cm node set --rootfs nfs --writable nfs-overmount -n NODES

Or, for a tmpfs r/w area with the overmount type:

  # cm node set --rootfs nfs --writable tmpfs-overmount -n NODES

Or activate the image using the following command, where IMAGE is the name of
the image to activate:

  # cm image activate -i IMAGE

******************************************************************************
5.0 Feedback
******************************************************************************

Hewlett Packard Enterprise is committed to providing documentation that meets
your needs. To help us improve the documentation, send any errors,
suggestions, or comments to Documentation Feedback (docsfeedback@hpe.com).
When submitting your feedback, include the document title, part number (if
applicable), edition, and publication date located on the front cover of the
document. For online help content, include the product name, product version,
help edition, and publication date located on the legal notices page.
******************************************************************************
6.0 Appendix
******************************************************************************

6.1 Notes on Using Unsupported or Unmanaged Network Switches with HPCM
------------------------------------------------------------------------------
HPE Performance Cluster Manager 1.11 supports the following switches:

- LG/Edgecore ECS4610-26T/ECS4610-50T: 1.4.2.25 (Final)
- Extreme X440/X460/X670: 16.2.5.4
- Extreme X440-G2/X460-G2/X670-G2: 22.7.2.4
- HPE FlexNetwork 5510: 7.1.070 Release 3507P18-US/3507P09
- HPE FlexFabric 5710: 7.1.070 Release 6710P03/6710P03-US
- HPE FlexFabric 5900/5920: 7.1.045 Release 2432P61-US/2432P61
- HPE FlexFabric 5940 48SFP+/6QSFP+ or 48XGT/6QSFP+: 7.1.070 Release 2612P10-US/2612P10
- All other FlexFabric 5940 Models: 7.1.070 Release 6710P01-US/6710P03
- HPE FlexFabric 5945: 7.1.070 Release 6710P03/6710P03-US
- HPE FlexFabric 5950: 7.1.070 Release 6301/6301-US
- Arista DCS-7010T-48: 4.21.6F
- Aruba 6300M, 8320, 8325, 8360: FL/GL/LL/CL 10.12.1021

Some advantages to using supported Ethernet switches are:

- You can use cluster manager tools, such as switchconfig, to manage the
  switch.

- The cluster manager configures the supported switches with settings that
  segregate cluster management traffic from application data traffic and
  settings that support efficient transfer of operating system images.

- With a supported switch, a command exists that allows users to configure
  various settings automatically for either management switches, compute or
  leader nodes:

    # switchconfig_configure_node --node NODENAME [--dry-run]

  Using "--dry-run" allows you to see what commands would be run before
  actually running them, which is the safer option when running this command.

If you use unsupported switches, you need to use the switches' own commands
to complete some configuration steps manually. Unsupported switches are
included in the cluster as unmanaged switches. For these switches, the
cluster manager does not attempt to automatically configure any switch
settings. The following procedure explains how to configure an unsupported
switch into a cluster.

Configuring a cluster that uses an unmanaged switch:

1. Complete the installation instructions according to the HPE Performance
   Cluster Manager Installation Guide, but stop before you run the discover
   command.

2. Enter the following command to preserve the settings on the unsupported
   management switches:

     # cadmin --enable-discover-skip-switchconfig

   This command prevents the cluster manager from logging into management
   switches at a global level, which allows you to configure the unsupported
   switches later in the installation.

3. Configure the switches for multicast or configure the cluster manager to
   use unicast. This step ensures that the leader and compute nodes receive
   their images from the admin node in an efficient manner. Do one of the
   following:

   - Verify whether the unsupported switch is configured for "IGMP" and
     "IGMP Snooping", and configure those two settings if they are not in
     effect at this time. The cluster manager uses a multicast protocol
     called udpcast to image leader and compute nodes during the boot
     process. For multicast to be successful, the management switches must
     support IGMP and IGMP Snooping. For information, see the switch
     configuration documentation.

   Or

   - Configure the cluster manager to use Rsync or BitTorrent when it images
     the compute nodes. Rsync and BitTorrent are not multicast methods and
     instead use unicast.
For information about how to change the method by which the leader and compute nodes receive images, see the HPE Performance Cluster Manager Installation Guide. 4. Complete the rest of the installation procedure, beginning with running the discover command, to configure the rest of the components into the cluster manager. The discover command configures supported switches and all other components to be under cluster manager control. Because you ran the "cadmin enable-discover-skip-switchconfig" command before you ran discover, the discover command allows DHCP to assign supported switches an IP address so that you can SSH or Telnet to the supported switches if necessary. 5. (Optional) Enable DHCP on the switch. See the documentation for the unsupported switch for information. DHCP enables the cluster manager to assign an IP address to the switch. You might need to enable either Telnet or SSH, and then create a remote username and strong password in order to manage these switches remotely. 6. (Optional) Enable independent management of the unsupported switch by completing one or both of the following tasks: - Enable either Telnet or SSH and then create a remote username and strong password for the switch. These credentials enable you to manage the switch remotely. - Enable DHCP on the switch. DHCP enables the cluster manager to assign an IP address to the switch. For more information, see the switch documentation. 6.2 Supported Power Distribution Units (PDUs) ------------------------------------------------------------------------------ A power distribution unit (PDU) reads AC power and energy measurements on cluster rack-level power domains. For the AC power measurement feature to function, the cluster must have one or more of the following PDUs: - Server Technology Sentry3 - Server Technology Sentry4 - 880459-B21 (Raritan) HPE Mtrd 3P 39.9kVA/60A 48A/277V FIO PDU - PX-5946V-F5V2 (Raritan) HPE Mtrd 3P 17.3kVA/48A 9brkr PDU - P9R82A HPE G2 Metered 3Ph 17.3kVA/60309 4-wire 48A/208V - P9R84A HPE G2 Metered 3Ph 22kVA/60309 5-wire 32A/230V For more details on power management, see the HPE Performance Cluster Manager Power Consumption Management Guide. 6.3 HPE Power and Cooling Infrastructure Monitor (PCIM) Supported Devices ------------------------------------------------------------------------------ The HPE Power and Cooling Infrastructure Monitor provides insight into the state of the hardware related to the power and water-cooling components of an HPE water-cooled solution. Supported devices include the following: - HPE Apollo 9000 CDU (Cooling Distribution Unit) - HPE Apollo 9000 Chassis (Power Supplies and Switches) - HPE Cray EX CDU (1.2 MW and 1.6 MW) - Apollo DLC Passive CDU (for A2k and A6500 clusters) - HPE SGI 8600 CDU - ARCS (Adaptive Rack Cooling System) - SGI 8600 CRC (Cooling Rack Controller) - Motivair RDHX (Rear Door Heat Exchanger) - Raritan PDUs (Power Distribution Unit) - HPE Cray EX VCDU (Virtual Cooling Distribution Unit) - HPE PDUs - ServerTech Cray ClusterStor Switch 63A 400V PDU (R4M34A) - ServerTech Cray ClusterStor Switch 60A 415V PDU (R4M35A) 6.4 HPCM Update Repository Guide ------------------------------------------------------------------------------ HPCM update repositories are hosted on the HPE Software Delivery Repository (SDR). Patches for HPCM releases (available for both x86_64 and aarch64 architectures) are available on the SDR. 
6.4.1 Accessing the HPCM Update Repository the First Time
------------------------------------------------------------------------------
To access the HPCM updates on the SDR, you must have the following items:

- HPE Account
- HPE Support Center User Token
- HPCM Service Agreement ID (SAID)
- Support Account Reference (SAR) entitlements

The SAR entitlements must be linked to both an HPE Account and to an HPE
Support Center User Token to gain access. In some cases, multiple HPE
Accounts can be linked to the same SAR entitlements to allow multiple users
access.

6.4.1.1 Creating an HPE Account
------------------------------------------------------------------------------
(1) Go to "Create a new account" and enter all required information.

(2) Select the "Provide additional business contact information" option at
    the bottom of the page, and enter your business contact information.

    ** You must enter a value for the "Company name." Not completing this
    ** field will prevent you from logging in to the HPCM repository on the
    ** SDR.

(3) Select "Create account."

6.4.1.2 Linking entitlements to an HPE Account
------------------------------------------------------------------------------
(1) Go to the HPE Support Center at https://support.hpe.com/

(2) Select the "Preferences" icon, and then select "Sign in."

(3) Enter the user ID and password for your HPE Account, and select
    "Sign In."

(4) In the "Toolkit Library", select "My Contracts."

(5) Select "Add a Support Agreement," and enter your Service Agreement ID
    (SAID) and your Support Account Reference (SAR) in the required fields.

(6) Choose "My Group (Private)" as the "Group."

(7) Select "Next" and verify your "Contract ID/SN" information.

6.4.1.3 Logging in to the HPCM Repository on the SDR
------------------------------------------------------------------------------
* The examples below use the HPCM 1.9 release. Adjust version numbers
  accordingly for your HPCM release.

(1) Create an HPE Support Center User Token at
    https://hpsc-pro-site1-hpp.austin.hpe.com/hpsc/swd/entitlement-token-service/generate

    A token will automatically be created after you log in with your HPE
    Account. However, you must wait an hour for the token to be activated
    before using it.

    You can save this token to use for all future SDR logins, or you can
    create a new token for each login. However, you must wait an hour for
    each new token to be activated. Creating multiple tokens does not nullify
    any of the previously created tokens. All tokens are active and valid for
    authentication.

(2) Log in to the HPCM Repository on the SDR at
    https://update1.linux.hpe.com/repo/hpcm/

    Enter your HPE Account username (email address) in "Username," and enter
    your HPE Support Center User Token in "Password."

    Contact a support representative if you cannot log in to the HPCM
    repository after waiting an hour for a new HPE Support Center User Token
    to be activated.
Upon a successful login, you should see a directory listing similar to the following: Index of /repo/hpcm * Parent Directory * centos/ * rhel/ * rocky/ * sles/ HPCM updates for releases supported on the SLES15 SP4 operating system on the x86_64 architecture, for example, are available at the following location: https://update1.linux.hpe.com/repo/hpcm/sles/15sp4/x86_64/ Index of /repo/hpcm/sles/15sp4/x86_64 * Parent Directory * 1.9.0/ * 1.10.0/ Further selecting the HPCM 1.9.0 release directory will display the updates applicable to the HPCM 1.9 release on SLES15 SP4: https://update1.linux.hpe.com/repo/hpcm/sles/15sp4/x86_64/1.9.0/ Index of /repo/hpcm/sles/15sp4/x86_64/1.9.0 * Parent Directory * 11778/ * 11779/ * repodata/ 6.4.2 Mirroring an HPCM Repository on the SDR ------------------------------------------------------------------------------ It is also possible to mirror an HPCM update repository to a local system with a simple shell script. For example, the following is a shell script to mirror the HPCM 1.9.0 update repository on a local system: #!/bin/sh umask 022 USERNAME="" PASSWORD="" BASEURL="update1.linux.hpe.com/repo/hpcm/" cd / wget --no-parent -nH -r -c --cut-dirs=1 --auth-no-challenge \ --user=${USERNAME} --password=${PASSWORD} \ https://${BASEURL}/sles/15sp4/x86_64/1.9.0/ Tailor the above script example to meet any site specific requirements. 6.5 List of CASTs Addressed in HPCM 1.11.0 ------------------------------------------------------------------------------ The following CASTs were closed out with the HPCM 1.11 release: CAST-32447 HPCM SIM shows 2 (of 9) switches down, start as up on reboot but show down over time CAST-32556 Problems when using hpemon and trying to get the Sec running on the leaders due to site restrictions on ssh between cNs CAST-32621 [RFE] Add the DNS search path to cminfo via a new cm-configuration script CAST-32639 HPCM 1.6 - cm health alert configuration assistance needed CAST-33302 1.8/1.9: /etc/prometheus/snmp.yml is missing auth: community: default-community for the flexnetwork_mib CAST-33681 [RFE] HPCM provided gpu_sizzle should be statically compiled if possible (highly preferred) CAST-33772 Package conflict during admin node upgrade (HPCM 1.8 / rhel 8.6) CAST-33795 Kafka Connect: ERROR: index row size 3176 exceeds btree version 4 maximum 2704 for index "label_pkey" CAST-33987 If admin kernel version = image kernel version, cannot cm dnf remove the kernel from the image CAST-34052 cm command line should not allow underscore in network names - breaks named CAST-34626 RHEL: mariadb error when cloning slot on admin CAST-34978 Gluster NFS allows unprivileged mounts 6.6 List of Issues Addressed in HPCM 1.11.0 ------------------------------------------------------------------------------ Incident numbers from HPE's tracking system are provided for reference: HPCM-219 HPCM: cm image yum/dnf/zypper node/image: group/pattern install needs examples for installing with group names containing spaces HPCM-289 checkDbReady doesn't work, only returns true, should have a proper, global implementation to check if the database is operational HPCM-425 script to rebuild kafka cluster HPCM-485 conserver needs to support sending a break signal on Cray hardware HPCM-538 Enable JMX metrics to be gathered from cmdb (Java Grizzly webserver) HPCM-547 Slingshot Telemetry- No way to activate the Inactive configurations: HPCM-824 AMD GPU monitoring needs support for alerts HPCM-1054 Add ability to set Native monitoring SEC priority via the cm-cli HPCM-1114 Timescale 
monitoringdb Postgres Users and Schema HPCM-1232 Take all logs from /opt/clmgr/log to Elastic HPCM-1320 Q-HA: setup fails to ask for independent BMC network when using interactive mode HPCM-1680 add SUDO_USER to cm.log HPCM-1765 Add Flashing support for AMI MegaRAC based Gen 11 aka Cray XD computes HPCM-1907 Q-HA, RHEL 8.6, ifconfig/ipaddr deprecated or removed from distro HPCM-2012 gluster-exporter causes "gluster volume status" to continuously say that locking failed HPCM-2319 Switch default ICE diags in rpmlists to xe diags for admin/default HPCM-2427 HPCG for AMD Gpus HPCM-2428 HPCG for nvidia gpus HPCM-2500 Slingshot Reporting - Report ports with ber and tx/rx pause HPCM-2589 Can't upgrade iLO firmware via cfirmware HPCM-2650 Grafana dashboard - Top level view of link and switch state HPCM-2749 /opt/clmgr/lib/cluster-configuration command prints one error message HPCM-2900 HPCM1.8: SLES15SP4: glusterd seems to get stuck and not launch all brick processes so mounts get stuck after su-leader reboot. ctdb status remains DISCONNECTED for some leaders HPCM-2957 XD6500/XD670 B1 SPR Support on HPCM: Compute/Service HPCM-3213 pdu-collect: Add PCIM metrics HPCM-3573 Put rhel8 monitoring rpms back to Recommends HPCM-3590 conserver is wide open access-wise including from compute HPCM-3780 SLES15 SP4: Booting q-ha cloned slot results in unknown network interface, PNN disabled HPCM-3886 Rework log writing to Kafka HPCM-3916 Native monitoring fails on computes with a dedicated custom user HPCM-3935 Upgrade sdu components HPCM-3999 cm-power-service: new node route. (like controller/chassis) HPCM-4053 Improve NativeProcessExecutor with Java 9+ ProcessHandle API HPCM-4073 Online diags EX255a support HPCM-4075 CHC Add AMD MI300 tools HPCM-4080 Enhance error handling in SS health reporting HPCM-4095 Report ports with UCW and llr_replay errors HPCM-4096 Report ports which are not configured HPCM-4124 HPCM CrayEX Hardware Dashboard Incorrect HPCM-4138 Native monitoring can't start due to various file permission errors HPCM-4139 If MONITORING_SECMD_PRIORITY is set, it is not honoured HPCM-4209 Enhance CrayEx HW alerts (CEC/BMC) as per customer requirements HPCM-4223 Add etcd to HPCM needed for system power capping support HPCM-4241 System Power Capping support in HPCM HPCM-4243 Provide CM Node Power cap for System Power Capping HPCM-4244 Provide an inventory interface for System Power Capping HPCM-4268 Integrate HPE Cluster View Dashboard Automation (mPhasis) w/ HPCM HPCM-4291 Productize and generalize power map and integrate it into the cli HPCM-4300 [RFE] Add the DNS search path to cminfo via a new cm-configuration script HPCM-4303 Add support for EX235n to the Hardware Triage Tool HPCM-4307 Add support for EX425 to the Hardware Triage Tool HPCM-4308 Add support for Parry Peak to the Hardware Triage Tool HPCM-4337 Performance tools - Shibuya Stream testing all xgmi links between sockets HPCM-4353 asyncssh: ssh_cmd replacement for tnet_ssh HPCM-4362 Improve cm monitoring ss config command for SS 2.2 HPCM-4377 Opensearch Grafana Dashboards giving Unexpected Error HPCM-4379 cm image show -d should show complete image size HPCM-4419 Build Diagnostics using EX254n blade in snowdon HPCM-4427 Collect and Test PDU Metrics HPCM-4430 persist logs of /opt/clmgr/log from admin/leaders HPCM-4431 HPCM 1.10: cm monitoring timescaledb show requires a default behavior when no option is specified HPCM-4434 HPCM 1.10: cm monitoring timescaledb retention does not update the retention period HPCM-4444 hpe-python: Add 
aiofiles HPCM-4447 Run all arm supported diagnostics on EX254n node HPCM-4449 EX255a: Upgrade AGT binary HPCM-4452 EX255a: Check rectifier status and telemetry for issues, Make sure they are running and balanced HPCM-4458 Check APUs are meeting minimum performance  HPCM-4459 Check APUs are meeting minimum HBM Performance HPCM-4460 Check NICs are meeting minimum performance and bandwidth HPCM-4462 Upgrade Telegraf to Latest Version in HPCM HPCM-4472 EX255a- Add wrapper for mpiBench,presta and sqmr to Online diags HPCM-4526 jobmonitor.conf: Add rest_server_Ip option. HPCM-4554 Configure etcd for system power capping support HPCM-4599 Add cli tool to interface with clmgr-power REST API HPCM-4601 chassis_routes: Added Perif ops HPCM-4602 healthcheck and fix for bad kafka topics HPCM-4606 SIM dashboard are not enabled after upgrade HPCM-4624 Triage tool kit should accept log folder as input and analyze the hardware failure HPCM-4633 Alerting rule reference file needs more comments and schema file needs doc string HPCM-4638 Make updating head/head-bmc networks easier in configure-cluster HPCM-4650 Change names for input.yml and input_on.yml for better understanding HPCM-4656 Change the on and off flow as part of revised workflow HPCM-4657 Collect logs and serial information of all types of node whether supported or not HPCM-4675 optimize alerting spec file HPCM-4741 Make changes to the WLM dashboards and Logstash files to accommodate remlog-collect to Telegraf transition for wlm (slurm/pbs) monitoring HPCM-4759 Support chassis types in cm controller, chassis types have no mechanism to update credentials from the CLI HPCM-4799 Upgrade procedure for quorum ha physical nodes HPCM-4820 Need cfirmware solution for XD6500/XD665 M4 Genoa HPCM-4851 cm node template show '--bmc-info' flag should be '--credentials' HPCM-4866 Segregate wlm monitoring from cluster-health HPCM-4882 Increase task.shutdown.graceful.timeout.ms HPCM-4894 clmgr-power REST API: add BMC type HPCM-4904 Slingshot Health Reporting and Alerts - Phase 3 HPCM-4917 HPCM1.10: Grafana alert dashboard to include a link to the Alertmanager page HPCM-4967 RHEL/ROCKY 9.3 Support (Compute/Service nodes only) HPCM-4968 RHEL/ROCKY 8.9 Support (includes TOSS 4.7) HPCM-4969 hpe python include dbus_next library HPCM-4975 HPCM fabmgmt: minor dialog menu improvement HPCM-4990 nfs-ganesha has bad systemd unit file fro nfs-ganesha-lock.service HPCM-5019 HPCM1.11/aarch64/apollo70/apollo80: Does not boot with experimental- grub-cm-arm64.efi HPCM-5022 HPCM1.10: cm image set command displayed unwanted output of associated repo group HPCM-5033 WLM Telemetry Fails to Write to Timescale HPCM-5044 support 'slingshot' network types for hsn dns entries HPCM-5061 HPCM1.10:Cluster health-Verify AMD GPU dgemm and stream test failed. 
HPCM-5077 SAC-HA: SLES15 SP5: 30-virt-setup deprecated commands HPCM-5092 Minor edit for cluster-configfile manual HPCM-5093 Add option to assign controllers to computes in cm_create_fake_ configfile HPCM-5114 Add nvidia HPL for EX254n HPCM-5122 Hardware triage tool: validate 'On' flow HPCM-5124 EX254n: validate 'On' flow HPCM-5125 EX4252: validate 'On' flow HPCM-5126 EX255a: check node health HPCM-5128 EX4252: check node health HPCM-5141 Add EX254n diagnostics HPCM-5151 ROADMAP: XD224 Support HPCM-5152 HPCM Slingshot hardware dashboards: Correct few issues HPCM-5176 EX255a cbios support HPCM-5181 Remove --noplugins flag from being added to cinstallman rpmmgr image yume commands HPCM-5186 New async_apis rpm HPCM-5190 HPCM1.10: Netchk reports inaccurate errors in the log files of EX254n/EX255a nodes. HPCM-5191 HPCM1.10:ARM64: (memchk)Memory size and DIMM speed are not reported in EX254n nodes. HPCM-5197 HPCM doc work for COS refactoring to CNE/COS-base HPCM-5202 Permissions on files under HOSTs can vary HPCM-5210 HPCM1.10: Obtaining the CPU of a NUMA Ppin results in an error output for CN's. HPCM-5212 Identify different BERT and MCE Errors and Repair action HPCM-5213 cmutils (twisted/async) removed errand run_cmu_expand HPCM-5221 HPCM Unified Alerting - Phase 2 HPCM-5225 Need the Python Library requests-toolbelt 1.0.0 added HPCM-5226 GUI is throwing and freezing when a metric max value is set to 0 HPCM-5232 Support NVME disks in diskchk, diskperf and fsperf HPCM-5233 Add 'loop' and 'fabric' parameters to cpuperf, cwcpuperf and fabricperf HPCM-5235 Add OpenBMC default creds to power service credential detect HPCM-5237 Chassis System Group regression from HPCM-5224 HPCM-5242 HPCM1.10: SIM: logstash-exporter messages continue flooding in /var/log/messages after adding monitoring-services group in SIM HPCM-5246 conserver reload issue on large cluster HPCM-5247 system monitoring (cn): timescaledb not showing data on Grafana Dashboard HPCM-5249 Set up DNS server correctly for cray-sdu-rda container for HPCM HPCM-5251 cm-network-show man page has incorrect flags for controllers and configfile HPCM-5254 HPCM1.10-slurm/jobmonitor/grafana - dashboards have incorrect or missing partition information HPCM-5260 change Stuck_in_bios_boot logic for EX4252 HPCM-5261 update Check_PCIe_Missing for EX4252 HPCM-5264 power fault issue: empty PWR_STS_CAP HPCM-5265 blade latch test HPCM-5266 pmbus decoder: Repair actions should be called for bit 6 HPCM-5268 change log path to /var/log HPCM-5269 Identify DIMM failures HPCM-5270 SIVOC failed to power up the 48V HPCM-5277 Routing and unrouting alerts to kafka and opensearch HPCM-5282 Add repair action for RAS poison error HPCM-5294 confluence page to map OS versions to OFED, CUDA and ROCM versions and Download links HPCM-5297 asyncio_cmdb add to_thread for io blocking functions HPCM-5300 Identify os kernel panic HPCM-5301 change Unexpected_Booted pattern from 'warm reset' to reset HPCM-5302 EX4252: Check_PCIe_Missing HPCM-5303 EX4252: Unable_to_apply_bios (validation) HPCM-5304 EX4252: Stuck in UEFI shell HPCM-5306 Use async-apis/asyncio_cmdb instead of tlib/asyncio_cmdb_utils HPCM-5309 Remove IB mention in cluster-configfile man page HPCM-5312 IPID decoder for bardpeak HPCM-5314 Identify Missing NMC HPCM-5320 cm_util clustershell group processing not working correctly HPCM-5325 Routing and unrouting alerts to Slack HPCM-5326 enabling all rules should be re-worked when there is a validation failure in the middle. 
HPCM-5327 Alertmanager email routing: Separate notification for warning alerts to diff recipients HPCM-5328 DOC: Support the corelated alert rule configuration for opensearch alerting. HPCM-5329 cm health alertman: csv / json/ text dump of alerts HPCM-5330 Convert the SS Cassini error from elastalert to new Unified Alerting Infra HPCM-5331 Provide a framework for timescale alerting HPCM-5335 ClusterShell.CmUtil user and ssh options fields not working HPCM-5337 aiclientsession: Add more kwargs filters HPCM-5345 Remove agt & AMDXIO from stout728 HPCM-5346 HPCM 1.10: Replace underscore with dash in hwtriage options to conform to other GNU CLI options. HPCM-5349 Serial Numbers information file is getting printed while using log_path HPCM-5351 Increase Prometheus scrape interval HPCM-5352 /etc/prometheus/snmp.yml is missing auth: community: default- community for the flexnetwork_mib HPCM-5354 DOC: HPCM 1.10: Provide a method to disable the opensearch retention policy HPCM-5356 HPCM1.10: When heartbeat elk indices are generated, the node down/up Alert Rules Status is not updated HPCM-5357 HPCM 1.10: cm monitoring kafka status reports that aarch64 is not supported even though kafka topics report entries from ARM nodes. HPCM-5361 Block zookeeper startup when slots don't match HPCM-5362 Fix cluster.id reset for confluent-kafka service HPCM-5364 Remove push_key.py from clmgr-power HPCM-5371 Upgrading systemimager-server on su-leaders can hang or fail making upgrade difficult HPCM-5374 Handle WNC for supported cardname HPCM-5375 Copy /tmp/miniroot-mgmt-network-device to /opt/clmgr/etc to handle upgrades HPCM-5378 HPCG-local: If job fails on one node due to UME slurm kills off on all other node HPCM-5379 Procedure to upgrade an ubuntu compute image and node HPCM-5380 HPCM 1.10: cm aiops enable should remove dependency on alerta HPCM-5382 cfirmware: ModuleNotFoundError - 'requests_toolbelt' HPCM-5383 Add library dependency for cfirmware to the sgi-talib.spec.in HPCM-5387 If admin kernel version = image kernel version, cannot cm dnf remove the kernel from the image HPCM-5388 Handle timeout Error HPCM-5389 Show MCE errors to console HPCM-5390 run on branch after off branch HPCM-5391 Support adding list of bios versions in hardware.yml file HPCM-5392 triage_output.json contains extra colon HPCM-5393 Ubuntu image update from HPCM 1.9 -> HPCM 1.10 skips updating sgi-service-node and sgi-csn HPCM-5394 Ubuntu upgrade: sgi-csn displays error, postinst script needs to handle abort case HPCM-5395 cinstallman --update-image with apt does not check rpmmgrImage return code properly. HPCM-5397 Ubuntu upgrade: refresh-image fails on installing cmdb-rest-lib, conflicts with cm-rest-lib HPCM-5399 Generate epd file for EX255a HPCM-5403 ERROR entry messages found in *.s files HPCM-5404 Unit off should be identified only for SIVOC, 48V ECB, 48v-12V HPCM-5405 EX255a: Add repair actions HPCM-5408 Alerting enable validation should continue when there is a failure instead exiting HPCM-5409 Modify cm monitoring alerting status command output to include routing status HPCM-5416 asyncio_cmdb: Add map_keys to CmutilAsyncIO functions. 
HPCM-5418 Add support for reporting MBE in HPCM Slingshot reporting HPCM-5421 Update to 1.10 patch 11793 HPCM-5422 EX255a: Add repair actions for few registers HPCM-5424 Add new cm-power-services source code HPCM-5427 clientsession_kw: regression with duplicate timeout args HPCM-5428 Add CONSERVER_RW and CONSERVER_RO to configure-cluster, and cluster configuration file HPCM-5435 Validate EX425 On flow HPCM-5442 Add XD670 support to cfirmware HPCM-5445 HPCM changes to use SS 2.2 heartbeat feature HPCM-5449 Add wrapper to run rochpcg HPCM-5451 Add latest HTT to HPCM - patch HPCM-5452 EX254n: Add repair actions HPCM-5456 su-leader /etc/hosts should have other leader nodes listed HPCM-5457 update all scripts with short args as i/p HPCM-5458 check_accessibility.py: Specifications of name of Management switches HPCM-5459 Fabric inventory: permission denied for both right and wrong ip HPCM-5460 controllers such and the nC will have firmware in tar.gz format. Need to support ingesting this file type HPCM-5463 'DAC stall' Error HPCM-5466 Add babelstream binaries and script for EX255a HPCM-5467 Add transferbench binary for EX255a HPCM-5469 Remove rochpl & rochpcg build part and mhist the binary HPCM-5471 Catch invalid mac-addresses for active_gateway in switchconfig HPCM-5480 Remove ESM check for EX425 HPCM-5481 check_node IFS issue HPCM-5482 hardware-triage-tool: make the logpaths RFC 3339/ISO_8601 compliant HPCM-5485 timescale sink writing empty labels HPCM-5486 HPCM Slingshot Dashboards need update to improve performance HPCM-5488 SS 2.2: Alerting support for heartbeat: switch status HPCM-5489 discover_skip_switchconfig has a comma in configure-cluster preventing it from being set HPCM-5494 Add an option for just collecting serial numbers HPCM-5495 RFE: HPCM provided gpu_sizzle should be statically compiled if possible (highly preferred) HPCM-5496 DOCS: Monitoring Guide Updates Needed for Patch 11796 HPCM-5497 HPCM 1.11 Customer Reported RFEs HPCM-5498 HPCM 1.11 Non-roadmap Features and Improvements HPCM-5499 HPCM 1.11 Code Clean Up and Internal Facing Improvements HPCM-5500 HPCM 1.11 Upgrade external/mhisted components HPCM-5504 Add nvidia HPCG for EX254n HPCM-5505 cfirmware sc check|update|type not working on new slingshot blade switches HPCM-5508 pdu-collect: logging broken HPCM-5509 Fix Postgres pg_hba file to allow ipv6 connections HPCM-5510 IPv6_rpfilter=yes in firewalld.conf blocks IPv6 traffic on QHA VM. 
HPCM-5511 uboot not rebooting after update HPCM-5512 Add node-power cap service to build HPCM-5513 Update Oblex diagnostics for EX255a HPCM-5514 cfirmware cannot update Cassini with HPCM 1.10 HPCM-5515 node power cap: Fix patch return components list HPCM-5516 cfirmware nc checkall gives random results HPCM-5519 Add dgemm for A1 APUS for EX255a HPCM-5521 Change Hardware recipe names to external names HPCM-5522 HTT: Investigate how to address the confidentiality concerns HPCM-5525 ADMIN: Gluster vlumes are mounted multiple times over head and head-bmc HPCM-5528 Add miniHPL with crayMPI for EX255a HPCM-5529 Add 4 point compliance matrix text EX255a HPCM-5530 CheckNodeHealth failure throwing incomplete file path HPCM-5531 Alerting: cm health alertman shows incorrect error message when API status is down HPCM-5532 Alerting: cm monitoring alerting cmd should handle api errors gracefully (due to proxy issues) HPCM-5533 Provide (hpcm_pcs.go) for System Power Capping HPCM-5534 cfirmware HPCM 1.10 fails to update cc --recovery_image HPCM-5536 cm Command to show connector configs and set properties HPCM-5539 update shibhuya stream to run cpu-hbm HPCM-5540 move diags across perf, field and noship for patch HPCM-5541 Remove hpl EX254n with openmpi from diags HPCM-5543 stout728 sles11sp4 x86_64 Nov21-23 rpm-phase build failed in diags:cluster HPCM-5546 cm cli for wlm (SLURM/PBS) monitoring - enable/disable/status HPCM-5547 nvidia-gpu-xhpl PERCENT 5 throwing Out Of Memory error HPCM-5548 HPCM 1.10: PDU Monitoring grafana dashboard fails to load any data HPCM-5553 RHEL: mariadb error when cloning slot on admin HPCM-5554 run_4pt_screen.sh always fails on 2nd run HPCM-5555 Add power and bios support for XD6500/XD670 B1 SPR HPCM-5556 add monitoring support for XD6500/XD670 B1 SPR HPCM-5559 sgi-fabmgmt can conflict with opensource filelock python module HPCM-5560 Add power support for XD224 HPCM-5561 su-leader-setup --destroy fails with 1.10 patches installed HPCM-5564 Aruba Switch Firmware Refresh + Qualification (HPCM 1.11) HPCM-5565 HPE Switch Firmware Refresh (HPCM 1.11) HPCM-5573 Timescale helpers allow null or empty filters HPCM-5574 Add check_timeline to Patroni config HPCM-5582 EX254n diags failure due to recipe change HPCM-5583 Compile EX255a binaries with rocm6.0 HPCM-5584 PIP warnings during library installation HPCM-5585 HPCM 1.10: su-leaders missing dependency on iptables for ip failover event from ctdb HPCM-5586 sensormon: Add AMI support to redfish polling HPCM-5588 psycopg v3 python bindings HPCM-5589 Remove overlap between perf and noship diags HPCM-5590 Add new cm-power-services rpm HPCM-5592 q-ha physical admin node upgrade: sgi-admin-node %post scriptlet fails HPCM-5593 distro-rpm-lists should call crepo --recreate-rpmlists on upgrade HPCM-5594 30-set-dns returns error on q-ha HPCM-5596 Rename distro-rpm-lists to distro-pkg-lists HPCM-5597 Q-HA: Upgrade sles15sp4 HPCM 1.9 -> sles15sp5 HPCM 1.10 AdminVM fails to configure - unsupported configuration: chardev 'spicevmc' HPCM-5598 Remove no-shippable diags from perf aarch64 rpm HPCM-5599 Don't turn off services in clone-slot HPCM-5600 cm-cli localhost node option doesn't work when cmdb isn't running HPCM-5602 HPCM1.10 admin DNS_DOMAIN set to cluster instead of house HPCM-5603 Fix linpack and nvidia-gpu-xhpl on EX254n HPCM-5604 HTT fails with UnicodeDecodeError while checking MCE Errors HPCM-5606 shibuya stream random BW results APU to APU HPCM-5607 cm-power-service: Add sysd,config, install HPCM-5608 implment iscsi diskless root support 
HPCM-5609 SS dashboards enablement through cm monitoring slingshot CLI HPCM-5614 Fix Aruba switch hardware limit 15 vMACs / switch pair on Cray EX HPCM-5615 Add switchconfig find for OIDs for Aruba switches HPCM-5618 Improve the kafka notification policy in Grafana alerting HPCM-5620 Upgrade CPE on black to 23.12 HPCM-5621 Node-PowerCap: Bug with empty requests HPCM-5622 Q-HA upgrade from HPCM 1.9 -> 1.10 SLES15 SP5: HA - VM fails to be accessible outside of physical host HPCM-5623 cm node slot copy allow 'localhost' option to be used for cloning slots on admin node for q-ha HPCM-5624 iscsi provision image names not compatible with IQN convention HPCM-5626 cmcinventory - consider enabling iscsi by default for root HPCM-5627 iscsi provision - route configure-iscsi log through systemd for time stamps HPCM-5632 quorum-ha virtual admin with heavy connection count causes physical host to be fenced HPCM-5634 Alerting: CDU telemetry metrics HPCM-5636 Alerting: Leak Events - CDU/Cabinet/ HPCM-5638 rochpcg looks for intelmpi HPCM-5639 Remote support: Add memory event support for XD6500 HPCM-5640 Remote support: Add support for CPU events HPCM-5642 pdu-collect: Fix community string HPCM-5643 HPCM 1.10: cm health check fabricperf must support ib2 and above HPCM-5644 Additional fru inventory values for Intel (and fixes) HPCM-5645 missing file for kafka setup HPCM-5647 HPCM1.10 + patch11795 Babelstream diag -device option not working. HPCM-5650 system-power-capping hpcm inventory includes too many nodes HPCM-5655 run_dgemm_EX255a runs on each GPU serial needs to be parallel HPCM-5656 confluent-kafka needs to be masked when setup as proxied on admin HPCM-5662 pdu-collect: kafka_push is broken HPCM-5663 Create a monitoring support collection tool HPCM-5666 HPCM1.10: Elevate the quality of error handling in Slingshot health reporting. 
HPCM-5667 monitor scripts didn't account for leaders with no squashfs existing, check and egg issue for startup HPCM-5669 80-enable-sysrq is broken HPCM-5671 kdump crashkernel cmdline param not defined on virt or phys admins HPCM-5672 luks2 root disk encryption using TPM 2 admins, leaders, compute - gluster spaces not encrypted HPCM-5800 cm cli with luks2 root disk encryption using TPM 2 HPCM-5801 q-ha TPM state when VM migrates physicals for luks2 root disk encryption using TPM 2 HPCM-5802 System Power Capping enablement and usage documentation in HPCM 1.11 HPCM-5803 be sure iscsi is enabled by default server side (was only enabled for new installs before) HPCM-5804 DOC: Update Cluster Manager Ports Information in Admin Guide HPCM-5806 Update TS query on GPU dashboards HPCM-5807 Upgrade AMDXIO to resolve AMDXIO seg fault issue HPCM-5808 Update Opensearch port from hardcoded values HPCM-5809 rhel89: admin fails to load on disk HPCM-5811 Update TS queries on CDU dashboards HPCM-5812 Don't require nscd and instead make it recommended package HPCM-5817 Remote Support: Drive Collection in Failed Drive Events HPCM-5820 Allow a name to be specified when creating a repo HPCM-5822 HTT: Update README file HPCM-5827 HTT: Invalid MCE bank and ipid pattern for EX255a HPCM-5828 HPCM 1.10: Logstash grafana dashboard does not have the external links to SIM, Monitoring services, AIOPs services and the SU_leader services HPCM-5831 Placeholder Task: Removal of deprecated features/options/CLI/files HPCM-5834 HPCM should not configure admin bonding when virtualized HPCM-5835 HPCM1.11/rhel93: miniroot creation fails HPCM-5836 Fix the build failure issue on black for noship diags rpm buils after deleting the lines corresponding to the common files of perf diags and noship diags in noship spec file to remove overlap. HPCM-5837 cm command line shouldn't allow underscore in network names - breaks named HPCM-5840 admin node luks2 password prompt should not echo password HPCM-5843 Add Admin Node VM awareness to switchconfig sanity_check HPCM-5850 q-ha upgrade: systemimager should not re-enable services on physical admin nodes HPCM-5851 sles15sp5: pulling image from node fails HPCM-5854 Diags failure on x86 sles15sp5 cluster because of craype lib mismatch HPCM-5856 async_apis: Fix Error with missing creds HPCM-5857 /opt/clmgr/tools/cm_check_ips always returns status 1 when checking IP addr HPCM-5858 cpwrcli: Fix perif-power-on|off option HPCM-5859 cm monitoring alerting status command to show the alert rule group HPCM-5866 Update power docs: HPCM-5870 Add nfs-utils for rhel and nfs-client for sles as recommended packages HPCM-5872 direct attach nVME may result in unable to boot on admin node HPCM-5875 jobmonitor: missing restserver_host option in config HPCM-5876 Change new cm-power-services port: HPCM-5877 stout7: iscsi target does not get setup on leaders HPCM-5880 Add kdump as a recommended package to cm-recommends HPCM-5881 HPCM 1.11: "cm monitoring slurm status" generates an error. 
HPCM-5882 HPCM 1.11: Alerting Grafana Framework issue HPCM-5885 SLES15 SP5 QU c-c fails on brltty and libbrllap versions too old HPCM-5886 QHA, linpack, and Slurm issue HPCM-5888 Add deprecation notice to alerta related CLI (all) HPCM-5890 Change Error Pattern for MCA bank error HPCM-5891 sensors_node pipeline doesn't exist HPCM-5892 HPCM 1.11: Alerting enable command not gracefully manage API errors (due to proxy issues) HPCM-5896 HPCM 1.11: The absence of slingshot-heartbeat.service is attributed to the absence of slingshot-monitoring in the cm package. HPCM-5897 Listener is not set with SS 2.2 telemetry config HPCM-5899 SAC-HA: RHEL 8.9 /images umount hung during FIPS upgrade procedure. HPCM-5902 Add screen to cm-managed-recommends HPCM-5903 HPCM1.10: Slingshot dashboards cannot be enabled using the command "cm monitoring slingshot enable". HPCM-5904 HPCM1.10: unable to establish the configuration of Slingshot with active FMN. HPCM-5906 cm-power-srv. REST API fix needed for multi-node ctrl HPCM-5911 cm-slingshot-udev doesn't sort devices correctly for EX254n nodes HPCM-5912 mofed bits don't make it to image when script is run HPCM-5913 HPCM1.11/rhel89: DL365gen11 panics booting rhel89 (rhel88, rhel810 are fine) [Make it possible for extra kernel params admin node post-boot] HPCM-5918 cm chassis cmc firmware show command fails HPCM-5919 asyncio_cmdb: Fix/update get_node_fields/get_compute_node_fields HPCM-5921 HPCM 1.11: Missing argument in sprintf at chc_wrapper.pm file HPCM-5922 HPCM 1.11: "cm health report slingshot link mbe" reports an 'linkmbe_parser' is not defined. HPCM-5926 AMDXIO core dumps HPCM-5928 CSM: Hardware Triage Toolkit unexpectedly failing when incorrect hardware config file is provided HPCM-5929 HPCM 1.11: run_babelstream not handling proper exit for empty arraysize HPCM-5931 HPCM 1.11 check_node.sh leaves executable shell script on the target node HPCM-5932 gluster leaders - disable insecure NFS by default HPCM-5933 HPCM 1.11: cxi_nic_failure.sh leaves executable bash script in the root dir of target nodes HPCM-5939 hpcg failing for EX254n HPCM-5940 HPCM-5521 is not merged into HPCM 1.11 HPCM-5942 HTT fails with TypeError on EX255aNC HPCM-5945 pulling image from running ubuntu node fails HPCM-5948 switchconfig reports head nodes bonding as active-backup, yet is set to 802.3ad HPCM-5949 fabricperf reporting lower bandwidth on NDR fabric HPCM-5950 HPCM 1.11 hwtriage leaves nfpga_print_regs script on the node controller and does not clean-up HPCM-5951 HPCM 1.11: During the configuration setup for Slingshot, the periodicity values fail to update in FMN (as observed through "fmn-show-telemetry-config"). HPCM-5953 HPCM 1.11: Inactive configurations of Slingshot are attempting to be activated, resulting in a mix-up of configurations. HPCM-5955 LDMS dashboard is not working and reporting this error "Failed to upgrade legacy queries Datasource" HPCM-5958 Update or remove .diagnose_node script from HTT HPCM-5959 cadmin throws traceback when checking root labels HPCM-5960 Recent change broke checkall function for cfirmware HPCM-5963 ss cassini alerts CXI_EVENT_LINK_DOWN not being resolved when CXI_EVENT_LINK_UP event received HPCM-5964 Fix: Perf and noship Diags overlap fixes HPCM-5965 HPCM 1.11: confluent-schema-registry.service remains in failed state on su-leader. Grafana Dashbord keep notifying the same. 
HPCM-5967 HPCM 1.11: Grafana dashboard not working due to timescaledb user issue HPCM-5968 cm image create missing option to create bt tarball HPCM-5969 Show if bt tarball was created in cm image show HPCM-5970 HPCM 1.11: Cluster health AMD command not working in Bardpeak node HPCM-5972 cm-power-services: rpm post, link update not working HPCM-5973 cm-power-service: controller endpoint don't include cec HPCM-5974 cm-power-services: add @secure wrapper to POST/PATCH routes HPCM-5977 sles15sp5 install fails, grub has no initrd, related to change from mkinitrd to dracut HPCM-5982 HPCM 1.11: Rectifier Check dashboard throws error about incorrect data source HPCM-5983 stout7 discover iscsi bug - roofs checking nfs in iscsi section addServiceNodeToCluster HPCM-5986 Fix diags build related to COOP-1296 HPCM-5987 Add latest HTT to HPCM 1.11 HPCM-5991 cm node slot copy and cm-node-slot-copy.8 should specify that leaders and non-diskful nodes aren't supported HPCM-5992 amdgpu-xhpl not working on bardpeak HPCM-5993 EX235n nvidia-gpu-xhpl breaking HPCM-5997 "Failed to upgrade legacy queries Datasource" and fixing Grafana UID HPCM-5998 cm node add / discover fails; can't find admin in head mgmt net HPCM-6000 export_fabric_template fails in SlingShot 2.2 , so cm health report does not work HPCM-6001 Update TS queries on PDU Dashboards HPCM-6003 HPCM 1.11: All functional health checks are showing failures on Ubuntu nodes. HPCM-6004 HPCM 1.11: The "cm health check cpuchk" command is indicating failures HPCM-6007 HPCM 1.10: su-leader-setup --help should run despite configuration issues. HPCM-6008 cm node slot copy should show progress of sync HPCM-6009 Q-HA: SLES15 SP5: Physical hosts are installed such that 2,049 nfsd processes are running HPCM-6011 HPCM 1.11: cm health alertman switch command not working HPCM-6012 HPCM 1.11: Slingshot Heartbeat not working properly in alerting. HPCM-6013 provide wiki instructions mfg - stage and re-use images and repos HPCM-6014 system-power-capping get health fails HPCM-6015 Q-HA: Rocky: Recovery from link down may not bring gluster online HPCM-6021 nvdidiagpu-xhpl failing on EX235n HPCM-6022 HPCM1.11 PBS alert not working HPCM-6023 Add copytruncate to various services logrotate settings HPCM-6029 fix typo in echo statement miniroot-functions HPCM-6031 HPCM1.11 sles15sp6 image fails - no mkinitrd HPCM-6033 HPCM 1.11: Unable to retrieve the report of link MBE events within a specific time or timeframe. HPCM-6034 HPCM 1.11: Unable to display the report with additional fields such as CableId in the MBE report. HPCM-6035 HPCM 1.11: Unable to fetch the Columbia switch/port details for inclusion in Slingshot health reporting. HPCM-6036 HPCM 1.11: Unable to execute any functionalities related to "cm health report slingshot port rxpause/txpause". HPCM-6043 mpower --gpu --get-power: Not Displaying values HPCM-6046 nvdidiagpu-xhpl failing on EX235n - prob size not calculated for input percent HPCM-6047 HPCM 1.11: Executing "cm monitoring rackmap map temperature/power -d" results in NullValueNotAllowed errors. HPCM-6049 HPCM 1.11: Slingshot switch group status not working properly in grafana HPCM-6050 Rebuild xkdiags and rank for SLES CPE 23.12 (x86) HPCM-6053 False Ping Failures on large non-su-leader systems HPCM-6054 Update docs to properly upgrade and versionlock conflicting HPCM packages if COS or EPEL is used HPCM-6055 85-nid-hostname does not work with non-padded hostnames HPCM-6057 cmcinventory: arch template info not working. 
HPCM-6058 XD224: Hello_world diag execution hangs while executing. HPCM-6059 XD224: test4 diag execution failing with slurm error. HPCM-6061 XD224: hpcg diag execution failing with slurm error. HPCM-6063 HPCM 1.11: Please handle Error messages for "cm monitoring pbs status/enable" command HPCM-6064 HPCM 1.11: AIOPS service fails after admin upgrade from HPCM1.10 to HPCM1.11 HPCM-6067 Upgrade SDU to 2.3.2 HPCM-6070 HPCM 1.11: Upgrade: "iSCSI Login negotiation failed. rx_data returned 0, expecting 48." error messages keep flooding on su-leader's console as well as in /var/log/messages/ after rupgrade HPCM-6071 HPCM 1.11: UPGRADE: Upgrade documentation should include ISCSI related steps after upgrade from hpcm1.10 to hpcm1.11 HPCM-6072 HPCM 1.11: Upgrade: Some slingshot connecters do not enable after upgrade from HPCM1.10 to HPCM1.11 HPCM-6073 Add/Fix cbios & cpower support for XD224 HPCM-6074 add-ipv6-bond0.py should use full path to call 'cm' command HPCM-6077 On systems with SU-leaders, opentracker-ipv4 and cm-aria2c start before gluster is mounted HPCM-6078 Add latest HTT to HPCM 1.11 Feb 15 HPCM-6079 linpack,hpcc & stream not getting installed as part of perf_diag HPCM-6080 HPCM 1.11: AIOPs grafana dashboards have to use mon_reader user instead of postgres user HPCM-6081 mpower: XD224 fix get-limit (NVIDIA) HPCM-6084 disable-su-leader should talk about iscsi in addition to nfs HPCM-6088 HPCM 1.11: After leaders reboot elk services not running: opensearch HPCM-6089 DOC: HPCM 1.11: Upgrade: Old grafana dashbords needs to be handle/deleted after upgrade. HPCM-6090 HPCM 1.11: After upgrade alerting got disabled so not getting alerts HPCM-6097 HPCM 1.11 + SS 2.2: In "cm monitoring slingshot config" collector is set as FMN, But in "/telemetry/configurations/hpcm_config/" collector is coming as Listener. HPCM-6100 Recompile ARM HPCG with new CPE HPCM-6101 80-tftp-setup: grep: /usr/lib/systemd/system/tftp.socket.d/tftp- override.conf: No such file or directory HPCM-6102 RHEL8: Rebooting into cloned-slot fails when fips enabled HPCM-6103 HPCM 1.11 + SS 2.2: In "cm monitoring slingshot config" periodicity is set as 10, But in "/telemetry/configurations/hpcm_config/" periodicity is coming as 60. HPCM-6104 80-logstash-configure scp'ing to localhost, which may not be permitted to ssh HPCM-6111 cm health report slingshot refresh not dumping neighbour ports for local and global ports HPCM-6112 HPCM1.10:ROCKY88: Patch: No way to disable slingshot_congestion from alerting because of it we are getting failure messages in /var/log/messages on non slingshot system HPCM-6116 release note certain iscsi errors that happen when a node is assigned to an image that isn't activated HPCM-6117 Q-HA: gluster | dshbak -c after stopping libvirtd does not match upgrade example HPCM-6118 cm-logrotate-parallel needs to avoid any log files that were already compressed HPCM-6119 Stopping cmdb causes several services to also stop HPCM-6120 Final HPCM 1.11 Aruba Firmware Recommendation HPCM-6123 HPCM 1.11: UPgrade: Patroni service not started after admin upgrade from hpcm1.10 to hpcm1.10. All monitoring were enabled and runing before upgrade. 
HPCM-6124 PIP warnings during library installation execution failed in RHEL89 and ROCKY89 distros HPCM-6129 HPCM 1.11: UPGRADE: Running scriptlet: clusterhealth seems not successefull during upgrade HPCM-6131 HPCM1.11/cfirmware fails to update slingshot switch HPCM-6132 Rackmap throws exception when map doesn't exist HPCM-6133 pdu-collect: Readme not rendering from landing page HPCM-6134 Fix output of timescale show schema upgrades HPCM-6135 showrev does not output CM Release and CM Build when -j selected HPCM-6136 Q-HA: internal-set-root-label is not working on Rocky HPCM-6137 stout7: rocky89/rhel89: su-leader install failed due to 80-csn-distro-services failure HPCM-6138 remlog-collect: Regression with session_key removal HPCM-6144 HPCM 1.11: UPGRADE: On ICE admin sles15sp5 upgrade from hpcm1.10 to hpcm1.11 fails to upgrade kernel.  HPCM-6146 Add file containing port information in the /docs directory of the ISO and on the system HPCM-6147 XD224 Nodes Need Sensormon support HPCM-6148 HPCM 1.11: cm support moncollect syntax does not reflect that -w and -n dependent on each other and are not mutually exclusive HPCM-6149 HPCM 1.11: ssh fails on name resolution while running cm support moncollect HPCM-6150 HPCM 1.11: UPGRADE: Image details like(imageObject, imageObjectSize, imageObjectCreationTime) shows Undefined after image upgrade. "cm image show -v" command takes more time than expected.  HPCM-6152 HPCM 1.11: ALERTING: cm health alertman sim -s throws Exception: 'datasource' ERROR: Failed to connect to alertmanager. HPCM-6154 xkdiags failing for x86 and arm HPCM-6156 HPCM 1.11: AIOPS: Trainer bugs HPCM-6157 Q-ha SLES15sp5 QU1 qemu-generated adminvm.xml files generated HPCM-6163 HPCM 1.11: UPGRADE: monitoringdb Version and USER do not upgrade. Database connection fails with error "psql: FATAL: role "mon_reader" does not exist" HPCM-6164 fabricperf not giving expected performance HPCM-6168 cpwr REST API: content flag not being passed along HPCM-6169 HPCM 1.11: UPGRADE: Ubuntu image fails to upgrade. 
HPCM-6174 HPCM 1.11: Regression: python urllib3 error with hwtriage CLI HPCM-6175 mpwrcli: node --set-limit Regression (uri_key no longer 'Chassis' HPCM-6176 Add timeout setting to pdf-settings.ini HPCM-6180 Add latest HTT to HPCM 1.11 HPCM-6182 cm controller show produces exception when controller with missing nic exists in db HPCM-6184 HPCM 1.11: Regression: cm monitoring pbs status throws error if telegraf rpm is not installed HPCM-6187 Parser.pm includes admin when a node hostname that matches the admin hostname is specified HPCM-6188 cinstallman should use /opt/clmgr/bin/pdsh instead of the default pdsh HPCM-6189 build online diags with rocm 6.0.0 (black) HPCM-6190 sles15sp5: logrotate service is failing, some configs call /etc/init.d/syslog which no longer exists HPCM-6197 HPCM 1.11: cm monitoring timescaledb show --patroni-state throws an error HPCM-6198 HPCM 1.11: Regression: timescaledb show command when using --patroni-state option fails HPCM-6200 Change HPCM to 1.11 in the message so CrayOPC can filter events correctly HPCM-6202 Setting up timescale fails when using --no-schema-upgrade HPCM-6208 HPCM 1.11: Upgrade:AARCH64:Rocky88-Rocky89: Running scriptlet: field_diags_licensed_aarch64 script fails with error during upgrade HPCM-6209 DOC: Detailed performance dashboard uses DESC when it should be ASC in query HPCM-6211 Slinshot metric names exceed Postgres table name limit HPCM-6213 Fabric summary dashboard use incorrect datasource HPCM-6214 Move slingularity examples out of manuals and into separate doc HPCM-6215 cm support collect does have a separator between repo group outputs HPCM-6222 Unittests Only: asynctest is dead. Fix async unittest HPCM-6230 AIC: NHC checks not running on compute nodes HPCM-6234 Remove all the clusterstor lustre HPCM-6242 HPCM 1.11: AMD cm health check commands failed HPCM-6243 Diags build failing on black which is building with rocm HPCM-6245 HPCM 1.11: Keep the Timescale Grafana alerting CDU rules disable by default HPCM-6247 HPCM 1.11: Regression: All cm commands genereate SyntaxError: unmatched ')' in cm.log HPCM-6248 Regression:Data from AMD GPUs is not being generated in native monitoring. HPCM-6250 current # cm wlm install setup for slurm has a bug in it HPCM-6254 HPCM1.11 Reduced the slingshot switch alert query to 1 min ****************************************************************************** Product-ID: HPE Performance Cluster Manager 1.11.0 - Update 01 Last edit: Wed Mar 27 14:13:44 CDT 2024