Introduction

In the world of high-performance computing and deep learning, the NVIDIA DGX H100 is a true titan, with a hardware ensemble designed to deliver unmatched performance. With eight NVIDIA H100 GPUs, 18 NVLink connections per GPU, four NVIDIA NVSwitches, and 2 terabytes of RAM, the system is an ideal choice for intensive AI workloads. To make use of its GPUs, the NVIDIA/CUDA drivers must be installed and the Fabric Manager activated. The goal is to get a CUDA-enabled container up and running.

The target system has Red Hat Enterprise Linux release 8.8 (Ootpa) installed.

Linux hopper-dgx 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Thu May 18 10:27:05 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
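
For reference, the information above can be reproduced with the usual commands; nothing DGX-specific here, just a quick sanity check before picking a driver package:

cat /etc/redhat-release   # Red Hat Enterprise Linux release 8.8 (Ootpa)
uname -a                  # running kernel; dkms needs matching kernel-devel/kernel-headers to build the modules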

First, an up-to-date driver bundle must be downloaded from the official NVIDIA website. Make sure the package is intended for data center GPUs and is not a consumer driver.

Official Drivers | NVIDIA

In addition, the desired CUDA version must be selected; in my case I took the most recent one. A quick indication that you have the right package is the file size: the standard NVIDIA driver is roughly 350 MB, while the data center package including the CUDA driver and Fabric Manager is almost twice that size:

Data Center Driver For Linux RHEL 8
 
Version:	        535.86.10
Release Date:	        2023.7.31
Operating System:	Linux 64-bit RHEL 8
CUDA Toolkit:	        12.2
Language:	        English (US)
File Size:	        603.76 MB

Direct link to the driver used here.

As our system is air-gapped and cannot reach the internet, we install the local driver repository package.

rpm -i nvidia-driver-local-repo-rhel8-535.86.10-1.0-1.x86_64.rpm
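
To confirm that the local repository was actually registered, the package metadata and the repo list can be checked; a small, hedged sanity check (the repository id may differ depending on the download):

rpm -qpi nvidia-driver-local-repo-rhel8-535.86.10-1.0-1.x86_64.rpm   # show metadata of the downloaded repo package
dnf repolist | grep -i nvidia                                        # the new local repository should show up here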

First we install the hard dependency dkms (Dynamic Kernel Module Support), e.g. from EPEL, which is needed to compile and load the kernel modules. Right after that, nvidia-drivers and cuda-drivers are installed. The nvidia-fabric-manager package is necessary too, as the H100s are connected via NVLink and NVSwitches.

Note: The version of Fabric Manager must match the NVIDIA driver installed.

dnf install podman dkms nvidia-drivers cuda-drivers nvidia-fabric-manager
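
Since the kernel module, the CUDA driver and the Fabric Manager all have to come from the same branch, it is worth verifying the installed versions before going further. A hedged check (nvidia-smi only answers once the kernel module is loaded, which may require a reboot):

rpm -q nvidia-fabric-manager                               # should report the 535.86.10 branch
modinfo -F version nvidia                                  # version of the dkms-built kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader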

For GPU support in containers we need to install the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit), which is available in the official NVIDIA repositories.

rpm -i libnvidia-container1
rpm -i libnvidia-container-tools
rpm -i nvidia-container-toolkit-base
rpm -i nvidia-container-toolkit
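
A quick check that the toolkit landed where it should (a minimal sketch; exact versions will differ):

nvidia-ctk --version              # CLI used below to generate the CDI spec
rpm -qa | grep nvidia-container   # all four container packages should be listed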

Then we enable and start the nvidia-fabricmanager and nvidia-persistenced systemd services. The Fabric Manager manages the NVLink/NVSwitch fabric and thus inter-GPU communication, while the persistence daemon keeps the driver state loaded, which fixes the otherwise slow nvidia-smi responses caused by reinitializing driver and hardware state on every call.

systemctl enable --now nvidia-fabricmanager.service
systemctl enable --now nvidia-persistenced.service
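
Both units should come up cleanly; once nvidia-persistenced is running, nvidia-smi should show Persistence-M as "On" for every GPU. A short, hedged verification:

systemctl is-active nvidia-fabricmanager.service nvidia-persistenced.service
nvidia-smi   # expect 8x "NVIDIA H100 80GB HBM3" with Persistence-M "On"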

In the last step we generate a CDI (Container Device Interface) specification and grant read access on the resulting file to non-root users. CDI provides a standardized way for container runtimes to support third-party devices, simplifying the integration of specialized hardware. Its abstract device representation and focus on device awareness enable containers to use external devices efficiently.

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
chmod 644 /etc/cdi/nvidia.yaml
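
The generated spec can be checked without starting a full workload: nvidia-ctk can list the device names it just wrote, and a throwaway container is enough to confirm that the GPUs are injected. A hedged sketch; ubi8 is just a placeholder base image that must already be present locally on an air-gapped system:

nvidia-ctk cdi list                                       # should list nvidia.com/gpu=0..7 and nvidia.com/gpu=all
podman run --rm --device nvidia.com/gpu=all ubi8 nvidia-smi -L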

HPCG Benchmark

Now you can start the container with the CDI GPU devices and the InfiniBand (IB) devices enabled.

podman run -it --rm \
               --net host \
               --ipc host \
               --device nvidia.com/gpu=all \
               --device /dev/infiniband/issm0 \
               --device /dev/infiniband/rdma_cm \
               --device /dev/infiniband/umad0 \
               --device /dev/infiniband/uverbs0 \
               nvcr.io/nvidia/hpc-benchmarks:23.5
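
Inside the container it is worth a quick look that all eight GPUs are actually visible before kicking off the benchmark (nvidia-smi is mounted into the container by the toolkit):

root@hopper-dgx:/workspace# nvidia-smi -L   # expect GPU 0..7, NVIDIA H100 80GB HBM3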

In the running container just start the benchmark:

root@hopper-dgx:/workspace# mpirun -n 8 ./hpcg.sh --dat /workspace/hpcg-linux-x86_64/sample-dat/hpcg.dat

HPCG-NVIDIA 23.5.0  -- NVIDIA accelerated HPCG benchmark -- NVIDIA

Start of application (128 OMP threads)...
Cusparse version 12.1
Format = SELL, Optimization = HAND-TUNED, Coloring = LEGACY

GPU: 'NVIDIA H100 80GB HBM3' 
Memory use: 16667 MB / 81008 MB
2x2x2 process grid
256x256x256 local domain
Initial Residual = 11304.4
Iteration = 1   Scaled Residual = 0.186679
Iteration = 2   Scaled Residual = 0.102562
Iteration = 3   Scaled Residual = 0.0708169
Iteration = 4   Scaled Residual = 0.0541091
Iteration = 5   Scaled Residual = 0.0437576
Iteration = 6   Scaled Residual = 0.0367107
Iteration = 7   Scaled Residual = 0.0316121
Iteration = 8   Scaled Residual = 0.0277661
Iteration = 9   Scaled Residual = 0.0247719
Iteration = 10   Scaled Residual = 0.0223765
Iteration = 11   Scaled Residual = 0.0204105
Iteration = 12   Scaled Residual = 0.0187576
Iteration = 13   Scaled Residual = 0.0173375
Iteration = 14   Scaled Residual = 0.0160995
Iteration = 15   Scaled Residual = 0.0150126
Iteration = 16   Scaled Residual = 0.0140567
Iteration = 17   Scaled Residual = 0.0132171
Iteration = 18   Scaled Residual = 0.0124801
Iteration = 19   Scaled Residual = 0.01183
Iteration = 20   Scaled Residual = 0.0112505
Iteration = 21   Scaled Residual = 0.0107262
Iteration = 22   Scaled Residual = 0.010244
Iteration = 23   Scaled Residual = 0.00979711
Iteration = 24   Scaled Residual = 0.00938376
Iteration = 25   Scaled Residual = 0.00900528
Iteration = 26   Scaled Residual = 0.00866434
Iteration = 27   Scaled Residual = 0.00836168
Iteration = 28   Scaled Residual = 0.00809436
Iteration = 29   Scaled Residual = 0.00785625
Iteration = 30   Scaled Residual = 0.00763841
Iteration = 31   Scaled Residual = 0.00743172
Iteration = 32   Scaled Residual = 0.00723063
Iteration = 33   Scaled Residual = 0.00703311
Iteration = 34   Scaled Residual = 0.00683943
Iteration = 35   Scaled Residual = 0.00664959
Iteration = 36   Scaled Residual = 0.00646027
Iteration = 37   Scaled Residual = 0.00626613
Iteration = 38   Scaled Residual = 0.00606467
Iteration = 39   Scaled Residual = 0.00585985
Iteration = 40   Scaled Residual = 0.00566094
Iteration = 41   Scaled Residual = 0.00547446
Iteration = 42   Scaled Residual = 0.00530383
Iteration = 43   Scaled Residual = 0.00515553
Iteration = 44   Scaled Residual = 0.0050298
Iteration = 45   Scaled Residual = 0.00491953
Iteration = 46   Scaled Residual = 0.00481944
Iteration = 47   Scaled Residual = 0.00472162
Iteration = 48   Scaled Residual = 0.00462407
Iteration = 49   Scaled Residual = 0.0045241
Iteration = 50   Scaled Residual = 0.00442526
WARNING: PERFORMING UNPRECONDITIONED ITERATIONS
Call [0] Number of Iterations [11] Scaled Residual [1.19644e-14]
WARNING: PERFORMING UNPRECONDITIONED ITERATIONS
Call [1] Number of Iterations [11] Scaled Residual [1.19644e-14]
Call [0] Number of Iterations [2] Scaled Residual [1.4468e-17]
Call [1] Number of Iterations [2] Scaled Residual [1.4468e-17]
Departure from symmetry (scaled) for SpMV abs(x'*A*y - y'*A*x) = 0
Departure from symmetry (scaled) for MG abs(x'*Minv*y - y'*Minv*x) = 0
SpMV call [0] Residual [0]
SpMV call [1] Residual [0]
Initial Residual = 11304.4
Iteration = 1   Scaled Residual = 0.220782
Iteration = 2   Scaled Residual = 0.119582
Iteration = 3   Scaled Residual = 0.0813921
Iteration = 4   Scaled Residual = 0.0616581
Iteration = 5   Scaled Residual = 0.0496728
Iteration = 6   Scaled Residual = 0.0415839
Iteration = 7   Scaled Residual = 0.0357695
Iteration = 8   Scaled Residual = 0.0313917
Iteration = 9   Scaled Residual = 0.0279657
Iteration = 10   Scaled Residual = 0.0252145
Iteration = 11   Scaled Residual = 0.0229567
Iteration = 12   Scaled Residual = 0.0210669
Iteration = 13   Scaled Residual = 0.0194657
Iteration = 14   Scaled Residual = 0.0180904
Iteration = 15   Scaled Residual = 0.0168947
Iteration = 16   Scaled Residual = 0.0158475
Iteration = 17   Scaled Residual = 0.0149227
Iteration = 18   Scaled Residual = 0.0141012
Iteration = 19   Scaled Residual = 0.0133675
Iteration = 20   Scaled Residual = 0.0127084
Iteration = 21   Scaled Residual = 0.0121147
Iteration = 22   Scaled Residual = 0.0115778
Iteration = 23   Scaled Residual = 0.0110906
Iteration = 24   Scaled Residual = 0.0106464
Iteration = 25   Scaled Residual = 0.0102382
Iteration = 26   Scaled Residual = 0.00986058
Iteration = 27   Scaled Residual = 0.00950818
Iteration = 28   Scaled Residual = 0.00917513
Iteration = 29   Scaled Residual = 0.0088578
Iteration = 30   Scaled Residual = 0.00855244
Iteration = 31   Scaled Residual = 0.00825767
Iteration = 32   Scaled Residual = 0.00797354
Iteration = 33   Scaled Residual = 0.00770055
Iteration = 34   Scaled Residual = 0.00743929
Iteration = 35   Scaled Residual = 0.00719158
Iteration = 36   Scaled Residual = 0.00695714
Iteration = 37   Scaled Residual = 0.00673668
Iteration = 38   Scaled Residual = 0.00653021
Iteration = 39   Scaled Residual = 0.00633844
Iteration = 40   Scaled Residual = 0.00616225
Iteration = 41   Scaled Residual = 0.00600213
Iteration = 42   Scaled Residual = 0.00585373
Iteration = 43   Scaled Residual = 0.00571183
Iteration = 44   Scaled Residual = 0.00557234
Iteration = 45   Scaled Residual = 0.00543671
Iteration = 46   Scaled Residual = 0.00530511
Iteration = 47   Scaled Residual = 0.00517327
Iteration = 48   Scaled Residual = 0.00503965
Iteration = 49   Scaled Residual = 0.00490952
Iteration = 50   Scaled Residual = 0.00478137
Iteration = 51   Scaled Residual = 0.00465656
Iteration = 52   Scaled Residual = 0.00453913
Iteration = 53   Scaled Residual = 0.004427
Iteration = 54   Scaled Residual = 0.00432293
CG Times = 2924 
Call [0] Scaled Residual [0.00432293]
Call [1] Scaled Residual [0.00432293]
Call [2] Scaled Residual [0.00432293]
...
Call [2921] Scaled Residual [0.00432293]
Call [2922] Scaled Residual [0.00432293]
Call [2923] Scaled Residual [0.00432293]
HPCG-Benchmark
version=3.1
Release date=March 28, 2019
Machine Summary=
Machine Summary::Distributed Processes=8
Machine Summary::Threads per processes=128
Global Problem Dimensions=
Global Problem Dimensions::Global nx=512
Global Problem Dimensions::Global ny=512
Global Problem Dimensions::Global nz=512
Processor Dimensions=
Processor Dimensions::npx=2
Processor Dimensions::npy=2
Processor Dimensions::npz=2
Local Domain Dimensions=
Local Domain Dimensions::nx=256
Local Domain Dimensions::ny=256
Local Domain Dimensions::Lower ipz=0
Local Domain Dimensions::Upper ipz=1
Local Domain Dimensions::nz=256
########## Problem Summary  ##########=
Setup Information=
Setup Information::Setup Time=0.0600462
Linear System Information=
Linear System Information::Number of Equations=134217728
Linear System Information::Number of Nonzero Terms=3609741304
Multigrid Information=
Multigrid Information::Number of coarse grid levels=3
Multigrid Information::Coarse Grids=
Multigrid Information::Coarse Grids::Grid Level=1
Multigrid Information::Coarse Grids::Number of Equations=16777216
Multigrid Information::Coarse Grids::Number of Nonzero Terms=449455096
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
Multigrid Information::Coarse Grids::Grid Level=2
Multigrid Information::Coarse Grids::Number of Equations=2097152
Multigrid Information::Coarse Grids::Number of Nonzero Terms=55742968
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
Multigrid Information::Coarse Grids::Grid Level=3
Multigrid Information::Coarse Grids::Number of Equations=262144
Multigrid Information::Coarse Grids::Number of Nonzero Terms=6859000
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
########## Memory Use Summary  ##########=
Memory Use Information=
Memory Use Information::Total memory used for data (Gbytes)=95.9584
Memory Use Information::Memory used for OptimizeProblem data (Gbytes)=0
Memory Use Information::Bytes per equation (Total memory / Number of Equations)=714.946
Memory Use Information::Memory used for linear system and CG (Gbytes)=84.4482
Memory Use Information::Coarse Grids=
Memory Use Information::Coarse Grids::Grid Level=1
Memory Use Information::Coarse Grids::Memory used=10.09
Memory Use Information::Coarse Grids::Grid Level=2
Memory Use Information::Coarse Grids::Memory used=1.26214
Memory Use Information::Coarse Grids::Grid Level=3
Memory Use Information::Coarse Grids::Memory used=0.157993
########## V&V Testing Summary  ##########=
Spectral Convergence Tests=
Spectral Convergence Tests::Result=PASSED
Spectral Convergence Tests::Unpreconditioned=
Spectral Convergence Tests::Unpreconditioned::Maximum iteration count=11
Spectral Convergence Tests::Unpreconditioned::Expected iteration count=12
Spectral Convergence Tests::Preconditioned=
Spectral Convergence Tests::Preconditioned::Maximum iteration count=2
Spectral Convergence Tests::Preconditioned::Expected iteration count=2
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon=
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Result=PASSED
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Departure for SpMV=0
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Departure for MG=0
########## Iterations Summary  ##########=
Iteration Count Information=
Iteration Count Information::Result=PASSED
Iteration Count Information::Reference CG iterations per set=50
Iteration Count Information::Optimized CG iterations per set=54
Iteration Count Information::Total number of reference iterations=146200
Iteration Count Information::Total number of optimized iterations=157896
########## Reproducibility Summary  ##########=
Reproducibility Information=
Reproducibility Information::Result=PASSED
Reproducibility Information::Scaled residual mean=0.00432293
Reproducibility Information::Scaled residual variance=0
########## Performance Summary (times in sec) ##########=
Benchmark Time Summary=
Benchmark Time Summary::Optimization phase=0.0604498
Benchmark Time Summary::DDOT=48.1708
Benchmark Time Summary::WAXPBY=72.0869
Benchmark Time Summary::SpMV=312.546
Benchmark Time Summary::MG=1377.25
Benchmark Time Summary::Total=1810.55
Floating Point Operations Summary=
Floating Point Operations Summary::Raw DDOT=1.2794e+14
Floating Point Operations Summary::Raw WAXPBY=1.2794e+14
Floating Point Operations Summary::Raw SpMV=1.16104e+15
Floating Point Operations Summary::Raw MG=6.50166e+15
Floating Point Operations Summary::Total=7.91857e+15
Floating Point Operations Summary::Total with convergence overhead=7.33201e+15
GB/s Summary=
GB/s Summary::Raw Read B/W=26940.1
GB/s Summary::Raw Write B/W=6226.78
GB/s Summary::Raw Total B/W=33166.9
GB/s Summary::Total with convergence and optimization phase overhead=30123.9
GFLOP/s Summary=
GFLOP/s Summary::Raw DDOT=2655.96
GFLOP/s Summary::Raw WAXPBY=1774.8
GFLOP/s Summary::Raw SpMV=3714.77
GFLOP/s Summary::Raw MG=4720.74
GFLOP/s Summary::Raw Total=4373.58
GFLOP/s Summary::Total with convergence overhead=4049.61
GFLOP/s Summary::Total with convergence and optimization phase overhead=3972.31
User Optimization Overheads=
User Optimization Overheads::Optimization phase time (sec)=0.0604498
User Optimization Overheads::Optimization phase time vs reference SpMV+MG time=0.0123809
DDOT Timing Variations=
DDOT Timing Variations::Min DDOT MPI_Allreduce time=4.21857
DDOT Timing Variations::Max DDOT MPI_Allreduce time=12.1647
DDOT Timing Variations::Avg DDOT MPI_Allreduce time=7.45496
Final Summary=
Final Summary::HPCG result is VALID with a GFLOP/s rating of=3972.31
Final Summary::HPCG 2.4 rating for historical reasons is=4010.46
Final Summary::Please upload results from the YAML file contents to=http://hpcg-benchmark.org