add hook for building GROMACS on NVIDIA Grace CPUs with hwloc support #225

Open
bedroge wants to merge 3 commits into EESSI:main from bedroge:gromacs_202602_grace_fix

bedroge commented May 8, 2026

Trying the suggestion from EESSI/software-layer#1497 (comment).

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

bedroge commented May 8, 2026

Ah, need to wait for a dirty frag mitigation to be deployed.

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

eessi-bot-jsc Bot commented May 8, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14735133

| date | job status | comment |
| --- | --- | --- |
| May 08 10:33:30 UTC 2026 | submitted | job id 14735133 awaits release by job manager |
| May 08 10:34:19 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 08 10:35:23 UTC 2026 | running | job 14735133 is running |
| May 08 11:17:50 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14735133.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17782386700.tar.gz
size: 0 MiB (27304 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
| May 08 11:17:50 UTC 2026 | test result | 😢 FAILURE (click triangle for details) |
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14735133.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case
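For reference, a minimal sketch of how the last check could be reproduced locally against the ReFrame summary line from this report. It assumes the brackets in the pattern are escaped in the bot's real configuration (as rendered here, `[...]` would be read as a character class), and `counts_pattern` and `parse_summary` are hypothetical helpers for illustration, not part of the bot.

```python
import re

# Summary line copied from the bot report above.
summary = "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)"

# Assumption: the literal brackets are escaped in the real pattern.
failed_pattern = re.compile(r"\[\s*FAILED\s*\].*Ran .* test case")

# Hypothetical helper pattern (not part of the bot) to pull out the counts.
counts_pattern = re.compile(
    r"Ran (\d+)/(\d+) test case\(s\) from (\d+) check\(s\) "
    r"\((\d+) failure\(s\), (\d+) skipped, (\d+) aborted\)"
)


def parse_summary(line):
    """Return the ReFrame counts as a dict, or None if the line does not match."""
    m = counts_pattern.search(line)
    if m is None:
        return None
    keys = ("ran", "total", "checks", "failures", "skipped", "aborted")
    return dict(zip(keys, map(int, m.groups())))


counts = parse_summary(summary)
print(bool(failed_pattern.search(summary)), counts)
```

The parsed counts make the summary easier to read: 17 cases ran and 12 were skipped, which accounts for all 29, and 4 of the 17 that ran failed.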

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

eessi-bot-jsc Bot commented May 8, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14736649

| date | job status | comment |
| --- | --- | --- |
| May 08 18:22:28 UTC 2026 | submitted | job id 14736649 awaits release by job manager |
| May 08 18:22:54 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 08 18:23:57 UTC 2026 | running | job 14736649 is running |
| May 08 19:07:24 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14736649.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17782668120.tar.gz
size: 0 MiB (27318 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
| May 08 19:07:24 UTC 2026 | test result | 😢 FAILURE (click triangle for details) |
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14736649.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

bedroge commented May 8, 2026

Hmm, it now runs the tests with `export HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 && make check -j 16`, but I'm still getting this:

[ RUN      ] HardwareTopologyTest.NumaCacheSelfconsistency
/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:190: Failure
Expected equality of these values:
  processorsinNumaNudes
    Which is: 576
  hwTop.machine().logicalProcessors.size()
    Which is: 72

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

[  FAILED  ] HardwareTopologyTest.NumaCacheSelfconsistency (14 ms)
[----------] 4 tests from HardwareTopologyTest (57 ms total)
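
For context, a rough sketch of how a hook could end up producing the test command quoted above. This is not the code in this PR: the function name and the plain dict standing in for the easyconfig are hypothetical, and the use of EasyBuild's `pretestopts` parameter is an assumption. Only the environment variable and its value are taken from the command line in this comment.

```python
# Taken from the command quoted above; everything else here is illustrative.
HWLOC_EXPORT = "export HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 && "


def prepend_hwloc_export(cfg):
    """Prefix the test command options with the hwloc export, at most once.

    `cfg` is a plain dict standing in for an easyconfig (hypothetical);
    `pretestopts` is the EasyBuild parameter placed in front of the test command.
    """
    current = cfg.get("pretestopts", "")
    if HWLOC_EXPORT.strip() not in current:
        cfg["pretestopts"] = HWLOC_EXPORT + current
    return cfg


cfg = {"name": "GROMACS", "version": "2026.2", "pretestopts": ""}
prepend_hwloc_export(cfg)
prepend_hwloc_export(cfg)  # second call leaves the value unchanged

# Rebuild the command the way it appears in the build log.
test_cmd = cfg["pretestopts"] + "make check -j 16"
print(test_cmd)
```

Guarding against a double prefix matters if the hook can run more than once for the same easyconfig.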

cc @al42and

al42and commented May 8, 2026

Huh, fun. The GPU NUMA node seems to have gone away (processorsinNumaNudes went down from 9*72 to 8*72; need to make the name less lewd), but there are still seven mystery NUMA nodes left.
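
Spelling out the arithmetic behind that count, as a quick sketch. It assumes every NUMA node reports all 72 logical processors, which is what the 9*72 and 8*72 figures imply; the variable names are made up for illustration.

```python
logical_processors = 72         # hwTop.machine().logicalProcessors.size() in the log
processors_in_numa_nodes = 576  # value reported by the failing equality check
zero_memory_failures = 7        # number of "(n.memory) > (0)" failures in the log

# Assumption: each NUMA node lists all 72 logical processors.
numa_nodes_seen = processors_in_numa_nodes // logical_processors
nodes_with_memory = numa_nodes_seen - zero_memory_failures
print(numa_nodes_seen, nodes_with_memory)
```

So the test still sees 8 NUMA nodes, of which only one has memory attached.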

Any chance you can get hwloc XML from the machine? `hwloc-ls aarch64-grace.xml`
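
Once such an export exists, something like the following could list the NUMA nodes and flag the memoryless ones. This is a sketch under assumptions: it expects hwloc's v2 XML layout, where NUMA nodes are `object` elements with `type="NUMANode"` and a `local_memory` attribute in bytes, and the `SAMPLE` string is a made-up, heavily trimmed stand-in, not data from the Grace node.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real export has many more objects and attributes.
SAMPLE = """<topology>
  <object type="Machine" os_index="0">
    <object type="NUMANode" os_index="0" local_memory="515396075520"/>
    <object type="NUMANode" os_index="1" local_memory="0"/>
    <object type="NUMANode" os_index="2" local_memory="0"/>
  </object>
</topology>"""


def numa_nodes(xml_text):
    """Return (os_index, local_memory_bytes) for every NUMANode object."""
    root = ET.fromstring(xml_text)
    return [
        (int(obj.get("os_index", "-1")), int(obj.get("local_memory", "0")))
        for obj in root.iter("object")
        if obj.get("type") == "NUMANode"
    ]


nodes = numa_nodes(SAMPLE)
empty = [idx for idx, mem in nodes if mem == 0]
print(nodes, empty)
```

For the real file, the contents of `aarch64-grace.xml` would be read and passed to `numa_nodes` in place of `SAMPLE`.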
