Enable user to take ownership of MPI#722
Conversation
…unction name to comm_getMpiComm
…ing for subcomm compiled
…for SubComm, because of course it enforces CXX
…declare that they take ownership of MPI
… updated setComm for init only workflow
…ything but it makes MPI library implementers less nervous
…ith user owned MPI
|
On the plus side, a distributed QFT benchmark circuit runs perfectly correctly: |
| */ | ||
| void initCustomQuESTEnv(int useDistrib, int useGpuAccel, int useMultithread); | ||
|
|
||
| void initCustomMpiQuESTEnv(int useDistrib, int userOwnsMpi, int useGpuAccel, int useMultithread); |
There was a problem hiding this comment.
Do we not want a new initialiser so its really clear where there is a different setup? Also this probably braked compatibility for older software.
There was a problem hiding this comment.
That is a new initialiser! This is effectively a C interface, so no overloading allowed.
There was a problem hiding this comment.
Apologies I had missread this I thought it was replacing initCustomQueSTEnv hence the concern. Once again reading comprehension issue must do better.
There was a problem hiding this comment.
The difference is more subtle than I might have liked, but the alternative would be an initialiser that was way too long!
There was a problem hiding this comment.
initCustomMpiCommQuESTEnv?
|
Looks good not run yet. My only comment is above and raises a wider point where we need to think about how we managed change in the programming interface as we probably want to stabilise as much as possible and only add to existing functionality. |
Agreed, If we're being proper about it, breaking changes to the API require a new major version release, and I don't think we have the time or the inclination currently. |
|
Remaining testing required:
|
iarejula-bsc
left a comment
There was a problem hiding this comment.
Overall, looks good to me, really like the idea! I was wondering if this design would allow running quregs in parallel, each with their own communicator (different envs for each one, maybe?), or if that would be impossible with the current implementation. I think that would be a highly interesting feature. As my team is trying to apply malleability to QuEST, that would be a huge win for us.
A few small comments below, but note this is an untested review.
| MPI_Comm_free(&mpiCommQuest); | ||
| } | ||
|
|
||
| MPI_Comm_dup(newComm, &mpiCommQuest); |
There was a problem hiding this comment.
I think this function should be more conservative. I propose two improvements:
-
Validate
newCommbefore use. the function currently assumesnewCommis valid, but it could beMPI_COMM_NULLor an already-freed handle. -
Check the return value of
MPI_Comm_dupMPI functions return error codes, not exceptions. Ignoring this return value means any failure (e.g. out of resources, invalid communicator) goes undetected andmpiCommQuestcould end up in an undefined state.
Suggested fix:
void comm_setMpiComm(MPI_Comm newComm) {
if (newComm == MPI_COMM_NULL)
throw std::invalid_argument("newComm cannot be MPI_COMM_NULL");
if (mpiCommQuest != MPI_COMM_NULL) {
MPI_Barrier(mpiCommQuest);
MPI_Comm_free(&mpiCommQuest);
}
int err = MPI_Comm_dup(newComm, &mpiCommQuest);
if (err != MPI_SUCCESS) {
char errMsg[MPI_MAX_ERROR_STRING];
int len;
MPI_Error_string(err, errMsg, &len);
throw std::runtime_error(std::string("MPI_Comm_dup failed: ") + errMsg);
}
}|
|
||
|
|
||
| void comm_init() { | ||
| void comm_init(int userOwnsMpi) { |
There was a problem hiding this comment.
Should userOwnsMpi be a bool instead of int? It's only ever used as a boolean. Is there a repo convention for this, or is it to maintain C compatibility?
There was a problem hiding this comment.
Yes, but repo convention is to use int for historical (C) reasons
| MPI_Init(NULL, NULL); | ||
|
|
||
| MPI_Init(NULL, NULL); | ||
| MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest); |
There was a problem hiding this comment.
What if we simplify this by passing the communicator directly as a parameter instead of userOwnsMpi? If the user passes a communicator, QuEST uses it; if null/not provided, we initialize MPI and default to MPI_COMM_WORLD. This removes the ambiguity of who owns MPI and makes the API more explicit:
void comm_init(MPI_Comm userComm = MPI_COMM_NULL) {
if (userComm == MPI_COMM_NULL) {
MPI_Init(NULL, NULL);
MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest);
} else {
MPI_Comm_dup(userComm, &mpiCommQuest);
}
}There was a problem hiding this comment.
I think this is what I had in mind originally, but if I recall correctly the issue is that it brings MPI_Comm into the public headers, which we need to avoid as it causes headaches for users who are linking a non-MPI application.
I will double-check though!
| // set mpiCommQuest to user provided communicator | ||
| comm_setMpiComm(userQuestComm); | ||
|
|
||
| // initialise QuEST around that communicator | ||
| initCustomMpiQuESTEnv(USE_DISTRIB, USER_OWNS_MPI, useGpuAccel, useMultithread); |
There was a problem hiding this comment.
does this work correcty? initCustomMpiQuESTEnv calls on comm_init where we are overwriting the mpiCommQuest with MPI_COMM_WORLD, does not matter if the user has the ownsership.
void comm_init(int userOwnsMpi) {Expand commentComment on line R106
...
// QuEST must initialise MPI if the user does not own it
if (!userOwnsMpi)
MPI_Init(NULL, NULL);
MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest);
iarejula-bsc
left a comment
There was a problem hiding this comment.
In subcommunicator.cpp:17, userQuestComm appears to be duplicated into mpiCommQuest before the normal environment validation runs. If validation later fails under a throwing or custom error handler, the duplicated communicator may be left in mpiCommQuest with no cleanup, which is an MPI resource leak.
Haven't had time to trace this fully so it may be a non-issue depending on execution order, but probably worth a quick check.
|
|
||
| // safely callable before MPI initialisation, but NOT after comm_end() | ||
| int isInit; | ||
| MPI_Initialized(&isInit); |
There was a problem hiding this comment.
There are now two distinct initialized states that the code conflates into one:
MPI runtime initialized: MPI_Init() has been called
QuEST communicator initialized: mpiCommQuest != MPI_COMM_NULL
comm_isInit() only checks the first, So comm_isInit() returning true is not a safe guard against calling MPI with a null communicator.
Precise crash path:
1. MPI_Init()State After
MPI runtime: initialized
mpiCommQuest: MPI_COMM_NULL
QuEST env: not initialized
- initCustomMpiQuESTEnv(0, 1, 0, 0)
└─ validateAndInitCustomQuESTEnv(
useDistrib=0, userOwnsMpi=1,
useGpuAccel=0, useMultithread=0
)
├─ validate_envNeverInit(...)
├─ validate_newEnvDeploymentMode(...)
├─ autodep_chooseQuESTEnvDeployment(...)
├─ if (useDistrib)
│ └─ comm_init(userOwnsMpi)
│
│ ** skipped: useDistrib == 0 **
│
└─ validate_newEnvDistributedBetweenPower2Nodes(...)
└─ comm_getNumNodes()
├─ comm_isInit()
│ └─ MPI_Initialized(&isInit)
│ returns true <-- runtime is up, guard passes
│
└─ MPI_Comm_size(mpiCommQuest, &numNodes) CRASHState at Crash
MPI runtime: initialized (ok)
mpiCommQuest: MPI_COMM_NULL (!)
QuEST env: not initialized (!)
CRASH: MPI_Comm_size called with MPI_COMM_NULL -> abort
I suggest that comm_isInit() should distinguish between "MPI runtime is initialized" and "QuEST communicator is initialized", or all accessors should guard against mpiCommQuest != MPI_COMM_NULL.
| // set mpiCommQuest to user provided communicator | ||
| comm_setMpiComm(userQuestComm); | ||
|
|
||
| // initialise QuEST around that communicator | ||
| initCustomMpiQuESTEnv(useDistrib, userOwnsMpi, useGpuAccel, useMultithread); |
There was a problem hiding this comment.
I think there is a problem with the order of operations or you must ensure MPI is initialized: initCustomMpiCommQuESTEnv() duplicates the user communicator before verifying that the MPI runtime exists at all.
In the case were MPI is not being initialized, the guard will be skiped
void comm_setMpiComm(MPI_Comm newComm) {
if (mpiCommQuest != MPI_COMM_NULL) // false
error_commDoubleSetMpiComm();
MPI_Comm_dup(newComm, &mpiCommQuest); // <-- ... before any MPI_Init() check
}
We think it should allow that yes -- also useful for our CQ project. @JPRichings will explicitly add it to his testing to be sure. Thank you for catching some of the initialisation bugs -- I will have a more detailed look and try fix them later this week! Much appreciated. |
Nice! I was wondering whether using concurrent quregs at the same time could be an issue out of mpi. I have two main concerns:
|
Closes #712.
MPI_COMM_WORLDeverywhere. Instead it usesmpiCommQuest(which will very often just be a duplicate ofMPI_COMM_WORLD). This aligns with best practise for library developers, as it helps avoid avoid clashes with user MPI calls.mpiCommQueststuff was broken essentially any QuEST program would also be broken.MPI_COMM_WORLD. This interface is guarded behind compile-time flags as it requires bringing the MPI header into the interface, which means all consumers of libQuEST need to include MPI as well. This interface is not yet documented or tested. I think this is tolerable in the short term as it should only appeal to advanced users.Speaking of tests! Test results on my laptop:
and min_example output:
There is a caveat here: distributed tests on Cirrus keep failing, but this appears to be true of devel as well as this branch, so I'll do some more digging and raise it as a separate issue once I have a better of idea of what is going on.