Ctrls performance #739
Todo:
Assumptions in this code:
Alloc size 64 qubits -- add as macro somewhere if not done already
const int NUM_THREADS_PER_BLOCK = 128;
const int NUM_THREADS_PER_BLOCK =128;
__device__ __constant__ int ctrl_device[30];
TODO: use a shared constant such as MAX_NUM_QUBITS = 64 in constants.hpp instead of the literal array size.
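One way that TODO could look, as a minimal sketch only. It assumes `constants.hpp` is the right home for the constant; the name `MAX_NUM_QUBITS` and the value 64 are taken from the comment above, while `fitsCtrlCache` is a hypothetical helper added purely for illustration. The constant-memory declaration is guarded so that it is only seen by nvcc.

```cpp
// Hypothetical contents for constants.hpp. The name and the value 64 are the
// ones floated in the comment above, not a confirmed project constant.
constexpr int MAX_NUM_QUBITS = 64;

#ifdef __CUDACC__
// Constant-memory ctrl cache sized by the shared constant rather than a
// literal 30. This declaration is only compiled by nvcc.
__device__ __constant__ int ctrl_device[MAX_NUM_QUBITS];
#endif

// Hypothetical host-side guard: true when numCtrls ctrl qubits fit in the cache.
constexpr bool fitsCtrlCache(int numCtrls) {
    return numCtrls >= 0 && numCtrls <= MAX_NUM_QUBITS;
}
```

Checking `fitsCtrlCache(numCtrls)` before any host-to-device copy would turn an overrun of the fixed-size buffer into an explicit error rather than silent corruption.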
Reminder of other stuff from meeting:
Way easier to do this instead:
"Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above"
"Note that in both preceding examples, kernel parameters are annotated with the __grid_constant__ qualifier to indicate they are read-only."
Reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future. (This is also much better than the variadic kernel idea I had earlier today.)
Other benefits:
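For illustration, passing the ctrls as a read-only kernel parameter might look roughly like the sketch below. Everything named here is an assumption rather than code from this PR: `CtrlList`, `makeCtrlList`, `allCtrlsSet` and `negateCtrlled` are hypothetical, and the capacity of 64 is the value suggested elsewhere in this review. The kernel is guarded so it is only compiled by nvcc; `__grid_constant__` needs an sm_70 or newer target.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // plain C++ build: helpers are ordinary host functions
#endif

// Assumed capacity; 64 is the value suggested in review, not a confirmed constant.
constexpr int MAX_NUM_QUBITS = 64;

// Hypothetical wrapper so the ctrl list travels by value as a kernel parameter.
struct CtrlList {
    int qubits[MAX_NUM_QUBITS];
    int count;
};

// Host helper: copy the ctrl indices into the wrapper. The count is clamped to
// [0, MAX_NUM_QUBITS]; real code would more likely report an error instead.
inline CtrlList makeCtrlList(const int* ctrls, int numCtrls) {
    CtrlList list{};  // zero-initialised
    if (numCtrls < 0) numCtrls = 0;
    if (numCtrls > MAX_NUM_QUBITS) numCtrls = MAX_NUM_QUBITS;
    list.count = numCtrls;
    for (int i = 0; i < numCtrls; ++i) list.qubits[i] = ctrls[i];
    return list;
}

// True when every ctrl qubit's bit is set in the amplitude index.
HOST_DEVICE inline bool allCtrlsSet(long long index, const CtrlList& ctrls) {
    for (int c = 0; c < ctrls.count; ++c)
        if (((index >> ctrls.qubits[c]) & 1LL) == 0) return false;
    return true;
}

#ifdef __CUDACC__
// Illustrative kernel: negates every amplitude whose index has all ctrl bits set.
// The ctrl list is a const, read-only parameter marked __grid_constant__.
__global__ void negateCtrlled(double* amps, long long numAmps,
                              const __grid_constant__ CtrlList ctrls) {
    long long i = static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < numAmps && allCtrlsSet(i, ctrls)) amps[i] = -amps[i];
}
// Possible launch:
//   negateCtrlled<<<numBlocks, 128>>>(d_amps, numAmps, makeCtrlList(h_ctrls, n));
#endif
```

Because the list is copied into the parameter space of each launch, two launches (for example on different quregs) never share a ctrl buffer, which is the property the comment above is after.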
Profiling to confirm, but here is an extra factor of 2 over #729 (comment):
Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the
The pattern of
Orthogonally, do we have any concerns about bumping min CUDA to 12.1 (released 2023)?
I think it is possible to have multiple arguments passed this way. All this is (and it's what I first tried for this optimisation) is capturing a C-style array by value in the kernel launch. The
I think this is much cleaner than the data movement to const memory: instead of pointers to device constant memory, we just pass the array directly. The only change to the original kernel (before all this optimisation stuff) is that we need to access a data member of a struct. There is no other change inside the kernel, and we aren't accessing a global device variable out of the blue mid-kernel. In
We probably don't need to change CUDA versions unless we think we are passing more than 4096 bytes to the kernel. I don't think we are anywhere near this, but no objection to moving to CUDA 12.1.
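A rough sketch of the multiple-argument case and the 4096-byte budget, under stated assumptions: `CtrlList`, `TargList`, `allBitsSet`, `PARAM_BYTES` and `negateMatching` are hypothetical names, 64 is the capacity suggested in review, and the byte count is a plain sum of the parameter sizes that ignores any alignment padding. The kernel is only compiled by nvcc.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // plain C++ build: helpers are ordinary host functions
#endif

constexpr int MAX_NUM_QUBITS = 64;                 // assumed capacity
constexpr std::size_t CLASSIC_PARAM_LIMIT = 4096;  // pre-12.1 limit mentioned above

// Two hypothetical, distinct by-value arguments, each wrapping a C-style array.
struct CtrlList { int qubits[MAX_NUM_QUBITS]; int count; };
struct TargList { int qubits[MAX_NUM_QUBITS]; int count; };

// Approximate parameter footprint of the kernel below (padding ignored).
constexpr std::size_t PARAM_BYTES =
    sizeof(double*) + sizeof(long long) + sizeof(CtrlList) + sizeof(TargList);
static_assert(PARAM_BYTES <= CLASSIC_PARAM_LIMIT,
              "kernel parameters would exceed the classic 4096-byte limit");

// True when every listed qubit's bit is set in the amplitude index.
HOST_DEVICE inline bool allBitsSet(long long index, const int* qubits, int count) {
    for (int q = 0; q < count; ++q)
        if (((index >> qubits[q]) & 1LL) == 0) return false;
    return true;
}

#ifdef __CUDACC__
// Illustrative kernel taking both lists by value. The only difference from a
// pointer-based version is reading the array through a struct member.
__global__ void negateMatching(double* amps, long long numAmps,
                               CtrlList ctrls, TargList targs) {
    long long i = static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i >= numAmps) return;
    if (allBitsSet(i, ctrls.qubits, ctrls.count) &&
        allBitsSet(i, targs.qubits, targs.count))
        amps[i] = -amps[i];
}
#endif
```

With these sizes the two lists come to a few hundred bytes, so the `static_assert` would only fire if the structs grew by roughly an order of magnitude.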
Finally, the usual caveat: correctness checking needs to take place with a full run of the test suite!


Initial pass on performance: remove the thrust device_vector that was used to move ctrls to the device. It was hurting performance because a thrust device_vector copies its data from host to device on construction.