[GSD-12610] clEnqueueSVMMemcpy with invalid destination pointer and committed source buffer wedges process unkillably (i915 kernel regression) #914

### Pre-submission Checklist

- [x] I am using the latest GPU driver version ([releases](https://github.com/intel/compute-runtime/releases))
- [x] I have searched for similar issues and found none

### GPU Hardware

Tested on Intel Arc A380 (dGPU) and Intel UHD 730 (iGPU) — both hang.
Does NOT reproduce on Intel Arc A770 or UHD 770 (different kernel version — see below).

### OS / Kernel

- **Hangs**: Ubuntu 24.04.3, kernel **6.17.0-20-generic**
- **Works**: Ubuntu 24.04.2, kernel **6.17.0-14-generic**

### OpenCL Runtime Version

```
Driver Version: 26.09.37435.1
intel-opencl-icd 26.09.37435.1-0
intel-igc-core-2 2.30.1
intel-igc-opencl-2 2.30.1
```

Identical on both machines.

### Summary

`clEnqueueSVMMemcpy` called with a destination pointer that is **not** a valid SVM/USM allocation should return `CL_INVALID_VALUE` or `CL_OUT_OF_RESOURCES`. On kernel 6.17.0-20 it instead enters a kernel-mode CPU loop in `i915_gem_object_userptr_submit_init` that makes the process **unkillable** — `SIGKILL` from root has no effect. Recovery requires rebooting the host.

On kernel 6.17.0-14 (same OpenCL driver, same hardware family), the call correctly returns `-5` (`CL_OUT_OF_RESOURCES`) immediately.

### Three conditions required to trigger

1. **Invalid SVM destination pointer** (e.g. `0xdeadbeef` — not a valid `clSVMAlloc` or USM allocation)
2. **Large transfer size** (tested at 4 GiB, i.e. 4096 MiB, the reproducer's default)
3. **Source buffer must have committed physical pages** (`malloc` + `memset`). Uncommitted pages (`mmap MAP_NORESERVE`) take a different code path that correctly returns `-5`.

### Minimal reproducer

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv) {
    int plat_idx = (argc >= 2) ? atoi(argv[1]) : 0;
    size_t mb    = (argc >= 3) ? (size_t)atoll(argv[2]) : 4096;
    size_t bytes = mb * 1024ull * 1024ull;

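    cl_int err;
    cl_uint nplats = 0;
    cl_platform_id plats[8];
    /* Fetch up to 8 platforms and validate the requested index */
    err = clGetPlatformIDs(8, plats, &nplats);
    if (err != CL_SUCCESS || nplats == 0) { fprintf(stderr, "no OpenCL platforms\n"); return 2; }
    if (nplats > 8) nplats = 8;
    if (plat_idx < 0 || (cl_uint)plat_idx >= nplats) { fprintf(stderr, "bad platform index\n"); return 2; }
    cl_device_id dev;
    err = clGetDeviceIDs(plats[plat_idx], CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no GPU device on platform %d\n", plat_idx); return 2; }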

    char dname[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof dname, dname, NULL);
    fprintf(stderr, "Device: %s, size: %zu MB\n", dname, mb);

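    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    if (err != CL_SUCCESS) { fprintf(stderr, "clCreateContext failed: %d\n", (int)err); return 2; }
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    if (err != CL_SUCCESS) { fprintf(stderr, "clCreateCommandQueueWithProperties failed: %d\n", (int)err); return 2; }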

    /* Committed source buffer — must be malloc+memset, not mmap */
    char* h = (char*)malloc(bytes);
    if (!h) { fprintf(stderr, "malloc failed\n"); return 3; }
    memset(h, 0xab, bytes);

    void* d = (void*)0xdeadbeefULL;  /* NOT a valid SVM allocation */

    fprintf(stderr, "clEnqueueSVMMemcpy dst=%p size=%zu\n", d, bytes);
    err = clEnqueueSVMMemcpy(q, CL_FALSE, d, h, bytes, 0, NULL, NULL);
    fprintf(stderr, "enqueue returned %d\n", (int)err);

    /* Expected: err != 0 (rejected).
       Observed on 6.17.0-20: process is already wedged above,
       never reaches this line. */
    if (err == CL_SUCCESS) {
        fprintf(stderr, "clFinish...\n");
        err = clFinish(q);
        fprintf(stderr, "clFinish returned %d\n", (int)err);
    }

    free(h);
    return (err != CL_SUCCESS) ? 0 : 1;
}
```

Build and run:
```bash
gcc -O2 cl_badptr.c -lOpenCL -o cl_badptr
# WARNING: on affected kernels this will wedge one CPU core until reboot
timeout --kill-after=10s 30s ./cl_badptr 0 4096
```
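For automated testing, the `timeout --kill-after` invocation above could be approximated in C with a small watchdog that also reports whether the child survived `SIGKILL`. This is a minimal sketch under stated assumptions: `run_with_watchdog`, `wait_up_to`, and the `WD_*` codes are hypothetical names, and the 100 ms polling interval is arbitrary.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

enum { WD_FORK_FAILED = -1, WD_SIGNALED = -2, WD_UNKILLABLE = -3 };

/* Poll for child exit for up to `seconds`; returns 1 and fills *status
   once the child has been reaped, 0 otherwise. */
static int wait_up_to(pid_t pid, int seconds, int *status) {
    struct timespec tick = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    for (int i = 0; i < seconds * 10; i++) {
        if (waitpid(pid, status, WNOHANG) == pid) return 1;
        nanosleep(&tick, NULL);
    }
    return waitpid(pid, status, WNOHANG) == pid;
}

/* Run argv[0] with a deadline. Returns the child's exit status (0..255),
   WD_SIGNALED if it died from a signal (including our SIGKILL), or
   WD_UNKILLABLE if it is still alive grace_s seconds after SIGKILL. */
static int run_with_watchdog(char *const argv[], int timeout_s, int grace_s) {
    pid_t pid = fork();
    if (pid < 0) return WD_FORK_FAILED;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* exec failed */
    }
    int status = 0;
    if (wait_up_to(pid, timeout_s, &status))
        return WIFEXITED(status) ? WEXITSTATUS(status) : WD_SIGNALED;
    kill(pid, SIGKILL);
    if (wait_up_to(pid, grace_s, &status))
        return WD_SIGNALED;
    return WD_UNKILLABLE;  /* child never processed SIGKILL */
}
```

Called with `{"./cl_badptr", "0", "4096", NULL}`, a timeout of 30 and a grace period of 10, a `WD_UNKILLABLE` result would correspond to the wedged state described in this report. The watchdog uses `WNOHANG` polling because a blocking `waitpid` on an unkillable child would never return.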

### Results

| Machine | Kernel | GPU | Result |
|---|---|---|---|
| cupcake | 6.17.0-**14**-generic | Arc A770 | `enqueue returned -5` — **OK** |
| cupcake | 6.17.0-**14**-generic | UHD 770 | `enqueue returned -5` — **OK** |
| meatloaf | 6.17.0-**20**-generic | Arc A380 | **HANG** at `clEnqueueSVMMemcpy` (exit 137) |
| meatloaf | 6.17.0-**20**-generic | UHD 730 | **HANG** at `clEnqueueSVMMemcpy` (exit 137) |

With `mmap(MAP_NORESERVE)` instead of `malloc+memset` for the source, **all four combinations pass** (return -5). The committed-pages requirement points to the i915 userptr/scatter-gather DMA setup path.

### Kernel call trace (dmesg hung_task watchdog, from earlier investigation)

```
intel_iommu_map_pages+0xe7/0x140
iommu_map_nosync+0x133/0x2b0
iommu_map_sg+0xc8/0x1b0
iommu_dma_map_sg+0x59a/0x630
 ? i915_gem_shrink+0x6af/0x7a0 [i915]
__dma_map_sg_attrs+0x13b/0x1b0
dma_map_sg_attrs+0xe/0x30
i915_gem_gtt_prepare_pages+0x55/0x90 [i915]
i915_gem_userptr_get_pages+0xf3/0x200 [i915]
____i915_gem_object_get_pages+0x23/0x70 [i915]
i915_gem_object_userptr_submit_init+0x38c/0x420 [i915]
eb_lookup_vmas+0x141/0x290 [i915]
```

### Process state while wedged

```
State:   R (running)
SigBlk:  0000000000000000
WCHAN:   -
%CPU:    99.9
```

The process is on-CPU inside a kernel function and never reaches a point where pending signals are checked, which is why `SIGKILL` is not delivered.
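The fields shown above can be collected programmatically from `/proc/<pid>/status`. The sketch below is one way to do it on Linux; the helper names `proc_status_field` and `print_wedge_state` are hypothetical. `SigPnd` and `ShdPnd` are extra fields that were not captured in this report; they are included on the assumption that they would show whether a signal is queued but undelivered.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the value of `key` (e.g. "State") from /proc/<pid>/status into
   `out`. Returns 0 on success, -1 if the file or field is missing. */
static int proc_status_field(pid_t pid, const char *key, char *out, size_t outsz) {
    char path[64], line[256];
    if (outsz == 0) return -1;
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    size_t klen = strlen(key);
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            const char *v = line + klen + 1;
            while (*v == ' ' || *v == '\t') v++;   /* skip padding */
            snprintf(out, outsz, "%s", v);
            out[strcspn(out, "\n")] = '\0';        /* strip newline */
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}

/* Print the scheduler state and signal masks for a suspect process. */
static void print_wedge_state(pid_t pid) {
    static const char *keys[] = { "State", "SigPnd", "ShdPnd", "SigBlk" };
    char val[128];
    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
        if (proc_status_field(pid, keys[i], val, sizeof val) == 0)
            printf("%-7s %s\n", keys[i], val);
}
```

Calling `print_wedge_state(pid)` from a second process, before and after sending `SIGKILL`, gives a quick view of whether the task is still reported as running.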

### Impact

- The wedged process does NOT poison the GPU for other workloads.
- It permanently consumes one CPU core and several GB of RSS.
- Only a host reboot clears it.

### Notes

- We discovered this via [chipStar](https://github.com/CHIP-SPV/chipStar) (a HIP-on-SPIR-V/OpenCL implementation). chipStar uses `clEnqueueSVMMemcpy` for `hipMemcpy` in its Intel USM allocation strategy. When a HIP application ignores a failed `hipMalloc` and proceeds to `hipMemcpy` with the invalid pointer, chipStar forwards it to `clEnqueueSVMMemcpy`, hitting this bug.
- The OpenCL runtime could also add defense-in-depth by validating SVM pointers before delegating to the kernel, but the root cause appears to be an **i915 kernel regression between 6.17.0-14 and 6.17.0-20**.
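As an illustration of the defense-in-depth idea, a caller or runtime layer could keep a record of live SVM/USM allocations and reject any destination that falls outside them before the pointer reaches the kernel driver. The sketch below is hypothetical throughout: `svm_registry`, `svm_registry_add`, and `svm_registry_contains` are invented names, the fixed-size table is a simplification, and none of this reflects how compute-runtime or chipStar actually track allocations.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ALLOCS 64  /* arbitrary capacity for this sketch */

typedef struct { uintptr_t base; size_t size; } svm_range;
typedef struct { svm_range r[MAX_ALLOCS]; size_t n; } svm_registry;

/* Record a successful allocation. Returns 0 on success, -1 if the
   table is full or the arguments are invalid. */
static int svm_registry_add(svm_registry *reg, const void *base, size_t size) {
    if (!base || size == 0 || reg->n == MAX_ALLOCS) return -1;
    reg->r[reg->n].base = (uintptr_t)base;
    reg->r[reg->n].size = size;
    reg->n++;
    return 0;
}

/* Returns 1 if [ptr, ptr + len) lies entirely inside one recorded
   allocation, 0 otherwise. Written to avoid integer overflow. */
static int svm_registry_contains(const svm_registry *reg, const void *ptr, size_t len) {
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < reg->n; i++) {
        uintptr_t b = reg->r[i].base;
        size_t s = reg->r[i].size;
        if (p >= b && p - b <= s && len <= s - (p - b)) return 1;
    }
    return 0;
}
```

A wrapper around `clEnqueueSVMMemcpy` could call `svm_registry_contains(&reg, dst, bytes)` and return `CL_INVALID_VALUE` when it yields 0, so a pointer such as `0xdeadbeef` would be rejected in user space regardless of kernel version.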
