[GSD-12610] clEnqueueSVMMemcpy with invalid destination pointer and committed source buffer wedges process unkillably (i915 kernel regression) #914

@pvelesko

Pre-submission Checklist

  • I am using the latest GPU driver version (releases)
  • I have searched for similar issues and found none

GPU Hardware

Tested on Intel Arc A380 (dGPU) and Intel UHD 730 (iGPU) — both hang.
Does NOT reproduce on Intel Arc A770 or UHD 770 (different kernel version — see below).

OS / Kernel

  • Hangs: Ubuntu 24.04.3, kernel 6.17.0-20-generic
  • Works: Ubuntu 24.04.2, kernel 6.17.0-14-generic

OpenCL Runtime Version

Driver Version: 26.09.37435.1
intel-opencl-icd 26.09.37435.1-0
intel-igc-core-2 2.30.1
intel-igc-opencl-2 2.30.1

Identical on both machines.

Summary

clEnqueueSVMMemcpy called with a destination pointer that is not a valid SVM/USM allocation should return CL_INVALID_VALUE or CL_OUT_OF_RESOURCES. On kernel 6.17.0-20 it instead enters a kernel-mode CPU loop in i915_gem_object_userptr_submit_init that makes the process unkillable: SIGKILL from root has no effect. Recovery requires rebooting the host.

On kernel 6.17.0-14 (same OpenCL driver, same hardware family), the call correctly returns -5 (CL_OUT_OF_RESOURCES) immediately.
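
For readers decoding the raw integer printed by the reproducer, a small helper along these lines maps the codes named in this report to their symbolic names. This is only a sketch: `cl_err_name` and the `ERR_*` constants are hypothetical names, and the numeric values are copied from the Khronos `CL/cl.h` header rather than queried from the driver.

```c
/* Numeric values as defined in the Khronos CL/cl.h header, redeclared
   under hypothetical names so the snippet builds without OpenCL headers. */
enum {
    ERR_SUCCESS          = 0,
    ERR_OUT_OF_RESOURCES = -5,
    ERR_INVALID_VALUE    = -30
};

/* Map an OpenCL status code to a printable name. Only the codes
   discussed in this report are covered. */
static const char *cl_err_name(int err) {
    switch (err) {
    case ERR_SUCCESS:          return "CL_SUCCESS";
    case ERR_OUT_OF_RESOURCES: return "CL_OUT_OF_RESOURCES";
    case ERR_INVALID_VALUE:    return "CL_INVALID_VALUE";
    default:                   return "unrecognized";
    }
}
```

In the reproducer this could be used as `fprintf(stderr, "enqueue returned %d (%s)\n", (int)err, cl_err_name(err));`.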

Three conditions required to trigger

  1. Invalid SVM destination pointer (e.g. 0xdeadbeef — not a valid clSVMAlloc or USM allocation)
  2. Large transfer size (tested at 4 GB)
  3. Source buffer must have committed physical pages (malloc + memset). Uncommitted pages (mmap MAP_NORESERVE) take a different code path that correctly returns -5.

Minimal reproducer

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv) {
    int plat_idx = (argc >= 2) ? atoi(argv[1]) : 0;
    size_t mb    = (argc >= 3) ? (size_t)atoll(argv[2]) : 4096;
    size_t bytes = mb * 1024ull * 1024ull;

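    cl_int err;
    cl_uint nplats = 0;
    err = clGetPlatformIDs(0, NULL, &nplats);
    if (err != CL_SUCCESS || nplats == 0) { fprintf(stderr, "no OpenCL platforms\n"); return 2; }
    if (nplats > 8) nplats = 8;
    cl_platform_id plats[8];
    clGetPlatformIDs(nplats, plats, NULL);
    /* Reject an out-of-range platform index instead of reading past plats[] */
    if (plat_idx < 0 || (cl_uint)plat_idx >= nplats) { fprintf(stderr, "bad platform index\n"); return 2; }
    cl_device_id dev;
    err = clGetDeviceIDs(plats[plat_idx], CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no GPU device on platform %d\n", plat_idx); return 2; }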

    char dname[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof dname, dname, NULL);
    fprintf(stderr, "Device: %s, size: %zu MB\n", dname, mb);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

    /* Committed source buffer — must be malloc+memset, not mmap */
    char* h = (char*)malloc(bytes);
    if (!h) { fprintf(stderr, "malloc failed\n"); return 3; }
    memset(h, 0xab, bytes);

    void* d = (void*)0xdeadbeefULL;  /* NOT a valid SVM allocation */

    fprintf(stderr, "clEnqueueSVMMemcpy dst=%p size=%zu\n", d, bytes);
    err = clEnqueueSVMMemcpy(q, CL_FALSE, d, h, bytes, 0, NULL, NULL);
    fprintf(stderr, "enqueue returned %d\n", (int)err);

    /* Expected: err != 0 (rejected).
       Observed on 6.17.0-20: process is already wedged above,
       never reaches this line. */
    if (err == CL_SUCCESS) {
        fprintf(stderr, "clFinish...\n");
        err = clFinish(q);
        fprintf(stderr, "clFinish returned %d\n", (int)err);
    }

    free(h);
    return (err != CL_SUCCESS) ? 0 : 1;
}

Build and run:

gcc -O2 cl_badptr.c -lOpenCL -o cl_badptr
# WARNING: on affected kernels this will wedge one CPU core until reboot
timeout --kill-after=10s 30s ./cl_badptr 0 4096

Results

Machine   Kernel             GPU       Result
cupcake   6.17.0-14-generic  Arc A770  enqueue returned -5 (OK)
cupcake   6.17.0-14-generic  UHD 770   enqueue returned -5 (OK)
meatloaf  6.17.0-20-generic  Arc A380  HANG at clEnqueueSVMMemcpy (exit 137)
meatloaf  6.17.0-20-generic  UHD 730   HANG at clEnqueueSVMMemcpy (exit 137)

With mmap(MAP_NORESERVE) instead of malloc+memset for the source, all four combinations pass (return -5). The committed-pages requirement points to the i915 userptr/scatter-gather DMA setup path.

Kernel call trace (dmesg hung_task watchdog, from earlier investigation)

intel_iommu_map_pages+0xe7/0x140
iommu_map_nosync+0x133/0x2b0
iommu_map_sg+0xc8/0x1b0
iommu_dma_map_sg+0x59a/0x630
 ? i915_gem_shrink+0x6af/0x7a0 [i915]
__dma_map_sg_attrs+0x13b/0x1b0
dma_map_sg_attrs+0xe/0x30
i915_gem_gtt_prepare_pages+0x55/0x90 [i915]
i915_gem_userptr_get_pages+0xf3/0x200 [i915]
____i915_gem_object_get_pages+0x23/0x70 [i915]
i915_gem_object_userptr_submit_init+0x38c/0x420 [i915]
eb_lookup_vmas+0x141/0x290 [i915]

Process state while wedged

State:   R (running)
SigBlk:  0000000000000000
WCHAN:   -
%CPU:    99.9

The task is on-CPU inside a kernel function and never reaches a point where pending signals are checked.
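
The state field can be polled from a separate monitor process while the reproducer is stuck. The sketch below reads the `State:` line from `/proc/<pid>/status`; `proc_state` is a hypothetical helper, and reading procfs is an assumption about how the values in this section were collected.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), used in the usage note below */

/* Returns the one-letter task state ('R', 'S', 'D', ...) from
   /proc/<pid>/status, or '\0' if the file cannot be read or parsed. */
static char proc_state(pid_t pid) {
    char path[64], line[256], state = '\0';
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return '\0';
    while (fgets(line, sizeof line, f)) {
        /* Only the "State:" line matches this format string. */
        if (sscanf(line, "State: %c", &state) == 1) break;
    }
    fclose(f);
    return state;
}
```

Calling `proc_state(getpid())` reports the caller's own state. Pointed at the wedged reproducer's PID, a monitor would be expected to keep seeing `R`, matching the output above.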

Impact

  • Wedged process does NOT poison the GPU for other workloads.
  • Permanently consumes one CPU core and several GB of RSS.
  • Only host reboot clears it.

Notes

  • We discovered this via chipStar (a HIP-on-SPIR-V/OpenCL implementation). chipStar uses clEnqueueSVMMemcpy for hipMemcpy in its Intel USM allocation strategy. When a HIP application ignores a failed hipMalloc and proceeds to hipMemcpy with the invalid pointer, chipStar forwards it to clEnqueueSVMMemcpy, hitting this bug.
  • The OpenCL runtime could also add defense-in-depth by validating SVM pointers before delegating to the kernel, but the root cause appears to be an i915 kernel regression between 6.17.0-14 and 6.17.0-20.
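
The defense-in-depth idea in the last note could look roughly like the sketch below: keep a table of live SVM/USM allocations and refuse a copy whose destination range is not fully inside one of them. Everything here is hypothetical (the `svm_track` and `svm_range_is_tracked` names, the fixed-size table, the absence of locking and of removal on free); it is not how the Intel runtime or chipStar tracks allocations.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

/* Hypothetical bookkeeping: one entry per live clSVMAlloc/USM allocation. */
struct svm_range { uintptr_t base; size_t size; };
static struct svm_range g_ranges[MAX_TRACKED];
static size_t g_nranges;

/* Record an allocation; returns 0 on success, -1 if the table is full. */
static int svm_track(const void *base, size_t size) {
    if (g_nranges == MAX_TRACKED) return -1;
    g_ranges[g_nranges].base = (uintptr_t)base;
    g_ranges[g_nranges].size = size;
    g_nranges++;
    return 0;
}

/* Returns 1 if [ptr, ptr + bytes) lies entirely inside one tracked range. */
static int svm_range_is_tracked(const void *ptr, size_t bytes) {
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < g_nranges; i++) {
        uintptr_t base = g_ranges[i].base;
        size_t size = g_ranges[i].size;
        /* Written with subtractions so that p + bytes is never computed
           and cannot overflow. */
        if (p >= base && p - base <= size && bytes <= size - (p - base))
            return 1;
    }
    return 0;
}
```

A caller would invoke `svm_track` after each successful allocation and check `svm_range_is_tracked(dst, bytes)` before clEnqueueSVMMemcpy, returning an error instead of enqueueing when the check fails. With the reproducer's `0xdeadbeef` destination the check fails and the kernel path is never reached. Whether plain host pointers should also be accepted depends on the device's SVM capabilities, which this sketch does not model.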

Labels: OS: Linux, Status: Transferred, Type: Bug