A Qwen 3 inference engine written in Rust, built using cutile-rs.
- Follow the installation guide of cutile-rs.
- Configure your environment variables for cutile-rs.
- Set `CUDA_TOOLKIT_PATH` to your CUDA 13.2 install directory.
- Ensure `llvm-config` points to LLVM 21; this is required by `melior`. Alternatively, set `LLVM_SYSPATHXX`.
- Set `CUDA_TILE_USE_LLVM_INSTALL_DIR` to your LLVM 21 install directory (for example `/usr/lib/llvm-21`). This is required by cutile-rs.
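A minimal environment setup following the steps above might look like this; the CUDA path is an assumption, so adjust both directories to your system:

```shell
# Assumed install locations -- adjust to your system.
export CUDA_TOOLKIT_PATH=/usr/local/cuda-13.2
export CUDA_TILE_USE_LLVM_INSTALL_DIR=/usr/lib/llvm-21

# Sanity check: llvm-config should report a 21.x version if present.
if command -v llvm-config >/dev/null; then
  llvm-config --version
fi
```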
```shell
CUDA_TILE_USE_LLVM_INSTALL_DIR=/usr/lib/llvm-21 cargo +nightly run --release -- --model <path-to-qwen3-model> --prompt "Hello, how are you?" --max-new-tokens 128
```
| Flag | Description |
|---|---|
| `--model <path>` | Path to model directory (safetensors + config.json) |
| `--prompt <text>` | Input prompt |
| `--max-new-tokens <n>` | Number of tokens to generate (default: 128) |
| `--max-seq-len <n>` | Override max sequence length |
| `--sample` | Enable sampling (default: greedy) |
| `--raw-prompt` | Skip chat template, use prompt as-is |
| `--device-argmax` | Run argmax on device |
| `--profile` | Print per-step profiling report |
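As an illustration of combining these flags, the following invocation enables sampling and profiling; the model path and prompt are placeholders, not part of the repository:

```shell
# Hypothetical model directory -- substitute your own download.
MODEL_DIR=./models/qwen3-0.6b

CUDA_TILE_USE_LLVM_INSTALL_DIR=/usr/lib/llvm-21 cargo +nightly run --release -- \
  --model "$MODEL_DIR" \
  --prompt "Write a haiku about Rust." \
  --max-new-tokens 64 \
  --sample \
  --profile
```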
| Variable | Description |
|---|---|
| `GROUT_CUDA_GRAPH_DECODE` | Set to `1` to enable CUDA graph capture for decode |
| `GROUT_CUBLAS_COMPUTE16` | Set to `1` to use FP16 accumulation in cuBLAS |
| `GROUT_CUBLAS_COMPUTE16_MAX_M` | Max M dimension for FP16 compute |
| `GROUT_CUBLAS_FAST_ALGO` | cuBLAS algorithm selection |
| `GROUT_ATTN_BN_DECODE` | Attention block size for decode |
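For example, a decode-throughput-oriented run might set the first three variables before launching; the `MAX_M` value here is an illustrative assumption, not a documented default:

```shell
# Capture the decode loop as a CUDA graph to cut launch overhead.
export GROUT_CUDA_GRAPH_DECODE=1
# Use FP16 accumulation in cuBLAS (faster, lower precision).
export GROUT_CUBLAS_COMPUTE16=1
# Only apply FP16 compute up to this M dimension (assumed value).
export GROUT_CUBLAS_COMPUTE16_MAX_M=16
```

These take effect for any subsequent `cargo +nightly run` in the same shell.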
Apache-2.0