# grout

A Qwen 3 inference engine written in Rust, built using cutile-rs.

## Usage

1. Follow the cutile-rs installation guide.
2. Configure the environment variables cutile-rs needs:
   - Set `CUDA_TOOLKIT_PATH` to your CUDA 13.2 install directory.
   - Ensure `llvm-config` points to LLVM 21 (required by melior), or set `LLVM_SYSPATHXX`.
   - Set `CUDA_TILE_USE_LLVM_INSTALL_DIR` to your LLVM 21 install directory (for example `/usr/lib/llvm-21`); this is required by cutile-rs.
3. Run the engine:

```shell
CUDA_TILE_USE_LLVM_INSTALL_DIR=/usr/lib/llvm-21 cargo +nightly run --release -- --model <path-to-qwen3-model> --prompt "Hello, how are you?" --max-new-tokens 128
```

## Options

| Flag | Description |
|------|-------------|
| `--model <path>` | Path to the model directory (safetensors + `config.json`) |
| `--prompt <text>` | Input prompt |
| `--max-new-tokens <n>` | Number of tokens to generate (default: 128) |
| `--max-seq-len <n>` | Override the maximum sequence length |
| `--sample` | Enable sampling (default: greedy) |
| `--raw-prompt` | Skip the chat template and use the prompt as-is |
| `--device-argmax` | Run argmax on the device |
| `--profile` | Print a per-step profiling report |
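By default, decoding is greedy: each step emits the token with the highest logit, and `--sample` switches to stochastic sampling instead. The greedy choice can be sketched as follows (this is an illustration, not the repository's actual code; the `argmax_token` helper name is hypothetical):

```rust
/// Hypothetical helper: pick the index of the largest logit (greedy decoding).
fn argmax_token(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

fn main() {
    // Toy logit vector; index 2 holds the highest score.
    let logits = [0.1_f32, -1.3, 2.7, 0.4];
    println!("next token id: {}", argmax_token(&logits)); // prints 2
}
```

With `--device-argmax` this reduction would run on the GPU instead of the host.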

## Environment Variables

| Variable | Description |
|----------|-------------|
| `GROUT_CUDA_GRAPH_DECODE` | Set to `1` to enable CUDA graph capture for decode |
| `GROUT_CUBLAS_COMPUTE16` | Set to `1` to use FP16 accumulation in cuBLAS |
| `GROUT_CUBLAS_COMPUTE16_MAX_M` | Maximum M dimension for FP16 compute |
| `GROUT_CUBLAS_FAST_ALGO` | cuBLAS algorithm selection |
| `GROUT_ATTN_BN_DECODE` | Attention block size for decode |
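The boolean variables above act as switches enabled by setting them to `1`. A minimal sketch of how such a flag can be read at startup (an illustration under that assumption, not the repository's code; `env_flag` is a hypothetical helper):

```rust
use std::env;

/// Hypothetical helper: treat an environment variable as a boolean
/// switch that is on only when it is set to exactly "1".
fn env_flag(name: &str) -> bool {
    env::var(name).map(|v| v == "1").unwrap_or(false)
}

fn main() {
    let cuda_graph_decode = env_flag("GROUT_CUDA_GRAPH_DECODE");
    println!("CUDA graph decode enabled: {cuda_graph_decode}");
}
```

An unset variable, or any value other than `1`, leaves the feature disabled.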

## License

Apache-2.0

## About

Testbed for LLM inference with cutile-rs.
