Adds cohere-transcribe INT4/INT8 via ONNX Runtime #75
Conversation
Tested end-to-end on Windows 11 with an RTX 3090, ONNX Runtime 2.0.0-rc.12. The model loads and transcribes correctly on CPU. DirectML fails at inference with `Non-zero status code returned while running Reshape node. Name:'node_view_332'`: INT4 weight-only quantization is not compatible with the DirectML execution provider, so an FP16 or INT8 export would be needed for GPU acceleration. I have not tested those exports yet because my use case is CPU-only for now, but given the model's quality relative to its size, it is probably an investment worth making.
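Since the INT4 graph loads fine but only fails inside DirectML once the Reshape node actually runs, one practical pattern is to try providers in order and fall back to CPU on the first runtime failure. A minimal sketch of that fallback logic, using simulated loader callables (`first_working`, `directml`, and `cpu` are illustrative names, not from the PR or the ONNX Runtime API):

```python
def first_working(loaders):
    """Try (name, loader) pairs in order; return the first successful
    (name, result). Each loader stands in for provider-specific session
    setup plus a warm-up inference, since the reported DirectML INT4
    failure only surfaces at run time, not at session creation."""
    last_error = None
    for name, loader in loaders:
        try:
            return name, loader()
        except RuntimeError as exc:
            last_error = exc  # remember why this provider failed
    raise RuntimeError("no execution provider succeeded") from last_error


# Simulated providers: DirectML rejects the INT4 graph, CPU works.
def directml():
    raise RuntimeError("Non-zero status code returned while running Reshape node")

def cpu():
    return "cpu-session"

provider, session = first_working([("DmlExecutionProvider", directml),
                                   ("CPUExecutionProvider", cpu)])
print(provider)  # falls back past DirectML to the CPU provider
```

With the real onnxruntime Python package you would normally just pass `providers=["DmlExecutionProvider", "CPUExecutionProvider"]` to `InferenceSession`, but that ordering only covers session creation; a warm-up inference like the sketch above is needed to catch failures that, as here, only appear when a node executes.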
Can you provide the ONNX download links you used? I will pull this in as soon as I can test it.
Sure. Model files: https://huggingface.co/cstr/cohere-transcribe-onnx-int4
Tarball I packaged for the Handy integration (encoder + decoder + tokens.txt): https://github.com/praxeo/Handy/releases/download/v1.0.0-cohere/cohere-int4.tar.gz
Thank you! Hope to get this merged in a few hours |
I believe Handy will ship the INT8 variant: https://huggingface.co/tristanripke/cohere-transcribe-onnx-int8