Implements TurboQuant (ICLR 2026, arXiv:2504.19874) KV cache compression directly inside a Hugging Face Transformers inference script. All algorithms are self-contained, and dependencies are minimal.
- Uses https://huggingface.co/g023/Qwen3-1.77B-g023 as the demonstration model (place the model files in the Qwen3-BEST folder)
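
To give a feel for what rotation-based KV cache quantization does, here is a rough standard-library sketch. It is not the script's code and not the paper's exact algorithm. It assumes a randomized Hadamard transform as the rotation and a plain uniform grid clipped at 3/sqrt(d) as the scalar quantizer, both swapped in for simplicity. All function names (fwht, make_signs, compress, decompress) are made up for this illustration.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    Applying it twice returns the input, so it is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def make_signs(dim, seed=0):
    """Random +/-1 signs; combined with fwht they act as a cheap random rotation."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def compress(vec, signs, bits=4):
    """Return (norm, codes): one float plus one small integer per coordinate."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return 0.0, [0] * len(vec)
    # Rotate the unit vector so its energy is spread evenly over coordinates.
    rotated = fwht([s * x / norm for s, x in zip(signs, vec)])
    levels = 2 ** bits
    limit = 3.0 / math.sqrt(len(vec))  # assumption: clip at about 3 std devs
    step = 2.0 * limit / levels
    codes = []
    for x in rotated:
        idx = math.floor((x + limit) / step)
        codes.append(min(levels - 1, max(0, idx)))
    return norm, codes

def decompress(norm, codes, signs, bits=4):
    """Map codes to bin centers, undo the rotation, restore the norm."""
    levels = 2 ** bits
    limit = 3.0 / math.sqrt(len(codes))
    step = 2.0 * limit / levels
    rotated = [-limit + (c + 0.5) * step for c in codes]
    return [norm * s * x for s, x in zip(signs, fwht(rotated))]

def rel_error(a, b):
    """Relative L2 error between the original a and the reconstruction b."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return num / math.sqrt(sum(x * x for x in a))

if __name__ == "__main__":
    dim = 64
    signs = make_signs(dim, seed=1)
    rng = random.Random(2)
    key = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for one cached key vector
    norm, codes = compress(key, signs, bits=4)
    print("relative error:", rel_error(key, decompress(norm, codes, signs, bits=4)))
```

The rotation is the point of the exercise: a vector with one large outlier channel becomes a vector whose coordinates all have similar magnitude, so a single fixed grid works for every coordinate. As far as I understand the paper, it uses better-matched codebooks and an extra residual step for inner products, neither of which is shown here.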