OptiLLMBench is a new benchmark designed to evaluate how inference-time optimization techniques (such as ReRead and Chain-of-Thought Reflection) can improve LLM performance without any model changes or fine-tuning.
To give a sense of the real-world impact, here are first results with Gemini 2.0 Flash:

- Base performance: 51% accuracy
- ReRead (RE2): +5% accuracy, 2x faster
- Chain-of-Thought Reflection: +5% accuracy
The benchmark evaluates models on four task types:

- Math word problems (GSM8K)
- Formal mathematics (MMLU Math)
- Logical reasoning (AQUA-RAT)
- Yes/no comprehension (BoolQ)
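If you want to poke at the data directly, here is a minimal loading sketch using the Hugging Face datasets library. The "test" split name and the record schema are assumptions on my part, so check the dataset card for the actual layout.

    from datasets import load_dataset

    # Load the benchmark from the Hugging Face Hub.
    # Assumption: a "test" split exists; see the dataset card for real split names.
    ds = load_dataset("codelion/optillmbench", split="test")

    # Peek at a few records to inspect the schema.
    for example in ds.select(range(3)):
        print(example)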
The code works as a drop-in proxy: point any OpenAI-compatible client at it, and it applies the optimizations automatically.
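As a rough sketch of what client-side usage could look like with the OpenAI Python SDK: the port (8000), the "re2-" model-name prefix for selecting ReRead, and the model name itself are assumptions here, so check the repo README for the exact conventions.

    from openai import OpenAI

    # Point a standard OpenAI client at the local optillm proxy
    # instead of the provider's API endpoint.
    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed proxy address
        api_key="sk-...",  # your upstream provider key, forwarded by the proxy
    )

    response = client.chat.completions.create(
        # Hypothetical model name: the "re2-" prefix asks the proxy
        # to apply the ReRead technique before forwarding the request.
        model="re2-gemini-2.0-flash",
        messages=[
            {
                "role": "user",
                "content": "A train travels 60 km in 45 minutes. "
                "What is its average speed in km/h?",
            }
        ],
    )
    print(response.choices[0].message.content)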
Dataset: https://huggingface.co/datasets/codelion/optillmbench
Code: https://github.com/codelion/optillm
Would love feedback from the HN community on additional optimization techniques to include or ways to improve the benchmark.
Note: The dataset and proxy are completely open source and support any OpenAI API compatible endpoint.