flatreader

Show HN: AIBenchy – Independent AI Leaderboard

Hey HN, Like many of you, I'm tired of public AI leaderboards that mostly recycle the same saturated/overfitted benchmarks (MMLU, HumanEval, etc.) and often miss fast/cheap variants or real daily pain points.

A couple days ago I launched AIBenchy — a small, opinionated leaderboard running my own custom tests focused on end-user/dev scenarios that actually trip up models today.

Current tests cover categories like:

- Anti-AI Tricks (classic gotchas like "count the Rs in strawberry", logic traps)

- Instruction following & consistency

- Data parsing/extraction

- Domain-specific tasks

- Puzzle solving / edge-case reasoning

Recent additions (just pushed today):

- Reasoning score (new!): A separate judge LLM evaluates the chain-of-thought for efficiency — does it repeat itself, loop, think forever, brute-force enumerate every possibility (looking at you, some Qwen-3.5 runs), or get to the point cleanly? This penalizes "cheaty" high-token reasoning even if the final answer is correct. Goal: reward smart, concise thinking over exhaustive trial-and-error.

- Stability metric: Measures consistency across runs (some models flake on the same prompt).

Right now the leaderboard has ~20 models (Qwen3.5 Plus currently topping it, followed by GLM 5, various GPT/Claude variants, etc.), but it's super early/WIP:

- Manual runs + small test set - No public submission of tests yet (open to ideas!) - Focused on transparency & practical usefulness over massive scale

I'd love feedback from HN:

- What custom tests / gotchas / use-cases should I add next?

- Thoughts on the reasoning score — fair way to judge efficiency, or too subjective?

- Models/variants I'm missing (especially fast/cheap ones ignored elsewhere)?

- Should I let people submit their own prompts/tests eventually?

Thanks for checking it out: https://aibenchy.com

Appreciate any roast/ideas — building this to scratch my own itch.

Comments URL: https://news.ycombinator.com/item?id=47056436

Points: 1

# Comments: 1