InfiBench
Stack Overflow-derived benchmark for evaluating code LLMs on real-world programming questions.
Pick InfiBench if you need a code-LLM benchmark that reflects messy real-world developer questions rather than clean function-completion tasks.
Skip it if you want a hosted eval-as-a-service or a benchmark covering agentic, multi-file, or repo-scale coding tasks.
InfiBench is an open-source evaluation benchmark for code-focused large language models, built from 234 carefully curated Stack Overflow questions spanning 15 programming languages. Unlike HumanEval or MBPP, which test pure code generation from synthetic prompts, InfiBench measures how well a model answers the kinds of mixed-format questions developers actually ask: debugging help, API usage, configuration, framework quirks, and idiomatic patterns. It was introduced at the NeurIPS 2024 Datasets and Benchmarks Track and ships with domain-expert-curated correctness criteria for every question.
The benchmark uses four evaluation metrics — keyword matching, fill-in-the-blank, unit testing, and dialogue similarity — to handle the heterogeneous nature of real Q&A responses. The project maintains a public leaderboard with 100+ evaluated models, making it useful as a reference point when comparing code-LLM capability beyond toy coding tasks. It is aimed squarely at researchers, model builders, and engineering teams choosing which code model to deploy.
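To make the keyword-matching metric concrete, here is a minimal Python sketch of how a free-form answer could be scored against expert-written keywords. It is an illustration only, not InfiBench's actual implementation: the `KeywordRule` class, the `keyword_score` function, the weighting scheme, and the sample rules are all hypothetical names and assumptions introduced for this example.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class KeywordRule:
    """One grading rule (hypothetical structure, not InfiBench's schema)."""
    pattern: str           # literal keyword, or a regex if is_regex is True
    weight: float = 1.0    # relative importance of this rule
    is_regex: bool = False


def keyword_score(response: str, rules: list[KeywordRule]) -> float:
    """Return the weighted fraction of rules the response satisfies, in [0, 1]."""
    total = sum(rule.weight for rule in rules)
    if total == 0:
        return 0.0  # no rules (or zero total weight): nothing to score

    matched = 0.0
    for rule in rules:
        if rule.is_regex:
            hit = re.search(rule.pattern, response, flags=re.IGNORECASE) is not None
        else:
            # Case-insensitive substring match for literal keywords.
            hit = rule.pattern.lower() in response.lower()
        if hit:
            matched += rule.weight
    return matched / total


# Example: a question about keeping every row of the left DataFrame.
rules = [
    KeywordRule("merge"),
    KeywordRule(r"how\s*=\s*['\"]left['\"]", weight=2.0, is_regex=True),
]
answer = "Use df.merge(other, how='left') to keep every row of df."
print(keyword_score(answer, rules))  # 1.0
```

A scorer along these lines rewards answers that mention the right API or option without requiring the response to be executable, which is why keyword matching suits explanatory answers that unit tests cannot grade.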
Running it requires a Linux environment and Hugging Face Transformers-format models. The code and dataset are released under CC BY-SA 4.0 via GitHub. There is no hosted SaaS — InfiBench is a benchmark harness, not a product you log into.
InfiBench fills a real gap: most code benchmarks score isolated function writing, while much day-to-day engineering work looks like a Stack Overflow question (debugging, API usage, configuration). The four-metric design is pragmatic, and the public leaderboard makes it immediately useful for model selection. Treat it as one signal among several, not a single source of truth.
— The AI Tool Bible editorial team
Pros
- ✅ 234 real Stack Overflow questions across 15 languages, not synthetic toy prompts
- ✅ Four complementary metrics handle free-form answers better than pass@k alone
- ✅ Public leaderboard with 100+ evaluated models for direct comparison
- ✅ Peer-reviewed at NeurIPS 2024 Datasets and Benchmarks Track
Cons
- ⚠️ Harness runs only on Linux and requires models in Hugging Face Transformers format
- ⚠️ Static 234-question set risks training-data contamination as it ages
- ⚠️ Research artifact, not a polished product — setup expects ML engineering comfort
Compare with similar tools
- Braintrust: Eval, monitor, and improve AI products end-to-end.
- LangSmith: LangChain's eval + observability platform.
- Weights & Biases: The ML experiment tracker, now with LLM eval features.
- Helicone: Open-source LLM observability with a one-line proxy install.
- Humanloop: Prompt management + evals for collaborative AI teams.
- PromptLayer: Lightweight prompt logging + management for OpenAI/Claude apps.