📖 The AI Tool Bible

InfiBench

Stack Overflow-derived benchmark for evaluating code LLMs on real-world programming questions.

Free · Free and open source (CC BY-SA 4.0) · Evaluation
Best for

Pick InfiBench if you need a code-LLM benchmark that reflects messy real-world developer questions rather than clean function-completion tasks.

Skip if

Skip it if you want a hosted eval-as-a-service or a benchmark covering agentic, multi-file, or repo-scale coding tasks.

InfiBench is an open-source evaluation benchmark for code-focused large language models, built from 234 carefully curated Stack Overflow questions spanning 15 programming languages. Unlike HumanEval or MBPP, which test pure code generation from synthetic prompts, InfiBench measures how well a model answers the kinds of mixed-format questions developers actually ask: debugging help, API usage, configuration, framework quirks, and idiomatic patterns. It was introduced at the NeurIPS 2024 Datasets and Benchmarks Track and ships with domain-expert-curated correctness criteria for every question.

The benchmark uses four evaluation metrics — keyword matching, fill-in-the-blank, unit testing, and dialogue similarity — to handle the heterogeneous nature of real Q&A responses. The project maintains a public leaderboard with 100+ evaluated models, making it useful as a reference point when comparing code-LLM capability beyond toy coding tasks. It is aimed squarely at researchers, model builders, and engineering teams choosing which code model to deploy.
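To make the keyword-matching idea concrete, here is a minimal Python sketch of how such a metric could score a free-form answer. It is an illustration only, not InfiBench's actual scoring code: the `Keyword` class, the `keyword_score` function, the weights, and the sample criteria are all hypothetical names and values chosen for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Keyword:
    """One hypothetical correctness criterion for a question."""
    pattern: str          # regular expression to look for in the response
    weight: float = 1.0   # relative importance of this criterion


def keyword_score(response: str, keywords: list[Keyword]) -> float:
    """Return the weighted fraction of keywords found in the response (0.0 to 1.0)."""
    total = sum(k.weight for k in keywords)
    if total == 0:
        return 0.0
    matched = sum(
        k.weight
        for k in keywords
        if re.search(k.pattern, response, flags=re.IGNORECASE)
    )
    return matched / total


# Hypothetical criteria for a "how do I left-join two DataFrames?" question.
criteria = [
    Keyword(r"\bpd\.merge\b|\.merge\(", weight=2.0),
    Keyword(r"how\s*=\s*['\"]left['\"]", weight=1.0),
]

answer = "Use df_a.merge(df_b, on='id', how='left') to keep every row of df_a."
score = keyword_score(answer, criteria)
print(f"keyword score: {score:.2f}")
```

A real harness would pair a scorer like this with the other three metric types and pick the one that suits each question, since a single metric rarely fits every kind of answer.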

Running it requires a Linux environment and Hugging Face Transformers-format models. The code and dataset are released under CC BY-SA 4.0 via GitHub. There is no hosted SaaS — InfiBench is a benchmark harness, not a product you log into.

Editor's take

InfiBench fills a real gap: most code benchmarks score isolated function writing, but engineers spend their day on Stack Overflow-shaped problems. The four-metric design is pragmatic, and the public leaderboard makes it immediately useful for model selection. Treat it as one signal among several, not a single source of truth.

— The AI Tool Bible editorial team

Pros

  • 234 real Stack Overflow questions across 15 languages, not synthetic toy prompts
  • Four complementary metrics handle free-form answers better than pass@k alone
  • Public leaderboard with 100+ evaluated models for direct comparison
  • Peer-reviewed at NeurIPS 2024 Datasets and Benchmarks Track

Cons

  • ⚠️ Linux-only harness with Hugging Face Transformers format requirement
  • ⚠️ Static 234-question set risks contamination as it ages
  • ⚠️ Research artifact, not a polished product — setup expects ML engineering comfort

Use cases

code-llm-eval · model-benchmarking · leaderboard-comparison · research
