📖 The AI Tool Bible

Berkeley Function-Calling Leaderboard

✓ Editorially verified

Open benchmark from UC Berkeley that ranks LLMs on real-world tool-use and function-calling accuracy.

Free · Free and open source; you pay only for inference when reproducing runs.
Evaluation · Multi-model

Best for

Pick BFCL if you're choosing an LLM for an agent or tool-using app and need defensible numbers on function-calling quality.

Skip if

Skip it if you want a general chatbot leaderboard or a turnkey eval SaaS with dashboards and team features.

The Berkeley Function-Calling Leaderboard (BFCL) is a public evaluation benchmark from UC Berkeley's Gorilla research group that measures how well large language models make function calls and use tools. Now on version 4, it scores both native function-calling and prompt-based approaches on multi-turn dialogue, web-search-augmented calls, and format sensitivity, and it reports an overall accuracy that is unweighted across sub-categories. Cost and latency are tracked alongside correctness.
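
To make "unweighted across sub-categories" concrete, here is a minimal sketch of what such an aggregate means: every sub-category counts equally, however many test cases it holds. The category names and scores below are hypothetical, not real leaderboard results, and BFCL's own aggregation code may differ in detail.

```python
from statistics import mean


def unweighted_overall(per_category_accuracy: dict[str, float]) -> float:
    """Average sub-category accuracies, giving each category equal weight."""
    if not per_category_accuracy:
        raise ValueError("need at least one sub-category score")
    return mean(per_category_accuracy.values())


# Hypothetical sub-category accuracies (illustrative names and values only).
scores = {
    "multi_turn": 0.40,
    "web_search": 0.60,
    "format_sensitivity": 0.80,
}

overall = unweighted_overall(scores)
print(f"overall (unweighted): {overall:.2f}")
```

One consequence of this design is that a small sub-category moves the overall score as much as a large one, so it is worth reading the per-category numbers as well as the headline figure.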

It's aimed at researchers, framework authors, and engineering teams who are picking a model for an agent stack and need a more rigorous answer than vendor marketing. The benchmark, dataset, and harness are open source on GitHub and pinnable via the `bfcl-eval` PyPI package, so you can reproduce numbers locally rather than trust the dashboard alone. The leaderboard itself is free; the only cost is the inference you spend re-running it. Work behind BFCL was presented at ICML 2025.
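
If you pin `bfcl-eval` for reproducibility, it can help to record and check the installed version next to your results. The sketch below uses only Python's standard library; the pinned version string is a placeholder you would replace with whatever release you actually installed, and `installed_version` and `matches_pin` are helper names invented for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package: str, pinned: str) -> bool:
    """True only when the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


# Placeholder pin: substitute the release you actually used for your runs.
PINNED = "0.0.0"

found = installed_version("bfcl-eval")
if found is None:
    print("bfcl-eval is not installed in this environment")
elif not matches_pin("bfcl-eval", PINNED):
    print(f"bfcl-eval {found} is installed, but results were pinned to {PINNED}")
else:
    print(f"bfcl-eval {found} matches the pin")
```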

The site also ships a live demo for poking at function-calling behavior with custom prompts and tool schemas, a 'Wagon Wheel' radar chart for side-by-side model comparison, and a Discord for community contributions and new task submissions. It is not a product you buy; it's reference infrastructure for the agent-tooling ecosystem.

Editor's take

BFCL is one of the few function-calling benchmarks serious agent teams actually cite, and the fact that the harness is pip-installable means you can verify the numbers instead of squinting at a webpage. Treat it as a strong prior, then re-run on your own tool schemas before committing a model.

— The AI Tool Bible editorial team
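
The advice to re-run on your own tool schemas can be sketched as a tiny local check. This is not BFCL's harness or its scoring logic; it is a simplified exact-match comparison, assuming the model returns a JSON object with `name` and `arguments` keys. The `get_weather` tool and the sample outputs are hypothetical.

```python
import json


def call_matches(expected: dict, raw_model_output: str) -> bool:
    """Exact-match check of a model's function call against the expected call."""
    try:
        actual = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return False  # Output was not valid JSON at all.
    if not isinstance(actual, dict):
        return False
    return (
        actual.get("name") == expected["name"]
        and actual.get("arguments") == expected["arguments"]
    )


def accuracy(cases: list[tuple[dict, str]]) -> float:
    """Fraction of (expected_call, raw_model_output) pairs that match."""
    if not cases:
        raise ValueError("need at least one test case")
    return sum(call_matches(exp, out) for exp, out in cases) / len(cases)


# Hypothetical expected call and two hypothetical model outputs.
expected = {"name": "get_weather", "arguments": {"city": "Berkeley"}}
cases = [
    (expected, '{"name": "get_weather", "arguments": {"city": "Berkeley"}}'),
    (expected, '{"name": "get_weather", "arguments": {"city": "Oakland"}}'),
]
print(f"accuracy on my schemas: {accuracy(cases):.2f}")
```

Exact match is deliberately strict and will penalize harmless variations such as extra optional arguments, so treat the result as a rough signal to compare against the leaderboard numbers, not a replacement for them.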

Pros

  • Reproducible: open dataset, harness, and pip-installable eval package
  • Covers multi-turn, web search, and format sensitivity, not just single-shot calls
  • Tracks cost and latency alongside accuracy
  • Backed by peer-reviewed work (ICML 2025) and actively updated
  • Interactive demo lets you sanity-check models on your own schemas

Cons

  • ⚠️ Academic benchmark, not a managed product or SLA
  • ⚠️ Function-calling focus only; not a general LLM leaderboard
  • ⚠️ Reproducing top runs can get expensive on frontier APIs

Use cases

  • function-calling eval
  • tool-use benchmarking
  • agent model selection
  • multi-turn eval
  • cost/latency comparison
