Berkeley Function-Calling Leaderboard
✓ Editorially verified
Open benchmark from UC Berkeley that ranks LLMs on real-world tool-use and function-calling accuracy.
Pick BFCL if you're choosing an LLM for an agent or tool-using app and need defensible numbers on function-calling quality.
Skip it if you want a general chatbot leaderboard or a turnkey eval SaaS with dashboards and team features.
The Berkeley Function-Calling Leaderboard (BFCL) is a public evaluation benchmark from UC Berkeley's Gorilla research group that measures how well large language models execute function calls and use tools. Now on version 4, it scores both native function-calling and prompt-based approaches across categories that include multi-turn dialogue, web-search-augmented calls, and format sensitivity. The overall score is an unweighted accuracy across sub-categories, and cost and latency are tracked alongside correctness.
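To make "unweighted accuracy across sub-categories" concrete, here is a minimal sketch that contrasts it with a pooled (size-weighted) accuracy. The category names and counts are invented for illustration, and the functions are hypothetical helpers, not part of BFCL's harness; the leaderboard's exact aggregation may differ in detail.

```python
from statistics import mean

def unweighted_accuracy(per_category):
    """Average the per-category accuracies, so every category counts equally."""
    return mean(correct / total for correct, total in per_category.values())

def pooled_accuracy(per_category):
    """Divide all correct answers by all test cases, so large categories dominate."""
    correct = sum(c for c, _ in per_category.values())
    total = sum(t for _, t in per_category.values())
    return correct / total

# Made-up (correct, total) counts per category, for illustration only.
scores = {
    "multi_turn": (60, 200),
    "web_search": (45, 100),
    "format_sensitivity": (90, 100),
}

print(f"unweighted: {unweighted_accuracy(scores):.4f}")
print(f"pooled:     {pooled_accuracy(scores):.4f}")
```

With an unweighted average, a weak result in a large category pulls the overall score down no more than a weak result in a small one, which is worth keeping in mind when comparing models whose strengths sit in different categories.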
It's aimed at researchers, framework authors, and engineering teams who are picking a model for an agent stack and need a more rigorous answer than vendor marketing. The benchmark, dataset, and harness are open source on GitHub and pinnable via the `bfcl-eval` PyPI package, so you can reproduce numbers locally rather than trust the dashboard alone. The leaderboard itself is free; the only cost is the inference you spend re-running it. Work behind BFCL was presented at ICML 2025.
The site also ships a live demo for poking at function-calling behavior with custom prompts and tool schemas, a 'Wagon Wheel' radar chart for side-by-side model comparison, and a Discord for community contributions and new task submissions. It is not a product you buy; it's reference infrastructure for the agent-tooling ecosystem.
BFCL is one of the few function-calling benchmarks serious agent teams actually cite, and the fact that the harness is pip-installable means you can verify the numbers instead of squinting at a webpage. Treat it as a strong prior, then re-run on your own tool schemas before committing a model.
— The AI Tool Bible editorial team
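The advice to re-run on your own tool schemas can start as small as the sketch below. It is a simplified exact-match check, not BFCL's own scoring, and every name in it (`call_matches`, `accuracy`, `stub_model`, the `get_weather` tool) is hypothetical. It assumes your model wrapper returns the function call as a JSON string with `name` and `arguments` keys.

```python
import json

def call_matches(expected, actual_json):
    """True if the model's call names the expected function with identical arguments."""
    try:
        actual = json.loads(actual_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    if not isinstance(actual, dict):
        return False
    return (actual.get("name") == expected["name"]
            and actual.get("arguments") == expected["arguments"])

def accuracy(cases, run_model):
    """Fraction of cases where run_model(prompt) produces the expected call."""
    hits = sum(call_matches(case["expected"], run_model(case["prompt"])) for case in cases)
    return hits / len(cases)

# Hypothetical test cases built around one of your own tools.
cases = [
    {"prompt": "Weather in Paris?",
     "expected": {"name": "get_weather", "arguments": {"city": "Paris"}}},
    {"prompt": "Weather in Oslo in Fahrenheit?",
     "expected": {"name": "get_weather", "arguments": {"city": "Oslo", "unit": "F"}}},
]

def stub_model(prompt):
    """Stand-in for a real model call; swap in your own API client here."""
    canned = {
        "Weather in Paris?": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
        "Weather in Oslo in Fahrenheit?": '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
    }
    return canned[prompt]

print(accuracy(cases, stub_model))
```

Exact matching is deliberately strict: it flags a dropped optional argument as a failure, which is usually what you want to know before committing a model to production tools.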
Pros
- ✅ Reproducible: open dataset, harness, and pip-installable eval package
- ✅ Covers multi-turn, web search, and format sensitivity, not just single-shot calls
- ✅ Tracks cost and latency alongside accuracy
- ✅ Backed by peer-reviewed work (ICML 2025) and actively updated
- ✅ Interactive demo lets you sanity-check models on your own schemas
Cons
- ⚠️ Academic benchmark, not a managed product or SLA
- ⚠️ Function-calling focus only; not a general LLM leaderboard
- ⚠️ Reproducing top runs can get expensive on frontier APIs
Compare with similar tools
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.