Berkeley Function-Calling Leaderboard

✓ Editorially verified

Open benchmark from UC Berkeley that ranks LLMs on real-world tool-use and function-calling accuracy.

Free · Free and open source; you pay only for inference when reproducing runs. · Evaluation · Multi-model

Best for

Pick BFCL if you're choosing an LLM for an agent or tool-using app and need defensible numbers on function-calling quality.

Skip if

Skip it if you want a general chatbot leaderboard or a turnkey eval SaaS with dashboards and team features.

The Berkeley Function-Calling Leaderboard (BFCL) is a public evaluation benchmark from UC Berkeley's Gorilla research group that measures how well large language models execute function calls and use tools. Now on version 4, it scores both native function-calling and prompt-based approaches on multi-turn dialogue, web-search-augmented calls, and format sensitivity, and reports an overall accuracy that is unweighted across sub-categories. Cost and latency are tracked alongside correctness.
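To make "unweighted accuracy across sub-categories" concrete, here is a minimal Python sketch. The category names and counts are invented for illustration and are not BFCL results, and the functions are not part of the BFCL harness. The point is only that an unweighted average gives every sub-category the same say, however many test cases it holds, whereas pooling all test cases lets the largest category dominate.

```python
from statistics import mean

# Hypothetical (correct, total) counts per sub-category; NOT real BFCL data.
results = {
    "multi_turn": (30, 100),
    "web_search": (45, 50),
    "format_sensitivity": (380, 400),
}


def per_category_accuracy(results):
    """Accuracy within each sub-category."""
    return {name: correct / total for name, (correct, total) in results.items()}


def unweighted_accuracy(results):
    """Mean of per-category accuracies; every sub-category counts equally."""
    return mean(per_category_accuracy(results).values())


def pooled_accuracy(results):
    """Accuracy over all test cases pooled together; large categories dominate."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


print(f"unweighted: {unweighted_accuracy(results):.3f}")  # 0.717
print(f"pooled:     {pooled_accuracy(results):.3f}")      # 0.827
```

With these made-up numbers, the weak multi-turn score pulls the unweighted figure well below the pooled one, which is why it helps to know which aggregation a leaderboard column uses before comparing models on it.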

It's aimed at researchers, framework authors, and engineering teams who are picking a model for an agent stack and need a more rigorous answer than vendor marketing. The benchmark, dataset, and harness are open source on GitHub and pinnable via the `bfcl-eval` PyPI package, so you can reproduce numbers locally rather than trust the dashboard alone. The leaderboard itself is free; the only cost is the inference you spend re-running it. Work behind BFCL was presented at ICML 2025.
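If you do pin `bfcl-eval` for a reproduction, it can help to confirm at run time that the environment actually has the version you recorded. The sketch below uses only Python's standard `importlib.metadata`; the helper names and the `PINNED` value are placeholders chosen for illustration, not anything defined by the BFCL project.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def matches_pin(dist_name, pinned):
    """True only when the distribution is installed at exactly the pinned version."""
    found = installed_version(dist_name)
    return found is not None and found == pinned


if __name__ == "__main__":
    # PINNED is a placeholder; substitute the version you actually recorded.
    PINNED = "0.0.0"
    found = installed_version("bfcl-eval")
    if found is None:
        print("bfcl-eval is not installed in this environment")
    elif matches_pin("bfcl-eval", PINNED):
        print(f"bfcl-eval {found} matches the pin")
    else:
        print(f"bfcl-eval {found} differs from pinned {PINNED}")
```

Logging the resolved version next to your scores makes it easier to explain later why two local runs disagree.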

The site also ships a live demo for poking at function-calling behavior with custom prompts and tool schemas, a 'Wagon Wheel' radar chart for side-by-side model comparison, and a Discord for community contributions and new task submissions. It is not a product you buy; it's reference infrastructure for the agent-tooling ecosystem.
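For readers new to the idea of a tool schema, the following Python sketch shows roughly what one looks like and how a proposed call can be sanity-checked against it. Everything here is hypothetical: the `get_weather` tool, the `check_call` helper, and the shallow type check are illustrative assumptions, and this is far simpler than BFCL's own evaluation logic.

```python
# Hypothetical tool schema in the JSON-Schema style many function-calling APIs use.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}

# Map JSON-Schema type names to Python types for a shallow type check.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_call(tool, call):
    """Return a list of problems with a proposed call; an empty list means it passes."""
    problems = []
    if call.get("name") != tool["name"]:
        problems.append(f"wrong function name: {call.get('name')!r}")
        return problems
    spec = tool["parameters"]
    args = call.get("arguments", {})
    for name in spec.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        prop = spec["properties"].get(name)
        if prop is None:
            problems.append(f"unknown argument: {name}")
            continue
        expected = JSON_TYPES.get(prop["type"])
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_bool_mismatch = isinstance(value, bool) and prop["type"] != "boolean"
        if expected and (is_bool_mismatch or not isinstance(value, expected)):
            problems.append(f"wrong type for {name}: expected {prop['type']}")
    return problems


good = {"name": "get_weather", "arguments": {"city": "Berkeley", "days": 3}}
bad = {"name": "get_weather", "arguments": {"days": "three", "units": "metric"}}

print(check_call(WEATHER_TOOL, good))  # []
print(check_call(WEATHER_TOOL, bad))
```

A structural check like this only catches malformed calls; whether the model picked the right function and sensible argument values for a given prompt is the harder question a benchmark is meant to answer.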

Editor's take

BFCL is one of the few function-calling benchmarks serious agent teams actually cite, and the fact that the harness is pip-installable means you can verify the numbers instead of squinting at a webpage. Treat it as a strong prior, then re-run on your own tool schemas before committing a model.

— The AI Tool Bible editorial team