Weco AI
Autoresearch engine that iteratively rewrites code to optimize against a numeric evaluation metric.
Pick Weco AI if you have a measurable objective (kernel speed, model accuracy, prompt score) and want an agent to iterate against it autonomously.
Skip it if your problem has no single numeric metric, or if you just need a one-time refactor a normal coding copilot can do in one pass.
Weco AI is a code-optimization platform built around the AIDE algorithm, which pairs LLM-driven code proposals with tree search to iteratively improve a codebase against a user-supplied evaluation script. You hand it source code plus an eval that emits a number (latency, accuracy, memory, cost, throughput, quality score), and it loops: propose change, run eval, read metric, branch on what improved. The team describes the product as 'recursively self-improving AI' and ships a CLI (weco-cli) backed by docs at docs.weco.ai.
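The propose, evaluate, branch loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Weco's or AIDE's actual implementation: a random perturbation stands in for the LLM proposal step, an in-process function stands in for the eval script, and the names `Node`, `evaluate`, `propose`, `optimize`, `block_size`, and `unroll` are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate solution in the search tree."""
    params: dict
    metric: float
    parent: "Node | None" = None
    children: list = field(default_factory=list)


def evaluate(params: dict) -> float:
    """Stand-in for the user's eval script: returns one number, higher is better."""
    # Hypothetical objective that peaks at block_size=128, unroll=4.
    return -abs(params["block_size"] - 128) / 128 - abs(params["unroll"] - 4) / 4


def propose(params: dict, rng: random.Random) -> dict:
    """Stand-in for the LLM proposal step: perturb one setting at random."""
    new = dict(params)
    if rng.random() < 0.5:
        new["block_size"] = max(16, new["block_size"] + rng.choice([-32, -16, 16, 32]))
    else:
        new["unroll"] = max(1, new["unroll"] + rng.choice([-1, 1]))
    return new


def optimize(initial: dict, steps: int, seed: int = 0) -> Node:
    """Greedy tree search: always branch from the best node seen so far."""
    rng = random.Random(seed)
    root = Node(initial, evaluate(initial))
    nodes = [root]
    for _ in range(steps):
        parent = max(nodes, key=lambda n: n.metric)   # pick the branch point
        candidate = propose(parent.params, rng)       # propose a change
        child = Node(candidate, evaluate(candidate), parent=parent)  # run eval
        parent.children.append(child)                 # record the branch
        nodes.append(child)
    return max(nodes, key=lambda n: n.metric)


best = optimize({"block_size": 32, "unroll": 1}, steps=50)
print(best.params, round(best.metric, 3))
```

Because the starting point stays in the tree, the returned node can never score worse than the original code. A real system would replace `propose` with a model call and `evaluate` with a subprocess that runs the user's eval.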
It's aimed at ML and systems engineers who have problems where the optimum isn't obvious and brute-force experimentation pays off: GPU kernel tuning (CUDA, Triton), model architecture tweaks, prompt engineering with measurable scoring, and general perf work. The same group is behind AIDE (the agent that posted human-level results on Kaggle-style data science competitions) and the Aiden agent that placed top in OpenAI's hiring challenge, so the research pedigree is real. Pricing isn't published on the marketing site; the CLI is open and the hosted autoresearch service appears to be the commercial layer.
It's language-agnostic and hardware-agnostic because the only contract is 'your eval prints a number.' That makes it powerful for the niche it serves and useless for tasks where success can't be expressed numerically or where a one-shot edit would do.
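To make the "your eval prints a number" contract concrete, here is a minimal sketch of what such an eval script might look like for a latency objective. The `workload` function, the metric name `latency_ms`, and the `name: value` output format are assumptions for illustration; check docs.weco.ai for the format the CLI actually expects.

```python
import statistics
import time


def workload(n: int = 50_000) -> int:
    """Hypothetical code under optimization: a sum of squares."""
    return sum(i * i for i in range(n))


def measure_latency_ms(fn, repeats: int = 5) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    latency = measure_latency_ms(workload)
    # The "name: value" line format is an assumption for illustration.
    print(f"latency_ms: {latency:.3f}")
```

Taking the median over several runs keeps a single noisy measurement from steering the search, which matters when an agent is branching on small differences in the metric.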
Weco is one of the more intellectually honest 'agent' products out there: it refuses to pretend it can optimize what you can't measure. For ML and systems engineers with a real eval harness, the AIDE-driven loop is a credible alternative to hand-tuning. Outside that niche, it's not the tool you want.
— The AI Tool Bible editorial team
Pros
- ✅ Metric-driven optimization loop is principled, not vibes-based
- ✅ Language- and hardware-agnostic: only needs a numeric eval
- ✅ Strong research pedigree (AIDE, Aiden, SpecBench)
- ✅ Open CLI (weco-cli) lowers integration friction
- ✅ Genuinely useful for GPU kernel and ML perf work
Cons
- ⚠️ Only works when success can be expressed as a single number
- ⚠️ Pricing for hosted product not publicly disclosed
- ⚠️ Overkill for one-shot code edits or qualitative tasks
- ⚠️ Smaller community than mainstream AI eval tools
Compare with similar tools
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.