Five mistakes to avoid when building an AI agent
POCs that never reach production share an architecture problem, not a model problem.
The demo looks great and everyone is excited. Then production arrives: hallucinations, crashes, runaway costs. It's the same five mistakes, in the same order.
1. One mega-prompt that does everything
Pile ten jobs into one LLM call and each one gets done worse. Snip → analyze → decide → write: break them into separate, well-scoped calls.
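A minimal sketch of that split, assuming a hypothetical `LLM` function type that takes one prompt and resolves to text. The step names mirror the chain above; the prompts are placeholders, not recommended wording.

```typescript
// Hypothetical LLM client: takes one prompt, resolves to the model's text output.
type LLM = (prompt: string) => Promise<string>;

// Each step is one narrowly scoped call with its own short prompt.
interface Step {
  name: string;
  buildPrompt: (input: string) => string;
}

// Placeholder prompts, for illustration only.
const steps: Step[] = [
  { name: "snip", buildPrompt: (doc) => `Keep only the passages relevant to the request:\n${doc}` },
  { name: "analyze", buildPrompt: (snippet) => `List the key facts in:\n${snippet}` },
  { name: "decide", buildPrompt: (facts) => `Given these facts, choose one action:\n${facts}` },
  { name: "write", buildPrompt: (decision) => `Write a short reply that carries out:\n${decision}` },
];

// Run the steps in order, feeding each output into the next step.
// The trace keeps every intermediate output so each step can be inspected on its own.
async function runPipeline(llm: LLM, input: string, pipeline: Step[] = steps) {
  const trace: { step: string; output: string }[] = [];
  let current = input;
  for (const step of pipeline) {
    current = await llm(step.buildPrompt(current));
    trace.push({ step: step.name, output: current });
  }
  return { output: current, trace };
}
```

Because every step has its own prompt and its own output, a bad result can be traced to one call instead of one giant prompt.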
Practical test
If reading your prompt aloud leaves you breathless, it's too long.
2. Treating RAG as "embed everything, fetch nearest"
Naive RAG is your worst enemy: nearest-neighbor lookup alone misses exact keywords and returns loosely related chunks. What you actually need:
- Hybrid retrieval: vector + BM25 keyword, combined.
- Reranker: pull top 50, drop to top 5 with a cross-encoder.
- Context packing: pass each chunk to the model together with its citation and source metadata.
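The three pieces above can be sketched roughly as follows. The list only says "combined"; reciprocal rank fusion (RRF) is one common way to merge a vector ranking with a BM25 ranking, and it is the assumption made here. `CrossEncoder` is a hypothetical scoring function standing in for a real reranker model (which would normally be asynchronous), and the `Chunk` shape is invented for the example.

```typescript
interface Chunk {
  id: string;
  text: string;
  source: string; // e.g. a file name or URL, kept for citations
}

// Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank).
// k = 60 is a commonly used default, not a tuned value.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  rankings.forEach((ranking) => {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  });
  const ids: string[] = [];
  scores.forEach((_score, id) => ids.push(id));
  return ids.sort((a, b) => (scores.get(b) || 0) - (scores.get(a) || 0));
}

// Hypothetical cross-encoder: scores one (query, passage) pair, higher is better.
type CrossEncoder = (query: string, passage: string) => number;

// Take a wide candidate pool, rescore each candidate against the query, keep the best few.
function rerank(query: string, candidates: Chunk[], score: CrossEncoder, poolSize = 50, keep = 5): Chunk[] {
  return candidates
    .slice(0, poolSize)
    .map((chunk) => ({ chunk, s: score(query, chunk.text) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, keep)
    .map((x) => x.chunk);
}

// Pack the context with numbered citations so the model can point back at its sources.
function packContext(chunks: Chunk[]): string {
  return chunks.map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.text}`).join("\n\n");
}
```

RRF works on ranks rather than raw scores, so vector similarity and BM25 scores never have to be put on the same scale.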
3. Shipping without evals
An eval set is the only reliable way to catch behavior drift. Keep 20–50 golden cases and run them on every change:
- Tweak a prompt → run evals, never ship a regression.
- Upgrade a model → run evals, verify it's actually better.
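// Run every golden case through the agent and collect expected vs. actual output.
const cases = await loadEvalCases();
const results = await Promise.all(
  cases.map(async (c) => ({
    input: c.input,
    expected: c.expected,
    actual: await agent.run(c.input),
  }))
);
// Score each case and print the results as a table.
console.table(results.map(scoreOne));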
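The snippet above calls `scoreOne` without defining it. One possible version, assuming outputs are plain strings and that normalized exact match is good enough, is sketched below along with a hypothetical `assertNoRegression` gate for CI. Real suites often need fuzzier checks (substring, regex, or an LLM judge).

```typescript
interface EvalResult {
  input: string;
  expected: string;
  actual: string;
}

// Simplest possible scorer: exact match after trimming and lowercasing.
function scoreOne(r: EvalResult) {
  const normalize = (s: string) => s.trim().toLowerCase();
  return { input: r.input, pass: normalize(r.actual) === normalize(r.expected) };
}

// Fraction of cases that pass, between 0 and 1.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => scoreOne(r).pass).length / results.length;
}

// Gate for CI: refuse to ship when the candidate scores below the stored baseline.
function assertNoRegression(baselineRate: number, results: EvalResult[]): void {
  const rate = passRate(results);
  if (rate < baselineRate) {
    throw new Error(`Eval regression: ${rate.toFixed(2)} < baseline ${baselineRate.toFixed(2)}`);
  }
}
```

Store the pass rate of the last shipped version as the baseline, and run the gate on every prompt tweak or model upgrade.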
4. Designing tool calling as "one endpoint, one tool"
An agent faced with 50 tools gets lost. Compose them instead: not find_customer but search_customers(filter, fields). Keep tools few and flexible.
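Here is roughly what a `search_customers(filter, fields)` tool might look like. The `Customer` shape, the in-memory `db` argument, and the schema object are all assumptions for illustration; the exact tool-definition format depends on your LLM provider.

```typescript
interface Customer {
  id: string;
  name: string;
  email: string;
  plan: "free" | "pro";
  country: string;
}

// One flexible tool instead of one tool per lookup.
// `filter` narrows the rows, `fields` narrows the columns returned to the model.
interface SearchCustomersArgs {
  filter: Partial<Customer>;
  fields: (keyof Customer)[];
}

// Tool description in a JSON-Schema-like shape (provider formats differ).
const searchCustomersTool = {
  name: "search_customers",
  description: "Search customers by any combination of fields; returns only the requested fields.",
  parameters: {
    type: "object",
    properties: {
      filter: { type: "object", description: 'Exact-match constraints, e.g. {"plan": "pro"}' },
      fields: { type: "array", items: { type: "string" }, description: "Fields to return" },
    },
    required: ["filter", "fields"],
  },
};

// In-memory stand-in for a real database query.
function searchCustomers(db: Customer[], args: SearchCustomersArgs): Record<string, string>[] {
  const keys = Object.keys(args.filter) as (keyof Customer)[];
  return db
    .filter((c) => keys.every((k) => c[k] === args.filter[k]))
    .map((c) => {
      const row: Record<string, string> = {};
      args.fields.forEach((f) => {
        row[f] = c[f];
      });
      return row;
    });
}
```

Returning only the requested fields also keeps tool output small, which matters once results are fed back into the context window.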
5. Bolting human-in-the-loop on after the fact
If an approval layer isn't in the design from day one, you'll be writing if-trees forever. Build an explicit approval queue for material actions (send email, take payment, archive customer).
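A minimal sketch of such a queue, assuming an in-memory store and a hypothetical `execute` callback that performs the action. The action names and the choice of which ones count as material are illustrative; a production version would persist the queue and record who approved what.

```typescript
type ActionKind = "send_email" | "take_payment" | "archive_customer" | "lookup_order";

interface Action {
  kind: ActionKind;
  payload: Record<string, unknown>;
}

// Actions with real-world side effects. Which ones belong here is a policy decision.
const MATERIAL: ActionKind[] = ["send_email", "take_payment", "archive_customer"];

interface PendingApproval {
  id: number;
  action: Action;
  status: "pending" | "approved" | "rejected";
}

class ApprovalQueue {
  private items: PendingApproval[] = [];
  private nextId = 1;
  private execute: (action: Action) => void;

  // `execute` is whatever actually performs the action (hypothetical callback).
  constructor(execute: (action: Action) => void) {
    this.execute = execute;
  }

  // Agent-facing entry point: safe actions run now, material ones wait for a human.
  submit(action: Action): "executed" | "queued" {
    if (MATERIAL.indexOf(action.kind) === -1) {
      this.execute(action);
      return "executed";
    }
    this.items.push({ id: this.nextId++, action, status: "pending" });
    return "queued";
  }

  pending(): PendingApproval[] {
    return this.items.filter((i) => i.status === "pending");
  }

  // Human-facing entry point: approving runs the action, rejecting drops it.
  review(id: number, approve: boolean): void {
    const item = this.items.filter((i) => i.id === id && i.status === "pending")[0];
    if (!item) throw new Error(`No pending approval with id ${id}`);
    item.status = approve ? "approved" : "rejected";
    if (approve) this.execute(item.action);
  }
}
```

The agent only ever calls `submit`, so the approval rule lives in one place instead of being scattered across if-trees.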
Cost warning
An agent without a control plane can burn the monthly API budget overnight. Rate limits and per-task budgets belong in the infrastructure the agent runs on, not in the agent's own application code.
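A per-task budget can be as small as the sketch below. `TaskBudget` and `withBudget` are hypothetical names, the cost per call is an estimate the caller supplies, and a real control plane would also enforce rate limits and persist spend across processes.

```typescript
// Tracks spend for one task and refuses further calls once a cap is hit.
class TaskBudget {
  private spentUsd = 0;
  private calls = 0;
  private maxUsd: number;
  private maxCalls: number;

  constructor(maxUsd: number, maxCalls: number) {
    this.maxUsd = maxUsd;
    this.maxCalls = maxCalls;
  }

  // Call before each LLM request with an estimated cost; throws instead of overspending.
  charge(estimatedUsd: number): void {
    if (this.calls + 1 > this.maxCalls) throw new Error("Task exceeded its call limit");
    if (this.spentUsd + estimatedUsd > this.maxUsd) throw new Error("Task exceeded its budget");
    this.calls += 1;
    this.spentUsd += estimatedUsd;
  }

  remainingUsd(): number {
    return this.maxUsd - this.spentUsd;
  }
}

// Wrap any agent step so it cannot run without passing the budget check first.
function withBudget<T>(budget: TaskBudget, estimatedUsd: number, step: () => T): T {
  budget.charge(estimatedUsd);
  return step();
}
```

A loop that retries forever now fails fast with an error instead of showing up on next month's invoice.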
Five mistakes, one theme: build the agent like software. Tests, monitoring, budgets, rollback. The discipline you apply to every other production system applies here too.