Been building production agentic AI (agents that execute, not just chat) for the last year. Most demos crash in the real world because:
- RAG chunking/embedding choices destroy retrieval quality (we ended up hybrid Pinecone + FAISS + strict thresholds)
- ReAct loops run away and burn 10× tokens (memory caps + probabilistic exits fixed it)
- No real error handling for flaky APIs or garbage input (custom backoff + fallback LLMs)
- Governance logging without killing speed (event sourcing + replayable traces)
End result: 60% less human touch, but only after sweating these details.
What’s the nastiest production surprise you’ve hit with agentic builds?