AI knows your code. It doesn't know your product.

How we documented a 10-service product so agents could review PRDs — and what it cost. Context, not model size, is the bottleneck for AI-driven development.


Picture your product: 10+ services, the usual tangle. Your PM brings a new feature request, and you need to ship it.

Assume your devs already work with Claude Code every day, and each service has its own CLAUDE.md describing what it does. At the code level, you’re already AI-assisted. A typical feature then moves through six steps:

  1. PRD review — analyze the PRD, spot the gaps, ask clarifying questions, confirm everyone understands the request.
  2. Tech design — describe the solution: which services change, how, and why.
  3. Task breakdown — create tasks per service, split across frontend and backend.
  4. Implementation — devs build their tasks.
  5. QA — test each task, then run full integration tests.
  6. Release — ship it.

AI helps at exactly one of these: step 4. Every other step still runs on raw human effort. So how do we boost the rest?

The blocker: AI doesn’t know your product

How could an AI review a PRD, or design changes across 10 services it’s never seen? It can’t. The missing ingredient is context — and not the per-service CLAUDE.md files, which describe each service in isolation, but an understanding of the product: how the services fit together and what’s already shipped.

Building the context

What I did:

  • Created a dedicated product repo.
  • Documented every service, with paths to each codebase.
  • Copied all existing PRDs and tech docs from Confluence into it.
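
A minimal sketch of what "every service, with paths to each codebase" could look like in machine-checkable form. Everything here is an assumption for illustration: the `Service` fields, the `missing_codebases` helper, and the example names and paths are hypothetical, not a required layout.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Service:
    """One entry in a hypothetical service index kept in the product repo."""
    name: str
    codebase: Path  # local path to the service checkout (assumed layout)
    summary: str


def missing_codebases(services: list[Service]) -> list[str]:
    """Return the names of services whose codebase path is not a directory."""
    return [s.name for s in services if not s.codebase.is_dir()]


if __name__ == "__main__":
    # Placeholder entries; replace with your own services and paths.
    index = [
        Service("billing", Path("../billing-service"), "Invoices and payments"),
        Service("web-app", Path("../web-app"), "Customer-facing frontend"),
    ]
    for name in missing_codebases(index):
        print(f"codebase not found for service: {name}")
```

A check like this is cheap to run before handing the repo to an agent, since a stale path means the agent silently analyzes less code than you think.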

Are those docs complete? Never — documentation always lags the code. So the next step closes the gap automatically: point an agent at the codebase to find everything undocumented — every page, every handler — and flag it. It’s heavy, but it’s a one-time, high-impact step.

The prompt

This repository contains the product documentation.

In /prd you'll find the existing PRDs.
In /architecture you'll find the existing technical docs, along with paths
to the service codebases that this product relies on.

Your task:
- Read the existing PRDs in /prd and the tech docs in /architecture to
  understand what is already documented.
- Analyze the codebases referenced in /architecture.
- Cross-reference the code against the existing documentation and identify
  everything that is not yet documented — frontend pages, backend handlers,
  services, jobs, integrations, and any other significant parts of the code.

End goal: a fully documented product. Produce a gap report for the PM that
clearly shows what currently has no documentation coverage.

Report format — output a Markdown document with the following sections:

1. Summary — total components found, how many are documented vs.
   undocumented, and overall documentation coverage (%).

2. Coverage table — one row per component:
   Component | Type (FE page / BE handler / service / job / integration) |
   Source path | Documented? (Yes / Partial / No) | Existing doc reference |
   Priority (High / Med / Low) | Notes

3. Undocumented items, grouped by service or domain, each with a one-line
   description of what it does (inferred from the code).

4. Recommendations — a prioritized list of what to document first and why.

For priority: user-facing and externally-exposed components rank highest,
internal/utility code lowest. If unsure whether something is documented,
mark it Partial and explain why in Notes.
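
The Summary figures requested in that prompt are plain arithmetic over the coverage table, so they are easy to recompute during review rather than taking the agent's totals on trust. A minimal sketch, assuming the "Documented?" column has already been extracted into a list of strings; the function name and the choice to count only "Yes" rows as covered are assumptions, not part of the prompt.

```python
from collections import Counter


def coverage_summary(statuses: list[str]) -> dict[str, float]:
    """Summarize the 'Documented?' column of the coverage table.

    Assumption: only rows marked "Yes" count as covered. "Partial" rows
    are reported separately instead of receiving partial credit.
    """
    counts = Counter(s.strip().lower() for s in statuses)
    unknown = set(counts) - {"yes", "partial", "no"}
    if unknown:
        raise ValueError(f"unexpected status values: {sorted(unknown)}")
    total = sum(counts.values())
    # Guard against an empty table to avoid dividing by zero.
    coverage = 100.0 * counts["yes"] / total if total else 0.0
    return {
        "total": total,
        "documented": counts["yes"],
        "partial": counts["partial"],
        "undocumented": counts["no"],
        "coverage_pct": round(coverage, 1),
    }


if __name__ == "__main__":
    print(coverage_summary(["Yes", "Partial", "No", "No"]))
```

Whether Partial should earn half credit is a judgment call; keeping it as its own count leaves that decision to the PM.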

A human pass on the report

The agent infers what undocumented code does by reading it, and it gets some of that wrong, so the report is a draft, not the verdict. The PM and tech lead walk through it once: confirm priorities, fix misreadings, sign off. This matters more than it looks: research on agent context files reports that naïve, unverified context can actually lower task success while raising cost by 20% or more. Bad context is worse than none. Work through the confirmed gaps and you have a fully documented product.

Keeping the docs alive

Docs rot. That’s why your Confluence was incomplete in the first place, and auto-generated docs drift the same way. Research on large-scale agent setups names specification staleness as the primary failure mode: the docs stop matching reality and the agent confidently builds on a lie.

The fix is a process rule, not another cleanup: updating the PRD and tech docs is part of the Definition of Done. If a change touches behavior, the doc changes in the same PR. The gap analysis happens once; after that the docs stay current because keeping them current is part of shipping.
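
A rule like this is easier to keep when a check enforces it. The sketch below is one coarse way to do that, assuming docs and code are visible in the same change set: it passes a PR unless code changed and no doc did. The folder names in `DOC_ROOTS` and `CODE_ROOTS` and the function names are hypothetical. It cannot tell whether a change actually touches behavior, so treat it as a reminder, not a gate that replaces review.

```python
from pathlib import PurePosixPath

# Assumed layout: docs live under these top-level folders, code under these.
DOC_ROOTS = ("prd", "architecture")
CODE_ROOTS = ("services",)


def top_level(path: str) -> str:
    """Return the first component of a repo-relative path, or '' if empty."""
    parts = PurePosixPath(path).parts
    return parts[0] if parts else ""


def docs_updated_with_code(changed_files: list[str]) -> bool:
    """True if the change set satisfies the docs-in-the-same-PR rule.

    A change set passes when it touches no code, or when it touches
    at least one file under a documentation root.
    """
    touches_code = any(top_level(f) in CODE_ROOTS for f in changed_files)
    touches_docs = any(top_level(f) in DOC_ROOTS for f in changed_files)
    return touches_docs or not touches_code


if __name__ == "__main__":
    changed = ["services/billing/handler.py"]
    if not docs_updated_with_code(changed):
        print("Code changed without a PRD or architecture update.")
```

In CI, the list of changed files could come from the repo-relative output of `git diff --name-only` against the target branch.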

The steps start to automate themselves

With product context in place, the steps automate one by one.

Start with PRD review. It’s where leverage is highest, because the requirements stage is the cheapest place to catch a problem. A defect caught here takes minutes to fix; the same defect in production costs an order of magnitude more. The PM submits a PRD, the agent reviews it against full product context, flags the gaps, and asks clarifying questions.

Tech design comes next: the agent reviews it the same way, or drafts it, since it already knows the services. From there the leverage runs down the line — task breakdown from the design, test cases from the PRD. I won’t enumerate them; the point is that context unlocks all of it. Without it, AI helps at step 4. With it, everywhere.

A concrete example

Here’s what it looked like on one of our products.

  • Scope: 10 services, ~40 frontend pages, ~130 backend handlers.
  • Gap analysis: the agent flagged ~90 undocumented components — about 65% of the product — at a cost of roughly 2 hours of agent time plus 4 hours of PM + tech-lead review.
  • Closing the gaps: 4 days, folded into normal sprint work.
  • After: a PRD review that took half a day of back-and-forth now takes ~1 hour — the agent returns gaps and clarifying questions on the first pass. Across the first 5 feature PRDs, it caught 6 ambiguities or missing cases that would previously have surfaced in QA or production.

Setup cost about a week of mostly part-time work. Each PRD since saves roughly half a day and catches issues that used to cost far more downstream.

Does this generalize?

The research suggests it does, and the effect shows up first at PRD review. An industrial study across three real-world requirements datasets found that LLMs improved detection of ambiguous requirements by ~20%, with experts rating the AI’s explanations 3.84 out of 5. That is exactly the PRD-review job. And the gains track context, not model size: in a five-model study, replacing a bare prompt with a structured one lifted requirement quality by up to 59%. The model is rarely the bottleneck. Context quality is.

The takeaway

Before you automate anything, you have to create context for the agents. It isn’t a nicety — it’s the mandatory first step of AI-driven development. Get it right, keep it current, and every step from review to release becomes something AI can help with, not just step 4.

Sources

  • Requirements Ambiguity Detection and Explanation with LLMs: An Industrial Study (ICSME 2025) — the ~20% improvement and 3.84/5 expert rating. ipr.mdu.se
  • Exploring the Use of LLMs for Requirements Extraction from User Stories (2025) — structured prompts improving requirement quality. researchsquare.com
  • Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development (2026) — the “naïve context lowers success” finding and specification staleness as the primary failure mode. arxiv.org
  • Cost-of-defect data (NIST, Capers Jones, CISQ) — contextqa.com and scopemaster.com