I Built an AI Data Steward. The Hard Part Wasn't the AI.

Our data catalog had hundreds of tables and almost no documentation. People asked the same questions in chat every week: who owns this, is it fresh, what does this column mean. So I built a pipeline that walks the catalog every day and writes a wiki page for each dataset using an LLM. It works, people use it, and the model is the part I think about least.

Here's where the time actually went.

Stop paying to regenerate things that didn't change

The naive version sends every table to a capable model every run. That's slow and the bill scales with your catalog, not with how much changed. Most tables don't change day to day.

I split it into two phases. A cheap, fast model gets the existing wiki and the current schema and answers one question: is this still accurate? If yes, the run skips the table and pays only for the cheap check. Only when the cheap check says "stale" does the expensive model write a new page. On top of that, a local cache remembers a hash of each table's schema and the last time we looked, so unchanged tables don't even reach the cheap check until a refresh interval passes.

The result is that a normal run touches a handful of tables and costs almost nothing. A schema change or a new table pulls exactly those datasets back into generation. Cost tracks change, which is what you want.
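A minimal sketch of that triage, under stated assumptions: the names (CacheEntry, decide, cheap_check) and the seven-day refresh interval are hypothetical, and cheap_check stands in for whatever call asks the fast model whether the page is still accurate.

```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed value for illustration; the real interval is a tuning choice.
REFRESH_INTERVAL_S = 7 * 24 * 3600


@dataclass
class CacheEntry:
    schema_hash: str   # hash of the schema at the last check
    checked_at: float  # unix timestamp of the last check


def schema_hash(schema: dict) -> str:
    """Hash a schema in a key-order-independent way."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def decide(
    schema: dict,
    wiki: Optional[str],
    entry: Optional[CacheEntry],
    cheap_check: Callable[[str, dict], bool],
    now: Optional[float] = None,
) -> str:
    """Return 'skip_cached', 'skip_verified', or 'regenerate'."""
    now = time.time() if now is None else now

    # Layer 1: local cache. Same schema and recently checked means no model call.
    if (
        entry is not None
        and entry.schema_hash == schema_hash(schema)
        and now - entry.checked_at < REFRESH_INTERVAL_S
    ):
        return "skip_cached"

    # A dataset with no page yet has nothing to verify.
    if wiki is None:
        return "regenerate"

    # Layer 2: the cheap model answers "is this page still accurate?"
    if cheap_check(wiki, schema):
        return "skip_verified"

    # Layer 3: only now is the expensive model worth calling.
    return "regenerate"
```

The ordering is the point: each layer is cheaper than the one after it, so the expensive call only happens once both cheaper checks have failed to rule it out.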

Make the output a contract, not a vibe

An LLM will happily return prose when you wanted a document, or a cheerful "Sure, here's your wiki!" preamble that corrupts the page. So the generation step has a contract: the model must return either a page that starts with a heading, or a single sentinel string meaning "no change needed." Anything else is a contract violation and gets rejected before it touches the catalog.

That one rule removed a whole category of garbage. When the model gets chatty, the run fails loudly on that table instead of silently publishing junk. I added a small normalizer for the common near-misses, like the model writing "the existing page is accurate" instead of the sentinel, so a predictable rephrase doesn't fail the run.
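As a rough sketch of what such a contract check can look like: the sentinel value NO_CHANGE, the Markdown-style "# " heading test, and the near-miss pattern are all assumptions made for illustration, not the actual strings the pipeline uses.

```python
import re
from typing import Optional

# Hypothetical sentinel; the real string is whatever the prompt specifies.
SENTINEL = "NO_CHANGE"

# Predictable rephrasings of "no change needed" that get normalized
# to the sentinel instead of failing the run (illustrative pattern).
_NEAR_MISS = re.compile(
    r"^(the )?(existing|current) (page|wiki) is (still )?(accurate|up to date)\.?$",
    re.IGNORECASE,
)


class ContractViolation(ValueError):
    """The model returned something that is neither a page nor the sentinel."""


def parse_model_output(raw: str) -> Optional[str]:
    """Return the new page text, or None when no change is needed.

    Raises ContractViolation for anything else, so junk never reaches
    the catalog.
    """
    text = raw.strip()

    if text == SENTINEL or _NEAR_MISS.match(text):
        return None

    # Assumes pages are Markdown and must open with a top-level heading.
    if text.startswith("# "):
        return text

    raise ContractViolation(f"unexpected model output: {text[:60]!r}")
```

With this shape, a chatty reply such as "Sure, here's your wiki!" followed by a page raises ContractViolation, because the first characters are not a heading.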

Idempotency, because it runs every day

Anything that runs on a schedule will get run twice. A retry, a manual kick, an overlap. So every record the pipeline writes is keyed by a run id, and a re-run deletes that run's rows before inserting. Run it three times in an afternoon and you get one clean set of rows, not three. None of this is clever. It's the difference between a pipeline you trust and one you babysit.
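The delete-then-insert pattern can be sketched with the standard library's sqlite3 module. The table name wiki_results, its columns, and the use of SQLite are assumptions for the example. What matters is that both statements share one transaction.

```python
import sqlite3
from typing import Iterable, Tuple


def create_table(conn: sqlite3.Connection) -> None:
    # Hypothetical results table, keyed logically by run_id.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_results ("
        "run_id TEXT NOT NULL, dataset TEXT NOT NULL, status TEXT NOT NULL)"
    )


def write_run(
    conn: sqlite3.Connection,
    run_id: str,
    rows: Iterable[Tuple[str, str]],
) -> None:
    """Replace all rows for run_id with the given (dataset, status) pairs."""
    # 'with conn' wraps both statements in one transaction: it commits on
    # success and rolls back on error, so a failed re-run cannot leave the
    # table with the old rows deleted and the new ones missing.
    with conn:
        conn.execute("DELETE FROM wiki_results WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO wiki_results (run_id, dataset, status) VALUES (?, ?, ?)",
            [(run_id, dataset, status) for dataset, status in rows],
        )
```

Calling write_run three times with the same run id and the same rows leaves exactly one copy of those rows in the table.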

The guardrail that matters most

The same code points at a development catalog and a production one, chosen by a flag. Early on, nothing stopped a stale config from pointing a run at the wrong catalog. So a run against anything but development now requires a second variable that has to match the target exactly. If it is missing or doesn't match, the run exits before the first call. The scheduler sets it deliberately. I keep it out of the shared credentials file on purpose, because the whole point is that you can't reach production by accident.
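A sketch of that guard, with assumed names: the variable CATALOG_CONFIRM_TARGET and the target label "dev" are hypothetical stand-ins for whatever the real pipeline uses.

```python
import os
from typing import Mapping, Optional

DEV_TARGET = "dev"                      # assumed label for the development catalog
CONFIRM_VAR = "CATALOG_CONFIRM_TARGET"  # hypothetical confirmation variable


def check_target(target: str, env: Optional[Mapping[str, str]] = None) -> None:
    """Exit unless a non-development target is explicitly confirmed.

    Development needs no confirmation. Any other target requires
    CONFIRM_VAR to be set to exactly the same value as the target.
    """
    env = os.environ if env is None else env

    if target == DEV_TARGET:
        return

    confirmed = env.get(CONFIRM_VAR)
    if confirmed != target:
        # SystemExit with a message prints it to stderr and exits non-zero.
        raise SystemExit(
            f"refusing to run against {target!r}: "
            f"{CONFIRM_VAR} is {confirmed!r}, expected {target!r}"
        )
```

The check is meant to be called first thing in the entry point, before any client is constructed, so a mismatch costs nothing and touches nothing. It fails closed: an unset variable is treated the same as a wrong one.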

Where the model fits

After all of that, the prompt is almost the easy part. It gets the schema, a few sample facts, and the previous page so the wording stays stable run to run. The interesting engineering is everything around it: deciding what to regenerate, enforcing the output shape, keeping writes idempotent, and refusing to run somewhere dangerous by default.

If you're putting an LLM into a pipeline, budget your time accordingly. The model is a component. The system is the work.