AI / Platform
AI data steward + MCP server
Documents a data catalog on a daily run and answers questions about it from inside an engineer's editor.
Problem
Hundreds of datasets, almost no docs, and the same three questions every week: who owns this, is it fresh, what does this column mean.
What I built
A daily pipeline documents the catalog with an LLM. A cheap freshness check runs before any expensive generation, so cost tracks change rather than table count. Around the pipeline sit three pieces: an observability layer (slow-query detection, permission auditing, and catalog-health checks that alert to chat), state kept in an Iceberg table, and a conversational MCP server that engineers query from their editor. It also ships a push-button Superset dashboard generator and a knowledge graph built over lineage and query patterns.
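The freshness gate can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the `Dataset` type, the `last_modified` field, and the `documented_at` mapping are assumed names, and the real check may use a different change signal than a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Dataset:
    name: str
    last_modified: datetime  # assumption: read cheaply from catalog metadata


def needs_regeneration(dataset, documented_at):
    """Cheap check: true if the dataset was never documented or changed since."""
    previous = documented_at.get(dataset.name)
    return previous is None or dataset.last_modified > previous


def plan_run(datasets, documented_at):
    """Return the names of datasets worth sending to the LLM on this run."""
    return [d.name for d in datasets if needs_regeneration(d, documented_at)]


# Hypothetical sample data: one changed, one unchanged, one never documented.
datasets = [
    Dataset("orders", datetime(2024, 5, 2, tzinfo=timezone.utc)),
    Dataset("customers", datetime(2024, 4, 1, tzinfo=timezone.utc)),
    Dataset("refunds", datetime(2024, 5, 3, tzinfo=timezone.utc)),
]
documented_at = {
    "orders": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "customers": datetime(2024, 4, 15, tzinfo=timezone.utc),
}

to_document = plan_run(datasets, documented_at)
print(to_document)
```

Only the datasets returned by `plan_run` reach the expensive generation step, which is why the bill follows the amount of change and not the size of the catalog.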
Scope
An AI tool that documents a data catalog and answers questions about it. Today it runs against a development catalog with sample data.
My role
I designed and built it: the pipeline, the cost model, the observability, and the MCP server.
Architecture
- A daily pipeline documents datasets with an LLM; a cheap freshness check runs before any expensive generation, so cost tracks change rather than table count.
- A strict output contract rejects malformed model output before it can corrupt a page.
- Generated tags are union-merged with manually applied ones, so a run does not drop human tags; state is kept in an Iceberg table for idempotent, run-scoped writes.
- A semantic layer and knowledge graph derived from lineage and query history ground each description.
- Observability: bronze tables roll up into a gold daily summary feeding a Superset dashboard; health checks alert to chat only when a human is needed.
- A read-only MCP server lets engineers query the catalog from their editor.
- A hard cluster guardrail keeps it off production data by default.
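Three of the pieces above (the output contract, the tag merge, and the cluster guardrail) are small enough to sketch together. The following is a hedged illustration only: the field names in `REQUIRED_FIELDS`, the `ContractError` type, and the `"dev"` cluster name are assumptions made for the example, not the project's actual schema or configuration.

```python
import json

# Assumption: the contract is a JSON object with exactly these typed fields.
REQUIRED_FIELDS = {"description": str, "tags": list}


class ContractError(ValueError):
    """Raised when model output does not match the expected shape."""


def parse_model_output(raw):
    """Validate raw model output; raise before anything is written to a page."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("top level must be a JSON object")
    if set(data) != set(REQUIRED_FIELDS):
        raise ContractError(f"unexpected or missing fields: {sorted(data)}")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected):
            raise ContractError(f"{field} must be {expected.__name__}")
    if not data["description"].strip():
        raise ContractError("description is empty")
    if not all(isinstance(tag, str) for tag in data["tags"]):
        raise ContractError("every tag must be a string")
    return data


def merge_tags(manual, generated):
    """Union-merge: generated tags are added, manual tags are never removed."""
    return sorted(set(manual) | set(generated))


def assert_cluster_allowed(cluster, allowed=frozenset({"dev"})):
    """Guardrail: refuse to run unless the cluster is explicitly allowed."""
    if cluster not in allowed:
        raise PermissionError(f"cluster {cluster!r} is not allowed by default")


assert_cluster_allowed("dev")
doc = parse_model_output('{"description": "Orders placed.", "tags": ["sales"]}')
print(merge_tags(["pii", "sales"], doc["tags"]))
```

The guardrail is written as an allowlist with a safe default, so reaching any other cluster requires a deliberate change and not just a missing flag.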
Outcomes
- Documents a catalog on a daily run with cost proportional to change.
- Conversational, read-only catalog access from the editor over MCP.
- The production guardrail was built before there was any clearance to touch real data.
Stack
From this work
A pipeline that documents a data catalog with an LLM sounds like a prompt-engineering problem. Almost none of my time went to prompts. It went to making the boring parts trustworthy.