AI / Platform

AI data steward + MCP server

Documents a data catalog on a daily run and answers catalog questions from an engineer's editor.

Problem

Hundreds of datasets, almost no docs, and the same three questions every week: who owns this, is it fresh, what does this column mean.

What I built

A daily pipeline documents a catalog with an LLM. A cheap freshness check runs before any expensive generation, so cost tracks change rather than table count. Around it sit an observability layer (slow-query detection, permission auditing, and catalog-health checks with chat alerts), pipeline state kept in an Iceberg table, and a conversational MCP server that engineers query from their editor. It also ships a push-button Superset dashboard generator and a knowledge graph built over lineage and query patterns.
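
The freshness gate can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `DatasetSnapshot` shape, the `fingerprint` and `select_stale` helpers, and the choice of a schema-plus-timestamp hash are all assumptions made for the example. The idea is only that a cheap comparison decides which datasets reach the LLM.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSnapshot:
    """Cheap-to-fetch metadata for one dataset (hypothetical shape)."""
    name: str
    columns: tuple  # pairs of (column name, column type)
    last_modified: str  # ISO-8601 timestamp reported by the catalog


def fingerprint(snapshot: DatasetSnapshot) -> str:
    """Hash the metadata that should trigger re-documentation when it changes."""
    payload = json.dumps(
        {"columns": snapshot.columns, "last_modified": snapshot.last_modified},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_stale(snapshots, previous):
    """Return the snapshots whose fingerprint differs from the last recorded run.

    `previous` maps dataset name -> fingerprint stored by an earlier run
    (assumed here to be loaded from the state table beforehand).
    """
    return [s for s in snapshots if previous.get(s.name) != fingerprint(s)]
```

Only the datasets returned by `select_stale` would go on to generation, so an unchanged catalog costs a metadata scan and nothing more.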

Scope

An AI tool that documents a data catalog and answers questions about it. It currently runs against a development catalog with sample data.

My role

I designed and built it: the pipeline, the cost model, the observability, and the MCP server.

Architecture

  • A daily pipeline documents datasets with an LLM; a cheap freshness check runs before any expensive generation, so cost tracks change rather than table count.
  • A strict output contract rejects malformed model output before it can corrupt a page.
  • Generated tags are union-merged with manual ones; state is kept in an Iceberg table for idempotent, run-scoped writes.
  • A semantic layer and knowledge graph derived from lineage and query history ground each description.
  • Observability: bronze tables roll up into a gold daily summary feeding a Superset dashboard; health checks alert to chat only when a human is needed.
  • A read-only MCP server lets engineers query the catalog from their editor.
  • A hard cluster guardrail keeps the pipeline off production data by default.
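
Two of the points above, the strict output contract and the tag union-merge, lend themselves to a short sketch. Everything here is assumed for illustration: the JSON shape with `description` and `tags`, the 500-character limit, and the names `ContractError`, `parse_model_output`, and `merge_tags` are hypothetical, not taken from the project.

```python
import json

# Hypothetical contract: exactly these fields, with these types.
REQUIRED_FIELDS = {"description": str, "tags": list}
MAX_DESCRIPTION_CHARS = 500  # assumed limit, chosen for the example


class ContractError(ValueError):
    """Raised when model output does not satisfy the contract."""


def parse_model_output(raw: str) -> dict:
    """Validate raw model text; raise ContractError instead of writing bad data."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("top-level value must be an object")
    if set(data) != set(REQUIRED_FIELDS):
        raise ContractError(f"unexpected field set: {sorted(data)}")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected):
            raise ContractError(f"{field} must be of type {expected.__name__}")
    description = data["description"].strip()
    if not description or len(description) > MAX_DESCRIPTION_CHARS:
        raise ContractError("description is empty or too long")
    if not all(isinstance(t, str) and t.strip() for t in data["tags"]):
        raise ContractError("tags must be non-empty strings")
    return {
        "description": description,
        "tags": [t.strip().lower() for t in data["tags"]],
    }


def merge_tags(manual, generated):
    """Union-merge: keep every manual tag as written, then append new generated ones."""
    merged = list(manual)
    seen = {t.lower() for t in manual}
    for tag in generated:
        if tag.lower() not in seen:
            merged.append(tag)
            seen.add(tag.lower())
    return merged
```

Rejecting on any deviation, including extra fields, is the conservative choice in this sketch: a skipped dataset can be retried on the next run, while a corrupted page has to be noticed and repaired by a person.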

Outcomes

  • Documents a catalog on a daily run with cost proportional to change.
  • Conversational, read-only catalog access from the editor over MCP.
  • The production guardrail was built before there was any clearance to touch real data.

Stack

Python, LLM, MCP, Apache Iceberg, Superset, Prometheus

From this work

What I learned: I Built an AI Data Steward. The Hard Part Wasn't the AI.

A pipeline that documents a data catalog with an LLM sounds like a prompt-engineering problem. Almost none of my time went to prompts. It went to making the boring parts trustworthy.
