I Built an AI Data Steward. The Hard Part Wasn't the AI.
A pipeline that documents a data catalog with an LLM sounds like a prompt-engineering problem. Almost none of my time went to prompts. It went to making the boring parts trustworthy.
Writing
Write-ups from building data infrastructure. Mostly the bugs that cost me an afternoon, and what I changed so they wouldn't cost the next person one.
We wired group-based admin access through OIDC, granted the right group, and nobody got in. The token was the problem, and no amount of server config could fix it.
An Iceberg REST catalog can hand short-lived storage credentials to query engines. That feature quietly assumes your object store implements one specific AWS STS call. Ours didn't.
A lakehouse is only useful if people can see the data. I run Apache Superset over ours in staging and prod, and I wired up a one-click generator that builds a starter dashboard for any dataset and links it from the catalog.
A platform nobody can find doesn't get used. I built the team's internal portal so people could discover data products, read the docs, and try a query, all behind SSO and shipped on Kubernetes.
Our internal services spoke an RPC dialect and made you sign in three times just to make one call. I designed the data mesh API layer and built MuleSoft proxies that turn all of it into a single authenticated REST request.
One of the first things I built. The company's datasets lived across dozens of scheduled jobs with no current inventory, so I wrote a crawler that finds them all and syncs them into the governance catalog on its own, every night.