Projects

What I've built

Most of my work is data platform infrastructure. Here it is at a level I can share publicly: the problem, what I built, and the stack.

Platform work

01
Flagship platform
Production data lakehouse on Dremio + Kubernetes
A production-critical service at 99.9%+ uptime that a dozen teams across the company query directly.
Problem
The company needed one place to query data spread across on-prem HDFS, object storage, Snowflake, and relational databases, without copying it around first. It had to be reliable enough to run customer-facing reporting on.
What I built
I took Dremio from a single-VM proof of concept to a full platform on Kubernetes across dev, staging, and prod, with a maintained Helm chart, GitLab CI/CD, F5 ingress, RBAC, and secrets that re-sync from a vault on every deploy. I added an Apache Iceberg catalog, a semantic layer that joins 30+ sources across six backend types in a single query, SSO, query-acceleration reflections, and dedicated compute engines. I run the upgrades and the on-call rotation, and I own the HA/DR posture.
Dremio, Apache Iceberg, Kubernetes, Helm, HDFS, Snowflake, Grafana
Read the write-up
02
Streaming
Kafka platform on Confluent for Kubernetes
Self-hosted streaming with end-to-end auth. Topics land in the lakehouse as Iceberg tables.
Problem
Teams needed real-time event streaming and a clean path from Kafka topics into the lakehouse. Self-hosted, secured, observable.
What I built
I built a streaming platform on Confluent for Kubernetes in KRaft mode, so no ZooKeeper. OAuth and RBAC run through an external identity provider. A schema registry sits behind TLS, custom Connect images carry an Iceberg sink and Debezium CDC, and connectors write topics into the Iceberg catalog. Ingress, DNS, metrics, and Grafana dashboards came with it.
Apache Kafka, Confluent, KRaft, OAuth, Debezium, Apache Iceberg
Read the write-up
03
AI / Platform
AI data steward + MCP server
Documents hundreds of datasets on its own and answers catalog questions from an engineer's editor.
Problem
Hundreds of datasets, almost no docs, and the same three questions every week: who owns this, is it fresh, what does this column mean.
What I built
A daily pipeline documents the catalog with an LLM. A cheap freshness check runs before any expensive generation, so cost tracks change. Around it sit an observability layer (slow-query detection, permission auditing, and medallion-health checks with chat alerts), state kept in an Iceberg table, and a conversational MCP server that engineers query from their editor. It also ships a push-button Superset dashboard generator and a knowledge graph over lineage and engine telemetry.
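The freshness gate is the part that keeps cost proportional to change, so here is a minimal sketch of the idea. It is an illustration under assumptions, not the production code: `Dataset`, `snapshot_id`, `datasets_to_document`, and `run_daily` are hypothetical names, and the freshness marker could just as well be a modification time or a row count.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Dataset:
    name: str
    snapshot_id: str  # hypothetical freshness marker, e.g. the latest table snapshot


def datasets_to_document(
    datasets: List[Dataset], last_documented: Dict[str, str]
) -> List[Dataset]:
    """Cheap check: keep only datasets whose marker changed since the last run."""
    return [d for d in datasets if last_documented.get(d.name) != d.snapshot_id]


def run_daily(
    datasets: List[Dataset],
    last_documented: Dict[str, str],
    generate_docs: Callable[[Dataset], None],
) -> Dict[str, str]:
    """Run the expensive step only for changed datasets and return the new state."""
    state = dict(last_documented)
    for dataset in datasets_to_document(datasets, last_documented):
        generate_docs(dataset)  # the expensive LLM call happens only here
        state[dataset.name] = dataset.snapshot_id
    return state
```

Passing `generate_docs` in as a callable keeps the gate testable without an LLM. The returned state is what a later run compares against, so an unchanged catalog costs nothing to re-check.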
Python, LLM, MCP, Apache Iceberg, Superset, Prometheus
Read the write-up
04
BI platform
Self-serve BI on Apache Superset
The BI layer over the lakehouse, plus one-click dashboards generated straight from the data catalog.
Problem
People wanted to explore lakehouse data and build dashboards without filing a ticket or learning the query tools underneath.
What I built
I run Apache Superset in staging and production on Kubernetes: a forked Helm chart, a custom image with the Dremio driver, SSO, and our dashboard assets baked in, and a Helmfile pipeline that ships both environments. It connects straight to the lakehouse for exploratory analysis. On top of it, a one-click generator builds a starter dashboard for any dataset and drops the link into that dataset's catalog page, so finding a dataset and seeing it charted are one hop apart.
Apache Superset, Dremio, Kubernetes, Helm, Helmfile
Read the write-up
05
Full-stack
Internal data platform portal + observability
The platform's front door: discover data products, read the docs, see that it is healthy.
Problem
People needed one place to discover data products, read docs, and check that the platform was up.
What I built
I designed and shipped the team's internal web portal end to end: SSO, a browsable data-product catalog, docs, and a page to query data through Dremio, deployed on Kubernetes with CI and a Docker build. Next to it I stood up self-hosted analytics and Grafana/Prometheus dashboards built from JMX metrics, plus governance work in the data catalog.
Next.js, Kubernetes, Grafana, Prometheus, Collibra
Read the write-up
06
APIs & integration
API gateway + MuleSoft proxies over internal services
One authenticated REST call in place of a three-step sign-in against legacy RPC services.
Problem
Internal backend services spoke an RPC dialect through a bridge, and calling one meant a three-step auth sequence: get an OAuth token, sign in to mint a JWT, then attach that JWT to every downstream call. Every consumer reimplemented it, usually wrong.
What I built
I designed the data mesh API layer and built MuleSoft proxy APIs over three internal services: accounts and auth, the job scheduler, and ad inventory. The proxy does the OAuth exchange and JWT minting server-side, forwards to the right backend, and normalizes the bridge's responses into plain JSON, so a consumer sends one REST request instead of three. Shipped across dev, QA, and prod with GitLab CI/CD and per-environment routing. I also prototyped a unified GraphQL schema for sourcing and joining data products by field, and a proxy over the Snowflake SQL API as a single entry point.
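The shape of the proxy is easier to see in code than in prose. The sketch below is in Python purely for illustration (the real proxies are MuleSoft flows), and every name in it is hypothetical: `make_proxy`, `normalize`, the three injected callables, and the `result`/`error` envelope are assumptions, not the bridge's actual contract.

```python
from typing import Any, Callable, Dict

Json = Dict[str, Any]


def normalize(raw: Json) -> Json:
    """Flatten a hypothetical bridge envelope into plain JSON."""
    if raw.get("error"):
        return {"ok": False, "error": str(raw["error"])}
    return {"ok": True, "data": raw.get("result")}


def make_proxy(
    get_oauth_token: Callable[[], str],
    sign_in: Callable[[str], str],
    call_backend: Callable[[str, str, str, Json], Json],
) -> Callable[[str, str, Json], Json]:
    """Build a handler that hides the three-step auth sequence from the caller."""

    def handle(service: str, method: str, params: Json) -> Json:
        token = get_oauth_token()  # step 1: get an OAuth token
        jwt = sign_in(token)       # step 2: sign in to mint a JWT
        raw = call_backend(service, method, jwt, params)  # step 3: call with the JWT
        return normalize(raw)

    return handle
```

A consumer then makes a single call such as `handle("accounts", "getAccount", {"id": 7})` and never sees the token or the JWT. A real proxy would presumably cache both until they expire; that is left out here to keep the sequence visible.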
MuleSoft, REST, GraphQL, OAuth / OIDC, Apache Thrift, Python
Read the write-up
07
Governance
Nightly data-catalog crawler into Collibra
One of my first projects here. Loads thousands of datasets into the governance catalog on its own, every night.
Problem
The company's datasets were spread across dozens of scheduled jobs with no single, current inventory. Cataloging them by hand did not scale, and whatever you wrote down went stale almost immediately.
What I built
I wrote a crawler that walks the job-scheduler service to find every data stream and the datasets it produces, along with columns, owners, and lineage. A second job reads the latest snapshot with Spark and syncs it to Collibra over the REST API: streams, datasets, columns, and the relationships between them. It runs daily on our scheduler, tags everything it creates, and prunes assets that no longer exist so the catalog matches reality. Over time I cut its API error rate from about 700 to about 30 per run and its runtime from roughly 34 hours to 13.
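The sync step comes down to a three-way diff between the latest snapshot and what the catalog already holds. Here is a minimal sketch of that diff, assuming assets are keyed by a stable identifier. `plan_sync`, `SyncPlan`, `MANAGED_TAG`, and the `attrs`/`tags` fields are hypothetical names, and restricting pruning to tagged assets is my assumption about why the crawler tags what it creates.

```python
from typing import Any, Dict, NamedTuple, Set

MANAGED_TAG = "crawler-managed"  # hypothetical tag marking assets the crawler created


class SyncPlan(NamedTuple):
    create: Set[str]
    update: Set[str]
    prune: Set[str]


def plan_sync(
    snapshot: Dict[str, Dict[str, Any]], catalog: Dict[str, Dict[str, Any]]
) -> SyncPlan:
    """Diff the crawler snapshot against the catalog's current assets."""
    snapshot_ids = set(snapshot)
    catalog_ids = set(catalog)
    # Only assets carrying the tag are eligible for pruning (an assumption).
    managed_ids = {
        key for key, asset in catalog.items() if MANAGED_TAG in asset.get("tags", ())
    }
    create = snapshot_ids - catalog_ids
    update = {
        key
        for key in snapshot_ids & catalog_ids
        if snapshot[key].get("attrs") != catalog[key].get("attrs")
    }
    prune = managed_ids - snapshot_ids
    return SyncPlan(create=create, update=update, prune=prune)
```

Computing the plan before making any API calls keeps the number of writes proportional to what changed. It also means a hand-curated asset is never deleted just because the crawler did not see it.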
Python, Spark, Collibra, HDFS, REST APIs
Read the write-up

Operations

What I run on Kubernetes

The platforms and services I own and operate across two clusters, staging and production.

Dremio
Lakehouse query engine
dev, staging, prod
Iceberg / Open Catalog
Lakehouse catalog
dev, staging, prod
Confluent Kafka
Streaming platform (CFK)
staging
Apache Superset
BI & dashboards
staging, prod
Data steward MCP
AI catalog assistant
dev, staging
Dremio MCP
Editor-native SQL
dev, staging
Confluent MCP
Kafka admin tools
dev, staging
Internal data portal
Discovery & docs
staging, prod

Personal

Outside the day job

Jukeblox

A game first built in 48 hours for the GMTK Game Jam 2023, with a fair amount of polish added since.

Mountain bike edits

A creative outlet. I edit and share helmet-cam recordings of rides around the Bay Area.
