Projects

What I've built

Most of my work is data platform infrastructure. Here it is at a level I can share publicly: the problem, what I built, and the stack.

Platform work

  1. 01

    Flagship platform

    Production data lakehouse on Dremio + Kubernetes

    A production-critical service at 99.9%+ uptime that a dozen teams across the company query directly.

    Problem
    The company needed one place to query data spread across on-prem HDFS, object storage, Snowflake, and relational databases, without copying it around first. It had to be reliable enough to run customer-facing reporting on.
    What I built
    I took Dremio from a single-VM proof of concept to a full platform on Kubernetes across dev, staging, and prod, with a maintained Helm chart, GitLab CI/CD, F5 ingress, RBAC, and secrets that re-sync from a vault on every deploy. I added an Apache Iceberg catalog, a semantic layer that joins 30+ sources across six backend types in a single query, SSO, query-acceleration reflections, and dedicated compute engines. I run the upgrades, the on-call rotation, and the HA/DR posture.
    Dremio · Apache Iceberg · Kubernetes · Helm · HDFS · Snowflake · Grafana
    Read the write-up
  2. 02

    Streaming

    Kafka platform on Confluent for Kubernetes

    Self-hosted streaming with end-to-end auth. Topics land in the lakehouse as Iceberg tables.

    Problem
    Teams needed real-time event streaming and a clean path from Kafka topics into the lakehouse. It had to be self-hosted, secured, and observable.
    What I built
    I built a streaming platform on Confluent for Kubernetes in KRaft mode, so no ZooKeeper. OAuth and RBAC run through an external identity provider. A schema registry sits behind TLS, custom Connect images carry an Iceberg sink and Debezium CDC, and connectors write topics into the Iceberg catalog. Ingress, DNS, metrics, and Grafana dashboards came with it.
    Apache Kafka · Confluent · KRaft · OAuth · Debezium · Apache Iceberg
    Read the write-up
  3. 03

    AI / Platform

    AI data steward + MCP server

    Documents hundreds of datasets on its own and answers catalog questions from an engineer's editor.

    Problem
    Hundreds of datasets, almost no docs, and the same three questions every week: who owns this, is it fresh, what does this column mean.
    What I built
    A daily pipeline documents the catalog with an LLM. A cheap freshness check runs before any expensive generation, so cost tracks how much the data actually changed. Around it sit an observability layer (slow-query detection, permission auditing, and medallion-health checks with chat alerts), state kept in an Iceberg table, and a conversational MCP server that engineers query from their editor. It also ships a push-button Superset dashboard generator and a knowledge graph over lineage and engine telemetry.
    Python · LLM · MCP · Apache Iceberg · Superset · Prometheus
    Read the write-up
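A minimal sketch of the "cheap check before expensive generation" idea from the AI data steward, in Python. It is an illustration under assumptions, not the pipeline itself: `Dataset`, `fingerprint`, `document_changed`, and the `generate` callback are hypothetical names, and the fingerprint stands in for whatever cheap signal is available (for example a table snapshot id or a schema hash).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Dataset:
    name: str
    fingerprint: str  # hypothetical cheap signal, e.g. a snapshot id or schema hash


def document_changed(
    datasets: Iterable[Dataset],
    seen: Dict[str, str],
    generate: Callable[[Dataset], None],
) -> List[str]:
    """Call the expensive `generate` step only for datasets whose
    fingerprint differs from the one recorded in `seen`.

    `seen` is updated in place and the regenerated names are returned.
    """
    regenerated: List[str] = []
    for ds in datasets:
        if seen.get(ds.name) == ds.fingerprint:
            continue  # unchanged since the last run: skip the LLM call
        generate(ds)  # the expensive step (assumed to be an LLM request)
        seen[ds.name] = ds.fingerprint
        regenerated.append(ds.name)
    return regenerated
```

On a day when nothing changed, `generate` is never called, which is why cost follows change rather than catalog size. In this sketch `seen` is a plain dict; the real state could live anywhere durable.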
  4. 04

    BI platform

    Self-serve BI on Apache Superset

    The BI layer over the lakehouse, plus one-click dashboards generated straight from the data catalog.

    Problem
    People wanted to explore lakehouse data and build dashboards without filing a ticket or learning the query tools underneath.
    What I built
    I run Apache Superset in staging and production on Kubernetes: a forked Helm chart, a custom image with the Dremio driver, SSO, and our dashboard assets baked in, and a Helmfile pipeline that ships both environments. It connects straight to the lakehouse for exploratory analysis. On top of it, a one-click generator builds a starter dashboard for any dataset and drops the link into that dataset's catalog page, so finding a dataset and seeing it charted are one hop apart.
    Apache Superset · Dremio · Kubernetes · Helm · Helmfile
    Read the write-up
  5. 05

    Full-stack

    Internal data platform portal + observability

    The platform's front door: discover data products, read the docs, see that it is healthy.

    Problem
    People needed one place to discover data products, read docs, and check that the platform was up.
    What I built
    I designed and shipped the team's internal web portal end to end. It has SSO, a browsable data-product catalog, docs, and a page to query data through Dremio, and it deploys to Kubernetes through CI with a Docker build. Next to it I stood up self-hosted analytics and Grafana/Prometheus dashboards built from JMX metrics, plus governance work in the data catalog.
    Next.js · Kubernetes · Grafana · Prometheus · Collibra
    Read the write-up
  6. 06

    APIs & integration

    API gateway + MuleSoft proxies over internal services

    One authenticated REST call in place of a three-step sign-in against legacy RPC services.

    Problem
    Internal backend services spoke an RPC dialect through a bridge, and calling one meant a three-step auth sequence: get an OAuth token, sign in to mint a JWT, then attach that JWT to every downstream call. Every consumer reimplemented it, usually wrong.
    What I built
    I designed the data mesh API layer and built MuleSoft proxy APIs over three internal services: accounts and auth, the job scheduler, and ad inventory. The proxy does the OAuth exchange and JWT minting server-side, forwards to the right backend, and normalizes the bridge's responses into plain JSON, so a consumer sends one REST request instead of three. Shipped across dev, QA, and prod with GitLab CI/CD and per-environment routing. I also prototyped a unified GraphQL schema for sourcing and joining data products by field, and a proxy over the Snowflake SQL API as a single entry point.
    MuleSoft · REST · GraphQL · OAuth / OIDC · Apache Thrift · Python
    Read the write-up
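The flow the API gateway proxy hides from consumers can be sketched in a few lines of Python. This is not the MuleSoft implementation: `call_backend`, `normalize`, the three injected callables, and the `"result"` envelope are all hypothetical, chosen only to show the order of the steps.

```python
from typing import Any, Callable, Dict

Headers = Dict[str, str]


def normalize(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Unwrap a hypothetical {"result": {...}} envelope into plain JSON."""
    inner = raw.get("result", raw)
    return inner if isinstance(inner, dict) else {"value": inner}


def call_backend(
    path: str,
    payload: Dict[str, Any],
    get_oauth_token: Callable[[], str],
    mint_jwt: Callable[[str], str],
    forward: Callable[[str, Dict[str, Any], Headers], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run the three-step sequence server-side so the caller sends one request."""
    token = get_oauth_token()                    # step 1: get an OAuth token
    jwt = mint_jwt(token)                        # step 2: sign in to mint a JWT
    headers = {"Authorization": f"Bearer {jwt}"}
    raw = forward(path, payload, headers)        # step 3: call the backend with the JWT
    return normalize(raw)
```

Passing the three steps in as callables keeps the sketch free of network code and makes the sequence easy to test with stubs.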
  7. 07

    Governance

    Nightly data-catalog crawler into Collibra

    One of my first projects here. Loads thousands of datasets into the governance catalog on its own, every night.

    Problem
    The company's datasets were spread across dozens of scheduled jobs with no single, current inventory. Cataloging them by hand did not scale, and whatever you wrote down went stale almost immediately.
    What I built
    I wrote a crawler that walks the job-scheduler service to find every data stream and the datasets it produces, along with columns, owners, and lineage. A second job reads the latest snapshot with Spark and syncs it to Collibra over the REST API: streams, datasets, columns, and the relationships between them. It runs daily on our scheduler, tags everything it creates, and prunes assets that no longer exist so the catalog matches reality. Over time I cut its API error rate from about 700 to about 30 per run and its runtime from roughly 34 hours to 13.
    Python · Spark · Collibra · HDFS · REST APIs
    Read the write-up
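The create-and-prune step of the catalog crawler amounts to a set comparison between the latest crawl and what is already in the catalog. Here is a minimal Python sketch of that comparison. `plan_sync` and its inputs are hypothetical, and it assumes the crawler's tag is what limits pruning to assets the crawler itself created, which the description above implies but does not state.

```python
from typing import Dict, List, Set, Tuple


def plan_sync(
    crawled: Set[str],
    catalog: Dict[str, bool],
) -> Tuple[List[str], List[str]]:
    """Compare the latest crawl with the catalog.

    `crawled` holds asset names found by the crawl.
    `catalog` maps existing asset names to True if the crawler tagged them.
    Returns (names to create, names to prune), both sorted.
    """
    existing = set(catalog)
    to_create = sorted(crawled - existing)
    # Only prune assets the crawler created (assumption: identified by its tag),
    # so hand-curated catalog entries are never removed.
    to_prune = sorted(
        name for name, tagged in catalog.items()
        if tagged and name not in crawled
    )
    return to_create, to_prune
```

For example, with `crawled = {"a", "b"}` and `catalog = {"b": True, "c": True, "d": False}`, the plan creates `a`, prunes `c`, and leaves the untagged `d` alone.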

Operations

What I run on Kubernetes

The platforms and services I own and operate across two clusters, staging and production.

  • Dremio

    Lakehouse query engine

    dev · staging · prod
  • Iceberg / Open Catalog

    Lakehouse catalog

    dev · staging · prod
  • Confluent Kafka

    Streaming platform (CFK)

    staging
  • Apache Superset

    BI & dashboards

    staging · prod
  • Data steward MCP

    AI catalog assistant

    dev · staging
  • Dremio MCP

    Editor-native SQL

    dev · staging
  • Confluent MCP

    Kafka admin tools

    dev · staging
  • Internal data portal

    Discovery & docs

    staging · prod

Personal

Outside the day job