Flagship platform
Production data lakehouse on Dremio + Kubernetes
A production-critical service at 99.9% uptime that a dozen teams across the company query directly.
Problem
The company needed one place to query data spread across on-prem HDFS, object storage, Snowflake, and relational databases, without copying it around first. It had to be reliable enough to run customer-facing reporting on.
What I built
I took Dremio from a single-VM proof of concept to a full platform on Kubernetes across dev, staging, and prod. A maintained Helm chart, GitLab CI/CD, F5 ingress, RBAC, and secrets that re-sync from a vault on every deploy. I added an Apache Iceberg catalog, a semantic layer that joins 30+ sources across six backend types in a single query, SSO, query-acceleration reflections, and dedicated compute engines. I run the upgrades, the on-call rotation, and the HA/DR posture.
Scope
The company's production lakehouse: one SQL engine over data spread across on-prem and cloud, queryable without copying it first. Customer-facing reporting runs on it.
My role
I own it end to end: the Kubernetes platform, the catalog, the upgrades, the on-call rotation, and the HA/DR posture.
Architecture
- Dremio on Kubernetes across dev, staging, and prod, deployed from a maintained Helm chart through GitLab CI/CD.
- An Apache Iceberg catalog over object storage, plus federated connections to HDFS, Hive, Snowflake, Postgres, and Mongo.
- A semantic layer that joins 30+ sources across six backend types in a single query.
- F5 ingress, RBAC, SSO, and secrets that re-sync from a vault on every deploy.
- Query-acceleration reflections and dedicated compute engines for predictable performance.
- Prometheus, Thanos, and Grafana for metrics and alerting.
Outcomes
- Runs as a production-critical service at 99.9% uptime.
- A dozen teams query it directly, self-serve.
- Performance tuning took multi-minute queries down to seconds.
- A written HA/DR posture: RTO/RPO targets, coordinator HA, backups, and cross-region storage.