Governance

Nightly data-catalog crawler into Collibra

One of my first projects here. Loads thousands of datasets into the governance catalog on its own, every night.

Problem

The company's datasets were spread across dozens of scheduled jobs with no single, current inventory. Cataloging them by hand did not scale, and whatever you wrote down went stale almost immediately.

What I built

I wrote a crawler that walks the job-scheduler service to find every data stream and the datasets it produces, along with their columns, owners, and lineage. A second job reads the latest snapshot with Spark and syncs it to Collibra over the REST API: streams, datasets, columns, and the relationships between them. The pipeline runs daily on our scheduler, tags everything it creates, and prunes assets that no longer exist so the catalog matches reality. Over time I cut API errors from about 700 to about 30 per run and runtime from roughly 34 hours to 13.
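The tag-and-prune step can be sketched roughly as below. This is a minimal illustration, not the production importer: the tag value, the CatalogAsset type, and the plan_sync function are hypothetical names, and the real job talks to Collibra over REST rather than working on in-memory lists.

```python
from dataclasses import dataclass

# Hypothetical tag value; the real importer's tag is not shown here.
CRAWLER_TAG = "created-by-crawler"


@dataclass(frozen=True)
class CatalogAsset:
    """Illustrative stand-in for an asset already present in the catalog."""
    name: str
    tags: frozenset = frozenset()


def plan_sync(snapshot_names, catalog_assets, tag=CRAWLER_TAG):
    """Return (to_create, to_prune) for one sync run.

    to_create: names in the snapshot that the catalog does not have yet.
    to_prune:  catalog assets that carry the crawler's tag but are missing
               from the snapshot. Untagged assets are never pruned.
    """
    wanted = set(snapshot_names)
    existing = {asset.name for asset in catalog_assets}
    to_create = sorted(wanted - existing)
    to_prune = sorted(
        asset.name
        for asset in catalog_assets
        if tag in asset.tags and asset.name not in wanted
    )
    return to_create, to_prune


catalog = [
    CatalogAsset("orders", frozenset({CRAWLER_TAG})),
    CatalogAsset("old_stream", frozenset({CRAWLER_TAG})),
    CatalogAsset("manual_glossary"),  # added by a person, so left alone
]
to_create, to_prune = plan_sync(["orders", "payments"], catalog)
```

Filtering the prune list by tag is what lets the job delete stale assets without touching anything it did not create.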

Scope

A nightly crawler that keeps the governance catalog current. One of the first things I built here.

My role

I built both halves, the crawler and the Spark importer, and made it reliable enough to run unattended.

Architecture

A crawler walks the job-scheduler service for every data stream and the datasets it produces, with columns, owners, source, and lineage, writing a dated snapshot.
A Spark importer reads the latest snapshot and syncs streams, datasets, columns, and their relationships to the governance catalog over REST.
It tags everything it creates and prunes only its own tagged assets, so the catalog matches reality without touching anything a human added.
A status thread prints counters every few minutes, and a rolling debug buffer is flushed on failure.
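The status thread and rolling debug buffer could look something like the sketch below. The class names, the 300-second interval, and the 1000-line buffer size are assumptions for illustration only; the sketch uses just the Python standard library.

```python
import threading
from collections import deque


class DebugBuffer:
    """Keeps only the most recent debug lines; older ones are dropped."""

    def __init__(self, max_lines=1000):  # assumed size
        self._lines = deque(maxlen=max_lines)
        self._lock = threading.Lock()

    def add(self, line):
        with self._lock:
            self._lines.append(line)

    def flush(self, write=print):
        """Write out and clear the buffered lines; return how many were written."""
        with self._lock:
            lines = list(self._lines)
            self._lines.clear()
        for line in lines:
            write(line)
        return len(lines)


class StatusReporter(threading.Thread):
    """Background thread that periodically writes a one-line counter summary."""

    def __init__(self, counters, interval_s=300.0, write=print):  # assumed interval
        super().__init__(daemon=True)
        self._counters = counters
        self._interval_s = interval_s
        self._write = write
        self._stop_event = threading.Event()

    def format_status(self):
        # Copy first so the worker can keep updating the dict while we format.
        snapshot = dict(self._counters)
        return " ".join(f"{key}={value}" for key, value in sorted(snapshot.items()))

    def run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop_event.wait(self._interval_s):
            self._write(self.format_status())

    def stop(self):
        self._stop_event.set()
```

In a job like this, the main loop would call add() for each debug line, increment the shared counters as it goes, and call flush() from an exception handler, so detailed output appears only when a run fails.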

Outcomes

Loads thousands of assets into the catalog on its own, every night.
Cut API errors from about 700 to about 30 per run.
Cut runtime from roughly 34 hours to about 13.

Stack

Python, Spark, Collibra, HDFS, REST APIs