
Crawling a Data Catalog Into Collibra Every Night

This was one of the first projects I owned. The company had thousands of datasets, produced by dozens of scheduled jobs, and no single place that said what existed right now. There was a governance catalog (Collibra), but keeping it current by hand was hopeless. Someone would document a dataset, the pipeline would change a week later, and the entry was wrong again.

So instead of cataloging by hand, I made the catalog build itself.

Two jobs: crawl, then sync

I split it in two on purpose.

The crawler walks the job-scheduler service and asks it what runs. For each data stream it finds the datasets that stream produces, then pulls the columns, owners, source, and the upstream/downstream dependencies. It writes all of that to a dated snapshot. That part stays close to the source system and doesn't know anything about Collibra.
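A minimal sketch of the dated-snapshot idea in Python, not the production crawler. The record fields mirror what the paragraph above lists; the names `DatasetRecord`, `write_snapshot`, `latest_snapshot`, the JSON-lines format, and the directory layout are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date
from pathlib import Path


@dataclass
class DatasetRecord:
    """One crawled dataset. Field names are hypothetical."""
    stream: str
    name: str
    columns: list = field(default_factory=list)
    owners: list = field(default_factory=list)
    source: str = ""
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)


def write_snapshot(records, root, day):
    """Write one JSON line per dataset to <root>/<YYYY-MM-DD>/datasets.jsonl."""
    out_dir = Path(root) / day.isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "datasets.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(asdict(rec), sort_keys=True) + "\n")
    return path


def latest_snapshot(root):
    """Return the newest dated directory, or None if nothing was crawled yet.

    ISO dates sort correctly as plain strings, so no date parsing is needed.
    """
    days = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return days[-1] if days else None
```

Because the snapshot is just files keyed by date, the importer can always pick the newest one, and an older one can be replayed when something looks off.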

The importer reads the latest snapshot with Spark, builds an in-memory model, and pushes it to Collibra over the REST API: streams, the datasets under them, the columns under those, and the relationships between all three. Keeping the crawl and the upload separate meant I could re-run either one without touching the other, and debug the catalog logic without re-hitting the scheduler.
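The in-memory model can be pictured as three sets of assets plus a set of relationships between them. The sketch below is plain Python rather than Spark, and everything in it is an assumption for illustration: the row shape, the key formats, and the relation names "produces", "contains", and "feeds" are made up, not Collibra's relation types.

```python
def build_model(rows):
    """Turn snapshot rows into streams, datasets, columns, and relations.

    Each row is a dict with "stream", "name", and optional "columns" and
    "upstream" lists (hypothetical shape). Sets make the build idempotent:
    seeing the same row twice changes nothing.
    """
    streams, datasets, columns, relations = set(), set(), set(), set()
    for row in rows:
        stream = row["stream"]
        dataset = f'{stream}/{row["name"]}'
        streams.add(stream)
        datasets.add(dataset)
        relations.add((stream, "produces", dataset))
        for col in row.get("columns", []):
            column = f"{dataset}.{col}"
            columns.add(column)
            relations.add((dataset, "contains", column))
        for up in row.get("upstream", []):
            relations.add((up, "feeds", dataset))
    # Sorted lists give a stable upload order: parents first, relations last.
    return {
        "streams": sorted(streams),
        "datasets": sorted(datasets),
        "columns": sorted(columns),
        "relations": sorted(relations),
    }
```

Uploading in that order (streams, datasets, columns, then relations) means every relationship refers to assets that already exist.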

It runs daily on our job scheduler under a service account, no laptop involved.

The catalog has to match reality

Creating assets is the easy half. The hard half is deletion. If a dataset disappears from the source, a stale entry in the catalog is worse than no entry, because people trust it.

So every asset the script creates gets tagged as ours. At the end of a run, anything carrying that tag that the run didn't touch gets removed. That keeps the catalog honest without ever deleting something a human added by hand, because the script only ever prunes its own tagged assets. I learned that one the careful way.
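The pruning rule boils down to a filtered set difference. A minimal sketch, assuming the catalog's assets are available as a mapping from asset id to its tags; the tag value `managed-by-crawler` and the function name are hypothetical.

```python
MANAGED_TAG = "managed-by-crawler"  # hypothetical tag value


def plan_deletions(catalog_assets, touched_ids, tag=MANAGED_TAG):
    """Return ids that carry our tag but were not touched by this run.

    catalog_assets: dict mapping asset id -> set of tags on that asset.
    touched_ids: set of asset ids this run created or updated.
    Assets without the tag are never returned, whatever their state.
    """
    return sorted(
        asset_id
        for asset_id, tags in catalog_assets.items()
        if tag in tags and asset_id not in touched_ids
    )
```

For example, with one fresh managed asset, one stale managed asset, and one hand-made asset, only the stale managed one is selected:

```python
catalog = {
    "orders": {MANAGED_TAG},
    "old_orders": {MANAGED_TAG},
    "hand_written_glossary": {"curated"},
}
plan_deletions(catalog, touched_ids={"orders"})  # ["old_orders"]
```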

Making it boring and fast

The first version worked and was miserable. It threw around 700 API errors per run and took about 34 hours, which is longer than the daily schedule it was supposed to keep.

Most of the errors came from sloppy string handling in the search calls and from treating transient failures as fatal. I tightened the request building, added retries with backoff on the calls that flake, and stopped trying to create relationships the API won't accept in bulk. The error count dropped from around 700 to about 30 per run. The runtime came down from about 34 hours to roughly 13, thanks to an in-memory cache for repeated lookups, batched deletes, and reusing objects the run had already fetched instead of requesting them again.
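The three techniques are small enough to sketch generically. This is an illustration under assumptions, not the script itself: `TransientError` stands in for whatever failures are worth retrying (timeouts, 5xx responses), and the defaults for attempts, delays, and batch size are made up.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for failures worth retrying, such as timeouts."""


def with_retries(call, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff.

    Any other exception propagates immediately; the last transient
    failure is re-raised once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


class LookupCache:
    """Remember results of repeated lookups and count hits and misses."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._fetch(key)
        self._store[key] = value
        return value


def chunked(items, size):
    """Yield consecutive slices of at most `size` items, for batched deletes."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Passing `sleep` as a parameter is only there so the backoff can be tested without actually waiting.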

I also added a status thread that prints API-call, error, and cache-hit counters every few minutes, plus a rolling debug buffer that flushes on failure. When a daily job runs unattended, you want it to tell you how it's doing before it tells you it died.
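A status reporter along those lines can be sketched with the standard library alone. The class and function names, the counter names, and the buffer size here are assumptions; the real script's output format isn't shown in this post.

```python
import threading
from collections import Counter, deque


class RunStatus:
    """Thread-safe counters plus a bounded buffer of recent debug lines."""

    def __init__(self, buffer_size=200):
        self._lock = threading.Lock()
        self._counts = Counter()
        # deque(maxlen=...) silently drops the oldest line when full.
        self._debug = deque(maxlen=buffer_size)

    def incr(self, name, n=1):
        with self._lock:
            self._counts[name] += n

    def debug(self, message):
        with self._lock:
            self._debug.append(message)

    def summary(self):
        with self._lock:
            return " ".join(f"{k}={v}" for k, v in sorted(self._counts.items()))

    def flush_debug(self):
        """Return the buffered lines and clear the buffer (call on failure)."""
        with self._lock:
            lines = list(self._debug)
            self._debug.clear()
        return lines


def start_reporter(status, interval_s, stop, emit=print):
    """Emit a summary every interval_s seconds until the stop event is set."""
    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            emit(status.summary())

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The main job would call `status.incr("api_calls")` and friends as it works, and in its failure handler print whatever `flush_debug()` returns, so the last few hundred debug lines show up only when they are needed.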

What it taught me

This is the project that taught me the platform: how the streams worked, where the data lived, how governance actually gets used. It's not glamorous. It's a cron job that reconciles two systems. But it turned a catalog nobody trusted into one that was right every morning, and most of the engineering was in the unglamorous parts: idempotency, careful deletes, and failing loud.