Datanommer: Daily exports

Requesting infra feedback here on a change I’m working on locally, before I go implement it in a pull request.

Current state

We perform a full database dump of datanommer every night and publish it. This is incredibly useful to Data WG, as we can spin up a replica and hammer it with obscenely unoptimized queries.

Issue

Our replica is stuck at the point in time of the last download. When we inevitably want newer data days/weeks later, we have to again download the entire database.

  • This incurs unnecessary bandwidth load on Fedora Infra.
    • That load is going to increase if I build a process on my end to pull the entire database nightly.
  • It takes ~15 hours on my (aging) hardware to restore the entire database from scratch when all I really need is to “top off” the replica.
  • “All or nothing” makes it impossible for anyone with <1.5TB of free disk space (and growing!) to participate in analytics with their own replica.

Solution

Alongside the nightly full database export, let’s export data from “yesterday” with a date in the filename. This would allow any replicas to catch up.

  • We can tack this on to the end of the current cronjob.
  • Compared to the full export this will be a tiny dataset, so it should incur minimal production load.
    • Edit: Confirmed via Zabbix on a manual run: the load is completely invisible in CPU and disk I/O. It’s noticeable in network out (since there’s very little out from this box typically), but the whole thing completes in < 1 minute.
  • Let’s publish this at the current location, perhaps under a datanommer-incremental directory.
  • We’ll also implement a retention scheme, so that the nightly job auto-deletes data older than X days. An average day is around 50MB compressed, so 15 days would use ~750MB of storage.
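The retention pass could be appended to the same cronjob. A minimal sketch, assuming the incremental dumps live under /srv/dumps/datanommer-incremental (the path, filename scheme, and 15-day window are all assumptions, not settled details):

```shell
# Delete incremental dumps older than the retention window.
RETENTION_DAYS=15
find /srv/dumps/datanommer-incremental \
     -name 'datanommer-incremental-*.csv.zst' \
     -mtime +"${RETENTION_DAYS}" -delete
```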

Implementation

Since pg_dump can’t dump time ranges, we will use a bash script and psql’s \copy command.

Status

I have a working POC on my infra. I’m still running a few more tests to settle some details, but I’ll be ready to upstream it within another day or so.

This will be a single bash script. Since I don’t know Fedora Infra deeply, I presume a few minor tweaks will be necessary, but it should otherwise be simple to merge.

Ops Burden

I don’t expect this to incur any additional ops work aside from the initial merge.

Worst case, if the prod datanommer somehow fell behind but then caught up, we might have produced a “yesterday” backup that was incomplete. We could then recover by re-running the bash script by hand. I’ll add flags so that you can manually choose the date to export.

Alternatives

There are lots of ways to keep replicas updated, but this approach requires the fewest changes.

Feedback?


Forgive my lack of understanding of Fedora infra, but would it be better to have the thing published daily be the same as the “delta” datasets?

I.e., for an initial setup you’d do the same “download everything” process, and then the nightly update can involve just ingesting that day’s new events.

Then people wanting to do analytics could maybe use a daily delta dump as a way to develop queries.

I guess the issue is the technical feasibility of separating this data out. I think it could maybe outweigh the enormous cost of storing two days of daily 1.5TB copies in addition to the live database, but I’m not sure how much work that would be for the infra folks.

That’s exactly what I’m proposing. I’m not sure where we’re miscommunicating, but today there is no publicly-available delta.
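For illustration, replica-side catch-up with published deltas could be a short loop. This is a sketch under assumptions: the filename scheme, the state-tracking approach, and the `messages` table are all hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: apply every published delta newer than the last one ingested.
set -euo pipefail
shopt -s nullglob   # skip the loop entirely when no deltas are present

LAST_APPLIED="2024-06-01"   # example value; would be read from a state file

for f in datanommer-incremental-*.csv.zst; do
    # Extract the YYYY-MM-DD date embedded in the filename.
    d="${f#datanommer-incremental-}"
    d="${d%.csv.zst}"
    [[ "${d}" > "${LAST_APPLIED}" ]] || continue
    zstd -dk "${f}"   # decompress, keeping the .zst around
    psql -d datanommer -c "\copy messages FROM '${f%.zst}' WITH CSV HEADER"
done
```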

Oh, I’m probably just less plugged in is all.

Are the full db dumps public? If not, I would imagine there may be data privacy hurdles to jump.

Per OP…

This links directly to the dump, which is 88 gigs compressed, but (as mentioned in the infra docs) is extremely large uncompressed.

Yeah, zstd lzma is amazing. Uncompressed, the dump is ~1.5TB, which matches roughly the size of the database after it’s been ingested and indexed.

Notable though is that even at 88GB this file takes 2 hours to download from a dedicated server in the EU :disappointed_face: . Plus (I was wrong previously) around 15 hours to restore. So 17 hours of processing every 24 hours …

I heard various conversations over in bootc land about zstd:chunked a while ago.

Would that be useful? Publish the full dump as a series/collection of chunks?

I think this sounds perfectly reasonable. But yeah, we should try and publish whatever things are most useful to you all as @moralcode asked.