Requesting infra feedback here on a change I’m working on locally, before I go implement it in a pull request.
Current state
We perform a full database dump of datanommer every night and publish it. This is incredibly useful to Data WG, as we can spin up a replica and hammer it with obscenely unoptimized queries.
Issue
Our replica is stuck at the point in time of the last download. When we inevitably want newer data days/weeks later, we have to again download the entire database.
- This incurs unnecessary bandwidth load on Fedora Infra.
- That load is going to increase if I build a process on my end to pull the entire database nightly.
- It takes ~15 hours on my (aging) hardware to restore the entire database from scratch when all I really need is to “top off” the replica.
- “All or nothing” makes it impossible for anyone with <1.5TB of free disk space (and growing!) to participate in analytics with their own replica.
Solution
Alongside the nightly full database export, let’s export data from “yesterday” with a date in the filename. This would allow any replicas to catch up.
- We can tack this on to the end of the current cronjob.
- Compared to the full export this will be a tiny dataset, so it should incur minimal production load.
- Edit: Confirmed via Zabbix on a manual run: the load is completely invisible in CPU and disk I/O. It’s noticeable in network out (since there’s very little out from this box typically), but the whole thing completes in < 1 minute.
- Let’s publish this at the current location, perhaps under a
datanommer-incrementaldirectory. - We’ll also implement a retention scheme, so that the nightly job auto-deletes data older than X days. An average day is around 50MB compressed, so 15 days would use ~750MB of storage.
Implementation
Since pg_dump can’t dump time ranges, we will use a bash script and psql \COPY.
Status
I have a working POC on my infra. I’m still running a few more tests to settle some details, but I’ll be ready to upstream it within another day or so.
This will be a single bash script. Since I don’t know Fedora Infra deeply, I presume there will be a few minor tweaks necessary but should otherwise be simple to merge.
Ops Burden
I don’t expect this to incur any additional ops work aside from the initial merge.
Worst case, if the prod datanommer somehow fell behind but then caught up, we might have produced a “yesterday” backup which was incomplete. We could then re-run this by running the bash script by hand. I’ll add flags so that you can manually choose the date to export.
Alternatives
There are lots of ways to keep replicas updated, but this approach requires the fewest changes.