Running 100,000 API calls a day on a budget

The brief was simple to say and annoying to do: keep an eye on a set of upstream APIs, all day, every day, and keep the history so we could spot trends. “All day, every day” turned out to mean 100,000+ requests a day, and the history part meant roughly half a million time-series rows a day landing in the database.

Here’s what actually mattered when I built it.

The pipeline is mostly about failure

When you make a handful of API calls, the happy path is the whole story. When you make a hundred thousand, the happy path is a rounding error. Upstreams time out, rate-limit you, return half a response, or quietly start handing back stale data. The pipeline’s real job is not “fetch”, it’s “fetch, and keep going when fetching breaks.”

The pieces that earned their keep:

Proxy rotation. Spreading requests across a pool kept any single IP from getting throttled into uselessness.
Circuit breakers. When an upstream started failing, I stopped hammering it for a cooldown window instead of retrying into the void. This is the single biggest reliability win for the least code.
Retries with backoff. Bounded rather than infinite, and with jitter, so a recovering upstream doesn’t get hit by a thundering herd the moment it comes back.
Dynamic concurrency. Concurrency went up when things were healthy and throttled itself down when error rates climbed. A fixed worker count is a footgun at this scale.
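
As a rough illustration of three of those pieces (not the production code), here is a minimal Python sketch: a consecutive-failure circuit breaker with a cooldown, bounded retries with "full jitter" exponential backoff, and an additive-increase, multiplicative-decrease concurrency limit. Every name, threshold, and default below (CircuitBreaker, backoff_delay, AdaptiveLimit, call_with_retries, 5 failures, 60 seconds, a 5% error rate) is an assumption made for the example, and AIMD is just one reasonable way to get concurrency that backs off under pressure.

```python
import random
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then blocks calls
    until `cooldown` seconds have passed. All defaults are illustrative."""

    def __init__(self, max_failures=5, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Open: let a trial call through only once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # start (or restart) the cooldown


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, breaker, max_attempts=4, sleep=time.sleep):
    """Call fn() up to `max_attempts` times, respecting the breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open, skipping call")
        try:
            result = fn()
        except Exception:
            breaker.record(False)
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the last attempt
            sleep(backoff_delay(attempt))
        else:
            breaker.record(True)
            return result


class AdaptiveLimit:
    """AIMD concurrency limit: +1 after a healthy window, halve after a bad one."""

    def __init__(self, start=8, lo=1, hi=64, max_error_rate=0.05):
        self.limit, self.lo, self.hi = start, lo, hi
        self.max_error_rate = max_error_rate

    def update(self, errors, total):
        if total == 0:
            return self.limit  # no traffic in this window, nothing to learn
        if errors / total > self.max_error_rate:
            self.limit = max(self.lo, self.limit // 2)
        else:
            self.limit = min(self.hi, self.limit + 1)
        return self.limit
```

In a setup like this there would be one breaker per upstream, and the worker pool would be resized to whatever AdaptiveLimit.update returns at the end of each measurement window. The clock, sleep, and rng parameters are there only so the logic can be exercised without real waiting.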

Two databases, two jobs

I split storage on purpose:

MongoDB for the structured, current-state documents: schema-less, and forgiving of the messy semi-structured payloads upstreams actually return.
TimescaleDB for the time series: 500k+ rows a day, queried by time window. A plain Postgres table would have been fine for a week and miserable by month three. Timescale’s hypertables, which partition the table into time-based chunks, kept the “show me the last 30 days” queries fast.
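
To make the hypertable point concrete, here is a hedged sketch of what the schema and a "last 30 days" query could look like, held as SQL strings in Python. The table and column names (api_metrics, endpoint, latency_ms, status) and the helper recent_latency_query are hypothetical; create_hypertable and time_bucket are TimescaleDB functions, and the %s placeholders assume a psycopg-style driver.

```python
# Hypothetical schema: table and column names are illustrative only.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS api_metrics (
    time        TIMESTAMPTZ      NOT NULL,
    endpoint    TEXT             NOT NULL,
    latency_ms  DOUBLE PRECISION,
    status      INTEGER
);
-- Turn the plain table into a hypertable partitioned on the time column.
SELECT create_hypertable('api_metrics', 'time', if_not_exists => TRUE);
"""


def recent_latency_query(days=30, bucket="1 hour"):
    """Return (sql, params) for average latency per endpoint per time bucket."""
    sql = """
        SELECT time_bucket(%s::interval, time) AS bucket,
               endpoint,
               avg(latency_ms) AS avg_latency_ms
        FROM api_metrics
        WHERE time > now() - %s * interval '1 day'
        GROUP BY bucket, endpoint
        ORDER BY bucket
    """
    return sql, (bucket, days)
```

Because the WHERE clause filters on the partitioning column, a query like this only needs to touch the chunks that cover the requested window, which is what keeps it from slowing down as the history grows.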

Cron jobs scheduled the writes; a containerised Linux box ran the whole thing.

Watching it without watching it

The point of all this data was to see something, so the last mile was Grafana on top of TimescaleDB, with PostgreSQL queries driving the panels. Once trends were a dashboard instead of a database query I had to remember to run, the system started earning its keep.

What I’d tell past me

Build the failure handling first, not last. The fetch loop took an afternoon. The circuit breakers, the backoff, the concurrency that backs off under pressure: that’s the part that let it run unattended for months. “Scalable” mostly means “degrades gracefully,” and graceful degradation is something you design in, not bolt on.