We shipped a 4-source outbreak ETL. Two days later, we kept one.

On Thursday afternoon we merged a Node.js ETL that pulled disease-outbreak data from PAHO, CDC, ECDC, and WHO, cross-merged the four feeds against a country-routing table, and uploaded the result to Supabase Storage for the Flutter app to consume. By Saturday morning we had deleted three of the four sources and, net of what we added, 2,383 lines of code along with them.
This is what we learned about scraping public-health dashboards, and why we should have started where we finished.
The problem
Tempy needed an outbreak feed. The plan: ingest from the authoritative regional agencies, normalize into one schema, ship to clients. We wrote the routing table the way a public-health PM would:
// lib/features/outbreak/region_source_routing.dart (now deleted)
static const Set<String> _pahoCountries = {
  'BR', 'MX', 'AR', 'CL', 'CO', 'PE', 'VE', 'EC', 'BO',
  'UY', 'PY', 'GY', 'SR', 'GF',
};

static const Set<String> _ecdcCountries = {
  'DE', 'FR', 'ES', 'IT', 'NL', 'BE', 'AT', 'PT', 'SE',
  'NO', 'DK', 'FI', 'IE', 'PL', 'CZ', 'HU', 'GR', 'RO', 'BG',
};

static const Set<String> _cdcCountries = {'US', 'CA'};

static OutbreakSource primarySourceFor(String countryCode) {
  final code = countryCode.toUpperCase();
  if (_pahoCountries.contains(code)) return OutbreakSource.paho;
  if (_ecdcCountries.contains(code)) return OutbreakSource.ecdc;
  if (_cdcCountries.contains(code)) return OutbreakSource.cdc;
  return OutbreakSource.who;
}
LATAM → PAHO. EU → ECDC. US/CA → CDC. Everywhere else → WHO. Authoritative regional data wherever it exists; WHO as the global fallback. Clean. Sensible. Wrong.
Why it happened
The original design assumed dashboard endpoints were stable APIs. They aren't. Within three days of first release, every "primary" source's endpoint had broken in a different way:
- PAHO (www3.paho.org): upstream 502 Bad Gateway across the entire Joomla data portal we scraped. Their site goes intermittently dark and there's no code fix from our side.
- CDC FluView (gis.cdc.gov/grasp/flu2): silently deprecated. The POST endpoint still returns HTTP 200 with a valid empty ZIP for every season ID we try. The data lives on newer dashboards CDC hasn't documented yet.
- ECDC CDTR (www.ecdc.europa.eu): WAF now blocks non-browser traffic. User-Agent spoofing doesn't bypass it; we'd need a headless-browser farm to keep up.
Each failure had a different shape — 502 we'd notice, empty 200 OK we wouldn't, 403 WAF we'd misread as transient. The cron ran on schedule. Supabase Storage got new files every Saturday morning. The pipeline reported green. It was shipping zero rows.
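One way to make those three shapes visible is to classify every fetch before anything gets merged. The sketch below is not the code we shipped; the `SourceResponse` fields and the outcome names are hypothetical, and it assumes the caller has already parsed the body far enough to count rows.

```typescript
// Hypothetical outcome labels for a single source fetch.
type FetchOutcome = "ok" | "unavailable" | "blocked" | "empty_success";

interface SourceResponse {
  status: number;       // HTTP status code
  contentType: string;  // Content-Type header value, "" if absent
  expectedType: string; // MIME prefix we expect, e.g. "application/zip"
  rowCount: number;     // rows parsed from the body (0 if nothing parsed)
}

function classifyOutcome(r: SourceResponse): FetchOutcome {
  // 401/403 usually means a WAF or auth wall, not a transient error.
  if (r.status === 401 || r.status === 403) return "blocked";
  // Anything else outside 2xx (502, 404, ...) means the source is unavailable.
  if (r.status < 200 || r.status >= 300) return "unavailable";
  // A block page can arrive as 200 with HTML where data was expected.
  if (r.contentType.indexOf(r.expectedType) !== 0) return "blocked";
  // The dangerous case: a well-formed success that carries no rows.
  return r.rowCount > 0 ? "ok" : "empty_success";
}
```

Under this scheme only "ok" counts as a successful run; "empty_success" is the case a status-code check alone never catches.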
What we tried first
The instinct was to patch the scrapers — longer retries for PAHO, fresh User-Agents for CDC, a Playwright runner for ECDC's WAF. Then we costed it: a self-hosted runner and three brittle scrapers, maintenance forever, all for a feed parents would look at twice a year. Then someone said the thing nobody had said in the design review: WHO publishes an OData API.
The fix
WHO's Disease Outbreak News endpoint is an officially documented OData v4 REST API. No auth, no rate-limit games, no HTML to parse. Every DON since 2004 lives there, with stable fields:
// tools/outbreak-etl/src/sources/who/download.ts (new)
const params = new URLSearchParams({
  $filter: `PublicationDate ge ${sinceIso}`,
  $orderby: "PublicationDate desc",
  $select:
    "Id,DonId,Title,PublicationDate,PublicationDateAndTime," +
    "UrlName,ItemDefaultUrl,Overview,Summary,Epidemiology,Assessment,Advice",
});
A server-side $filter on PublicationDate cuts the payload from ~270KB (all 600 DONs ever published) to under 30KB for the last 90 days. The schema doesn't drift — adding a column doesn't 404 the endpoint. And critically, one source covers every country. National agencies often publish after WHO, not before; if it's serious enough to surface in a DON, it's already authoritative.
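For readers who want to see where the 90-day window comes from, here is a minimal sketch of deriving `sinceIso` and assembling the request URL. The function name `buildDonUrl`, the `windowDays` parameter, and the placeholder base URL are assumptions for illustration; substitute the endpoint from WHO's documentation. It also assumes the server accepts an ISO-8601 UTC timestamp as the comparison literal.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Builds a query URL that asks the server for DONs published in the last
// `windowDays` days, newest first. `baseUrl` is supplied by the caller.
function buildDonUrl(baseUrl: string, now: Date, windowDays: number): string {
  const sinceIso = new Date(now.getTime() - windowDays * DAY_MS).toISOString();
  const params = new URLSearchParams({
    $filter: `PublicationDate ge ${sinceIso}`,
    $orderby: "PublicationDate desc",
  });
  return `${baseUrl}?${params.toString()}`;
}

// Usage with a placeholder host (hypothetical, not the real endpoint):
const exampleUrl = buildDonUrl(
  "https://example.invalid/diseaseoutbreaknews",
  new Date(Date.UTC(2024, 3, 1)), // 2024-04-01T00:00:00Z
  90,
);
```

Passing `now` in as a parameter keeps the function deterministic, which makes the window arithmetic easy to test.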
The cleanup was ruthless. The collapse refactor deleted, in a single commit: the three primary-source crawlers (sources/paho/, sources/cdc/, sources/ecdc/) and their tests; lib/merge.ts (183 lines of cross-source conflict resolution that only mattered with more than one source); lib/region_routing.ts and lib/severity_thresholds.ts; the Flutter routing table from the top of this post; and the csv-parse and pdf-parse dependencies that were only used by the deleted crawlers.
In their place we added one folder — sources/who/{download,normalize,country_lookup,index}.ts — that does the same job for every country at once. The OutbreakSource enum dropped from four variants to one:
enum OutbreakSource {
  who;

  /// Tolerant parser: any legacy `paho`/`ecdc`/`cdc` strings written
  /// by the previous ETL deserialize as `who`. Those buckets are no
  /// longer populated, but cached payloads on user devices may still
  /// contain them — silently re-tagging avoids a deserialize crash
  /// on the first launch after upgrade.
  static OutbreakSource fromJson(String value) {
    return OutbreakSource.who;
  }
}
That tolerant parser is the one piece of complexity we kept. Users who'd opened the app during the broken-pipeline days had cached payloads tagged source: paho on their devices. We weren't going to crash anyone's first launch after the upgrade just to be honest about the schema.
Before and after
| | 4-source (Thu) | WHO-only (Sat) |
|---|---|---|
| Source files (ETL) | paho/, cdc/, ecdc/, plus merge.ts, region_routing.ts, severity_thresholds.ts | who/ only |
| Net delta of the Saturday refactor | n/a | +1,005 / −3,388 (−2,383 lines) |
| Production runs that produced data | 0 | first smoke test: 8 DONs → 9 country rows across 5 country files in under 2 seconds |
| Test suite | 93 (37 specific to dead scrapers) | 44 (all outbreak-relevant) |
| Client routing | per-country primarySourceFor + primary → WHO fallback | fetch who/{COUNTRY}.json and who/_GLOBAL.json in parallel, merge |
The client-side change is the part we like best. Instead of "primary first, WHO if primary is empty," the Flutter repository fetches the country file and a _GLOBAL sentinel in parallel. Multi-country DONs (a mpox global situation, a hantavirus cruise outbreak that spans flags) land in _GLOBAL.json and reach every user regardless of their selected country. No routing, no fallback chain, just a merge.
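The real client is Flutter, so the following is only a TypeScript sketch of the same shape, not the Dart we shipped. The `OutbreakRow` fields, the injected `fetchJson` function, and the de-duplication by `donId` are assumptions; in particular it assumes `fetchJson` resolves to an empty list when a country file does not exist.

```typescript
interface OutbreakRow {
  donId: string;
  title: string;
}

// Injected so the sketch stays independent of any HTTP or storage client.
type FetchJson = (path: string) => Promise<OutbreakRow[]>;

// Country rows come first; a DON present in both files is kept once.
function mergeByDonId(country: OutbreakRow[], global: OutbreakRow[]): OutbreakRow[] {
  const seen = new Map<string, OutbreakRow>();
  for (const row of country.concat(global)) {
    if (!seen.has(row.donId)) seen.set(row.donId, row);
  }
  return Array.from(seen.values());
}

async function loadOutbreaks(countryCode: string, fetchJson: FetchJson): Promise<OutbreakRow[]> {
  // Both requests start immediately; neither waits on the other.
  const [country, global] = await Promise.all([
    fetchJson(`who/${countryCode.toUpperCase()}.json`),
    fetchJson("who/_GLOBAL.json"),
  ]);
  return mergeByDonId(country, global);
}
```

Because `Promise.all` rejects if either request fails, a production version would probably want to decide whether a missing `_GLOBAL` file should fail the whole load or degrade to country-only data.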
What we learned
Graceful fallback is camouflage when the foreground is broken. Our pipeline reported green for two days while shipping zero rows. Every scraper "succeeded" — PAHO returned an empty array (because 502), CDC returned an empty ZIP, ECDC returned a WAF-blocked HTML page that our PDF parser politely failed to extract from. The orchestrator merged four empty lists and called it a successful run. The fix wasn't more retry; it was a smoke check that asserts non-zero output for any source we expect data from.
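A smoke check of that kind can be very small. This is a sketch rather than our actual implementation; `SourceResult`, `findEmptySources`, and `assertNonEmpty` are hypothetical names, and it treats a source that is missing from the results the same as one that returned zero rows.

```typescript
interface SourceResult {
  source: string;
  rows: unknown[];
}

// Returns the expected sources that produced no rows at all.
function findEmptySources(results: SourceResult[], expected: string[]): string[] {
  const counts = new Map<string, number>();
  for (const r of results) {
    counts.set(r.source, (counts.get(r.source) ?? 0) + r.rows.length);
  }
  return expected.filter((s) => (counts.get(s) ?? 0) === 0);
}

// Fails the run loudly instead of uploading an empty merge.
function assertNonEmpty(results: SourceResult[], expected: string[]): void {
  const empty = findEmptySources(results, expected);
  if (empty.length > 0) {
    throw new Error(`Smoke check failed: zero rows from ${empty.join(", ")}`);
  }
}
```

Run before the upload step, a check like this would have turned the four-empty-lists merge into a red run on the first day.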
Dashboard URLs are not APIs, even when they return JSON. PAHO's Joomla portal, CDC's gis.cdc.gov ArcGIS proxy, ECDC's weekly PDF — none of these are contracts we can rely on. The OData endpoint at WHO is an API, with a spec, that we can plan around.
The cross-source machinery had no customer. mergeConflicting() was 183 lines defending an invariant ("regional sources beat global, even when louder") against a scenario that never happened. We wrote it because the four-source design implied it. With one source, every line became dead weight. The sharper version of YAGNI: every architectural component should have a named failure mode it prevents. If we'd asked "what does merge.ts do when PAHO and ECDC overlap on a country?" we'd have realized the answer was "we don't have any country in both sets."
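The overlap question is cheap to ask in code. As an illustration, this sketch copies the country sets from the routing table at the top of the post and intersects them pairwise; the `intersection` helper and the constant names are hypothetical.

```typescript
const PAHO_COUNTRIES = new Set([
  "BR", "MX", "AR", "CL", "CO", "PE", "VE", "EC", "BO",
  "UY", "PY", "GY", "SR", "GF",
]);
const ECDC_COUNTRIES = new Set([
  "DE", "FR", "ES", "IT", "NL", "BE", "AT", "PT", "SE",
  "NO", "DK", "FI", "IE", "PL", "CZ", "HU", "GR", "RO", "BG",
]);
const CDC_COUNTRIES = new Set(["US", "CA"]);

// Country codes present in both sets, sorted for stable output.
function intersection(a: Set<string>, b: Set<string>): string[] {
  return Array.from(a).filter((code) => b.has(code)).sort();
}

const overlaps = {
  pahoEcdc: intersection(PAHO_COUNTRIES, ECDC_COUNTRIES),
  pahoCdc: intersection(PAHO_COUNTRIES, CDC_COUNTRIES),
  ecdcCdc: intersection(ECDC_COUNTRIES, CDC_COUNTRIES),
};
```

All three lists come back empty for these sets, which means no country ever had two regional primaries to reconcile.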
What's next
The WHO normalizer recognizes about 50 country names from DON titles via a hand-maintained lookup; misses route to _GLOBAL — degraded precision, no data loss. We'll grow it as new countries appear. Severity inference still defaults conservative — we'd rather under-call than over-call to a parent at 3 AM. And we're keeping a quiet eye on the dashboards we left behind: if PAHO comes back, if CDC publishes a real API, we'd add them — but as enrichment, not as primaries. WHO is the spine now.
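The lookup-with-fallback behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not the contents of country_lookup.ts: the three-entry table, the `routeDon` name, and the plain substring matching are all hypothetical.

```typescript
const GLOBAL_BUCKET = "_GLOBAL";

// Tiny illustrative table; the real lookup is hand-maintained and larger.
const COUNTRY_LOOKUP: Record<string, string> = {
  brazil: "BR",
  mozambique: "MZ",
  rwanda: "RW",
};

// Returns every country code whose name appears in the title,
// or the global bucket when nothing matches.
function routeDon(title: string, lookup: Record<string, string> = COUNTRY_LOOKUP): string[] {
  const haystack = title.toLowerCase();
  const codes = Object.keys(lookup)
    .filter((name) => haystack.indexOf(name) !== -1)
    .map((name) => lookup[name]);
  const unique = Array.from(new Set(codes));
  return unique.length > 0 ? unique : [GLOBAL_BUCKET];
}
```

Substring matching is the simplest thing that works for a sketch, but it would confuse names that contain one another (Niger inside Nigeria, for example), so a real lookup would likely match on word boundaries.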
Try Tempy
Tempy is a calm, offline-first fever log for parents — built so it survives 3 AM.
Frequently Asked Questions
Why did the Tempy team switch from a 4-source ETL to a single WHO source?
The original 4-source ETL faced frequent failures due to unstable or deprecated endpoints from PAHO, CDC, and ECDC. WHO's Disease Outbreak News API provided a stable, documented OData v4 REST API with consistent schema and coverage for all countries, simplifying maintenance and improving reliability.
What challenges did the team encounter when scraping data from PAHO, CDC, and ECDC?
PAHO's site returned intermittent 502 errors, CDC silently deprecated its FluView endpoint (which kept answering HTTP 200 with empty data), and ECDC's Web Application Firewall blocked non-browser traffic. Some of these failures were silent, so the pipeline reported successful runs while shipping zero rows.
How does the WHO Disease Outbreak News API improve data ingestion for Tempy?
WHO's API offers a stable, authentication-free OData v4 REST interface with server-side filtering and a consistent schema that doesn't break when fields are added. It covers all countries in a single source, reducing complexity and ensuring authoritative, timely outbreak data.
What changes were made to the client-side data fetching after switching to the WHO-only source?
The client now fetches country-specific outbreak files and a global sentinel file in parallel, merging them without complex routing or fallback logic. This approach ensures multi-country outbreaks are delivered to all users and simplifies the data consumption model.
What lessons did the Tempy team learn about building reliable ETL pipelines from this experience?
They learned that dashboard URLs are not reliable APIs, and graceful fallback can mask silent failures. It's crucial to have smoke checks that verify non-empty outputs and to avoid unnecessary architectural complexity without clear failure modes or customer needs.