dominicq 4 days ago

Can you say more about the use case? What problem were you solving? How did it work exactly? Sounds interesting so I'd like to learn more.

tzury 4 days ago

Sure.

We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.

Traffic profile

  - Baseline: ≈ 15 B requests/day
  - Under attack: the same 15 B can arrive in 2-3 hours

Why BigQuery (even in alpha)?

It was the only thing that could swallow that firehose and stay queryable minutes later — crucial when you’re under attack and your data source must not melt down.

Pipeline (all shell + cron)

Edge nodes → write JSON logs locally; a local cron pushes them to Cloud Storage

Tiny VM with a cron loop (rough sketch below)

   - Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
   - Executes `bq load …` into the customer’s isolated dataset.
   - On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery
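
To make the VM's loop concrete, here's a hedged sketch of what such a cron job could look like; the bucket name, table name, and batch size are assumptions, and it presumes the destination table already exists with a schema:

    #!/bin/bash
    # Hypothetical loader cron job (names are illustrative, not Reblaze's).
    set -euo pipefail

    BUCKET="gs://example-logs"            # hypothetical bucket
    TABLE="customer_1234.raw_requests"    # hypothetical per-customer dataset.table

    # Grab a batch of small blobs from pending/ (gsutil compose merges
    # up to 32 source objects into a single one).
    mapfile -t BLOBS < <(gsutil ls "$BUCKET/pending/*.json" 2>/dev/null | head -n 32)
    [ "${#BLOBS[@]}" -eq 0 ] && exit 0

    BATCH="$BUCKET/processing/batch-$(date +%s).json"
    gsutil compose "${BLOBS[@]}" "$BATCH"
    gsutil -m rm "${BLOBS[@]}"

    # Load the composed blob as newline-delimited JSON.
    if bq load --source_format=NEWLINE_DELIMITED_JSON "$TABLE" "$BATCH"; then
        gsutil mv "$BATCH" "$BUCKET/done/"      # success: archive it
    else
        gsutil mv "$BATCH" "$BUCKET/pending/"   # failure: retried on the next run
    fi

Every state transition is just an object move between the pending/, processing/, and done/ prefixes, so a failed load simply gets picked up again on the next cron tick.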

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines—Dataflow, Logstash, etc.—never matched its throughput or reliability.