Can you say more about the use case? What problem were you solving? How did it work exactly? Sounds interesting so I'd like to learn more.
Sure.
We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
Why BigQuery (even in alpha)?
It was the only thing that could swallow that firehose and stay queryable minutes later, which is crucial when you're under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes → write JSON logs locally; a local cron pushes them to Cloud Storage (the whole flow is sketched after this list)
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
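
A minimal sketch of that loop, reconstructed from memory rather than the actual script. The bucket layout, the dataset/table name, and the exact flags are assumptions; the destination table is assumed to already exist with a schema:

```bash
#!/usr/bin/env bash
# Reconstruction from memory, not the original script. Bucket, dataset and
# table names are placeholders; the destination table is assumed to already
# exist with a schema (otherwise `bq load` would need one, or --autodetect).
set -euo pipefail

BUCKET="gs://example-waf-logs"        # assumed layout: pending/ processing/ done/
TABLE="customer_1234.http_requests"   # assumed per-customer dataset.table

# Edge nodes have already dropped newline-delimited JSON blobs into
# ${BUCKET}/pending/ from their own local cron (roughly a `gsutil cp`).

# 1. Grab up to 32 pending blobs (gsutil compose's per-call limit) and
#    stitch them into one bigger blob under processing/.
mapfile -t PENDING < <(gsutil ls "${BUCKET}/pending/*.json" 2>/dev/null | head -n 32)
[ "${#PENDING[@]}" -eq 0 ] && exit 0

BATCH="${BUCKET}/processing/batch-$(date +%s).json"
gsutil compose "${PENDING[@]}" "${BATCH}"
gsutil -m rm "${PENDING[@]}"

# 2. Load the composed blob into the customer's isolated dataset.
if bq load --source_format=NEWLINE_DELIMITED_JSON "${TABLE}" "${BATCH}"; then
  # 3a. Success: archive it under done/.
  gsutil mv "${BATCH}" "${BUCKET}/done/"
else
  # 3b. Failure: drop it back into pending/ so the next run retries it.
  gsutil mv "${BATCH}" "${BUCKET}/pending/"
fi
```

The retry logic is just the folder moves: whatever is still sitting in `pending/` gets picked up again on the next cron tick.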
Downstream ML/alerting pulls straight from BigQuery.

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines (Dataflow, Logstash, etc.) never matched its throughput or reliability.
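
For flavor, the downstream pull could be as plain as a scheduled `bq query`; everything below (table, columns, and the standard-SQL dialect, which didn't exist yet back then) is hypothetical:

```bash
# Hypothetical downstream pull; table and column names are invented.
bq query --use_legacy_sql=false '
  SELECT client_ip, COUNT(*) AS hits
  FROM `customer_1234.http_requests`
  WHERE ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
  GROUP BY client_ip
  ORDER BY hits DESC
  LIMIT 100'
```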