What would you say would be the "clean" way to implement a pool of workers (using EC2 instances too)?
Autoscaling and task-queue-based workloads, if my cloud theory is still relevant.
Agreed. Scaling based on the length of the queue, up to some maximum.
Even better, based on queue latency instead of length
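For example (a rough sketch only, assuming the queue is SQS and boto3 is available; the queue name and thresholds are placeholders, not anything from this thread), the scaling signal becomes the age of the oldest message rather than the number of messages:

    # Rough sketch: treat the age of the oldest message as the scaling signal.
    # Queue name and thresholds below are placeholders.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def queue_latency_seconds(queue_name: str) -> float:
        """Recent maximum ApproximateAgeOfOldestMessage for an SQS queue, in seconds."""
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/SQS",
            MetricName="ApproximateAgeOfOldestMessage",
            Dimensions=[{"Name": "QueueName", "Value": queue_name}],
            StartTime=now - timedelta(minutes=5),
            EndTime=now,
            Period=300,
            Statistics=["Maximum"],
        )
        points = resp["Datapoints"]
        return max(p["Maximum"] for p in points) if points else 0.0

    def scale_hint(latency_s: float, scale_out_above: float = 60.0, scale_in_below: float = 5.0) -> int:
        """+1 worker if work waits too long, -1 if the queue is draining quickly, else 0."""
        if latency_s > scale_out_above:
            return 1
        if latency_s < scale_in_below:
            return -1
        return 0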
The single best metric I've found for scaling things like this is the percent of concurrent capacity that's in use. I wrote about this in a previous HN comment: https://news.ycombinator.com/item?id=41277046
Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.
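A minimal sketch of that metric (the WorkerStats shape and the 70% target are my assumptions for illustration, not from the comment above): each worker reports how many of its concurrent slots are busy, and the controller scales on the fleet-wide ratio, which stays meaningful whether a task takes milliseconds or hours.

    # Sketch: scale on the percent of concurrent capacity in use.
    # The numbers would come from whatever the workers already report.
    from dataclasses import dataclass

    @dataclass
    class WorkerStats:
        busy_slots: int   # tasks currently executing on this worker
        total_slots: int  # maximum tasks this worker can run concurrently

    def utilization(fleet: list[WorkerStats]) -> float:
        """Fraction of the fleet's total concurrency currently in use."""
        total = sum(w.total_slots for w in fleet)
        busy = sum(w.busy_slots for w in fleet)
        return busy / total if total else 1.0  # no fleet yet: treat as saturated

    def desired_worker_count(fleet: list[WorkerStats], target: float = 0.7) -> int:
        """Size the fleet so utilization lands near the target (e.g. 70% of slots busy)."""
        # A ratio of capacity in use needs no per-workload tuning, unlike a raw
        # queue-length threshold, which is why it stays stable as workloads change.
        return max(1, round(len(fleet) * utilization(fleet) / target))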
Yeah, this is why I hate AWS. I did a similar task runner thing, and what I ended up doing was just firing up a small controller instance that manually creates and destroys instances based on demand, schedules work on them by ssh-ing into the running instances, and pipes the logs to a db.
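Stripped down, that controller is roughly the following (a sketch only, assuming boto3 and plain ssh; the AMI, key pair, user, and command are placeholders, and the real version also polls demand and writes the logs to the db):

    # Sketch of the DIY controller: launch an instance, run a task over ssh,
    # capture its output, then tear the instance down. The AMI, key pair,
    # user, and command are placeholders.
    import subprocess

    import boto3

    ec2 = boto3.client("ec2")

    def launch_worker() -> tuple[str, str]:
        resp = ec2.run_instances(
            ImageId="ami-xxxxxxxx",   # placeholder AMI with the worker environment baked in
            InstanceType="t3.medium",
            KeyName="my-key",         # placeholder key pair
            MinCount=1,
            MaxCount=1,
        )
        instance_id = resp["Instances"][0]["InstanceId"]
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        desc = ec2.describe_instances(InstanceIds=[instance_id])
        ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]
        return instance_id, ip  # a real controller would also wait for sshd to come up

    def run_task(ip: str, command: str) -> str:
        # Schedule work by ssh-ing in; stdout is what would get piped to the db.
        result = subprocess.run(
            ["ssh", "-o", "StrictHostKeyChecking=no", f"ec2-user@{ip}", command],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    def terminate_worker(instance_id: str) -> None:
        ec2.terminate_instances(InstanceIds=[instance_id])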
I did read up on the 'proper' solution and it made my head spin.
You're supposed to use AWS Batch, create instances with autoscaling groups, pipe the logs to CloudWatch, serve them from there on the frontend, etc.
The number of new concepts I'd have to master is staggering, and I'd have no control over them if they went wrong, except to chase after internet erudites and spend weeks talking to AWS support.
And there are the little things, like CloudWatch logs costing around $0.50/GB, while an EBS block volume costs around $0.08/GB, with S3 being even cheaper than that.
If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.
Yeah, my solution is shit and I'm a filthy subhuman, but at least I know how every part of my code works. The amount of code I had to write is no more than double what it would have been with the AWS solutions, and I probably saved a lot of time not debugging proprietary infra.
It is a shame that comment is not a blog post!
You will like the Strange Loop 2017 talk about this subject:
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk
Concurrent capacity might not be the best metric.
Not sure; probably an EKS cluster with a job scheduler pod that creates jobs via the batch API. The scheduler pod might be replaced by a lambda. Another possibility is something cooked up with a lambda creating EC2 instances via CDK, with the whole thing tracked in a DynamoDB table (roughly sketched below).
The first one is probably cleaner (though I don't like it: it means the instance needs to be a Kubernetes node, and that comes with a bunch of baggage).
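For the second option, a hedged sketch of the lambda handler (using boto3 directly rather than CDK; the table name, AMI, and run-job entrypoint are invented): it launches one instance per job and records it in DynamoDB so something else can track and reap it.

    # Sketch of the lambda-plus-DynamoDB variant: launch one EC2 instance per
    # job and record it in a table so a separate reaper can clean up.
    # Table name, AMI, and the run-job entrypoint are invented.
    import time

    import boto3

    ec2 = boto3.client("ec2")
    jobs = boto3.resource("dynamodb").Table("worker-jobs")  # hypothetical table

    def handler(event, context):
        job_id = event["job_id"]
        resp = ec2.run_instances(
            ImageId="ami-xxxxxxxx",  # placeholder AMI with the worker code baked in
            InstanceType="t3.medium",
            MinCount=1,
            MaxCount=1,
            UserData=f"#!/bin/bash\nrun-job {job_id}\n",  # hypothetical entrypoint script
        )
        instance_id = resp["Instances"][0]["InstanceId"]
        jobs.put_item(Item={
            "job_id": job_id,
            "instance_id": instance_id,
            "state": "running",
            "started_at": int(time.time()),
        })
        return {"job_id": job_id, "instance_id": instance_id}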