Ah, so it's not only me that uses AWS primitives to hackily implement all sorts of synchronization primitives.
My other favorite pattern is implementing a pool of workers by querying EC2 instances with a certain tag in a stopped state and starting them. Starting the instance can succeed only once, which means I managed to snatch the machine. If it fails, I try again and grab another one.
This is one of those things I never advertised out of professional shame, but it works, it's bulletproof and dead simple, and it doesn't require any additional infra.
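For anyone curious what that looks like, here's a minimal boto3 sketch of the claim-by-start trick; the tag key/value, region, and the `PreviousState` check are my own placeholders/additions, not necessarily how the parent does it:

```python
# Rough sketch of the "claim a worker by starting it" pattern with boto3.
# Tag key/value and region are illustrative placeholders.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def claim_worker():
    # Find stopped instances carrying the pool tag.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:pool", "Values": ["workers"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    for instance_id in candidates:
        try:
            result = ec2.start_instances(InstanceIds=[instance_id])
        except ClientError:
            # Start failed (wrong state, no capacity, ...): try the next one.
            continue
        # Only the caller that actually flipped the instance from "stopped"
        # to "pending" has claimed it; anyone else sees a different previous
        # state in the response and moves on.
        prev = result["StartingInstances"][0]["PreviousState"]["Name"]
        if prev == "stopped":
            return instance_id
    return None  # pool exhausted
```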
If you use hourly billed machines... Sounds like the world's most expensive semaphore :-)
EC2 bills by the second.
Some...
"Your Amazon EC2 usage is calculated by either the hour or the second based on the size of the instance, operating system, and the AWS Region where the instances are launched" - https://repost.aws/knowledge-center/ec2-instance-hour-billin...
What would you say would be the "clean" way to implement a pool of workers (using EC2 instances too)?
Autoscaling and task queue based workloads, if my cloud theory is still relevant.
Agreed. Scaling based on the length of the queue, up to some maximum.
Even better, based on queue latency instead of length
The single best metric I've found for scaling things like this is the percent of concurrent capacity that's in use. I wrote about this in a previous HN comment: https://news.ycombinator.com/item?id=41277046
Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.
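A toy sketch of the idea, assuming one job slot per worker; the target utilization and the bounds are made-up numbers, not from the linked comment:

```python
import math

def desired_workers(busy_slots: int, total_slots: int,
                    target_utilization: float = 0.7,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Scale so the fraction of concurrent capacity in use stays near the
    target, regardless of how long the queue happens to be."""
    if total_slots == 0:
        return min_workers
    utilization = busy_slots / total_slots  # percent of concurrent capacity in use
    # Above target -> grow the fleet; below target -> shrink it.
    desired = math.ceil(total_slots * utilization / target_utilization)
    return max(min_workers, min(max_workers, desired))
```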
Yeah, this is why I hate AWS. I did a similar task-runner thing, and what I ended up doing is just firing up a small controller instance that creates and destroys instances based on demand, schedules work on them by SSH-ing into the running instances, and pipes the logs to a DB.
I did read up on the 'proper' solution and it made my head spin.
You're supposed to use AWS Batch, create instances with Auto Scaling groups, pipe the logs to CloudWatch, serve them from there on the frontend, etc.
The number of new concepts I'd have to master is staggering, and I'd have no control over them if they went wrong, except to chase after internet erudites and spend weeks talking to AWS support.
And there are the little things, like CloudWatch Logs costing something like $0.50/GB ingested, while an EBS volume costs like $0.08/GB-month, with S3 being even cheaper than that.
If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.
Yeah, my solution is shit and I'm a filthy subhuman, but at least I know how every part of my code works, the amount of code I had to write is not more than double what it would have been with the AWS solutions, and I probably saved a lot of time not having to debug proprietary infra.
It is a shame that comment is not a blog post!
You will like the Strange Loop 2017 talk about this subject:
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk
Concurrent capacity might not be the best metric.
Not sure; probably an EKS cluster with a job-scheduler pod that creates jobs via the batch API (the scheduler pod might be replaced by a Lambda). Another possibility is something cooked up with a Lambda creating EC2 instances via CDK, with the whole thing tracked in a DynamoDB table.
The first one is probably cleaner (though I don't like it; it means the instance needs to be a Kubernetes node, and that comes with a bunch of baggage).
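If you go the DynamoDB route, the "only one claimant wins" part is just a conditional write; a rough sketch, with the table name and attributes invented for illustration:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def claim_worker(worker_id: str) -> bool:
    """Atomically flip a worker row from 'idle' to 'busy'; only one caller wins."""
    try:
        dynamodb.update_item(
            TableName="workers",                      # hypothetical table
            Key={"worker_id": {"S": worker_id}},
            UpdateExpression="SET #s = :busy",
            ConditionExpression="#s = :idle",         # fails if already claimed
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":busy": {"S": "busy"},
                ":idle": {"S": "idle"},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else got it
        raise
```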
This actually sounds interesting. Do you precreate the workers beforehand and then just keep them in a stopped state?
Yeah. One of the goals was startup time, so it made sense to precreate them. In practice we never ran out of free machines (and if we did, I have a CDK script to make more), and infinite scaling is a pain in the butt anyway due to having to manage subnets etc.
Cost-wise we're only paying for the EBS volumes of the stopped instances, which are like 4 GB each, so they cost practically nothing; we spend less than a dollar per month for the whole bunch.
Warm pools are a supported feature of AWS Auto Scaling groups. They work as you're describing (a pool of instances kept in a stopped state ready to use, paying only for the EBS volume where relevant): https://aws.amazon.com/blogs/compute/scaling-your-applicatio...
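For reference, setting one up is roughly a single call; the ASG name and sizes here are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep up to 10 pre-initialized instances parked in the Stopped state,
# ready to be moved into the ASG when it scales out.
autoscaling.put_warm_pool(
    AutoScalingGroupName="worker-asg",   # placeholder name
    MaxGroupPreparedCapacity=10,
    MinSize=2,
    PoolState="Stopped",
)
```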
> we spend less than a dollar per month for the whole bunch
This does not change the point, I'm just being pedantic, but:
4 GB of gp3 EBS is 4 × $0.08 = $0.32 per month at list price; assuming a 50% discount (not unusual), that's $0.16 per instance, so less than a dollar covers only... 6 instances.
I always thought that stopped instances would cost money as well?!
You're only paying for the hard drive (and the VPC stuff, if you want to be pedantic). The downside is that when you try to start your instance, it might not start if AWS doesn't have the capacity (rare, but I've seen it happen, particularly with larger, more exotic instances).
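For what it's worth, that failure surfaces as a distinct error code you can catch and retry around (e.g. by falling back to another instance); a rough sketch:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def start_or_give_up(instance_id: str) -> bool:
    try:
        ec2.start_instances(InstanceIds=[instance_id])
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            # AWS has no spare hardware for this instance type right now;
            # back off, try another instance/AZ, or wait.
            return False
        raise
```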