Ah, so it's not only me that uses AWS primitives to hackily implement all sorts of synchronization primitives.
My other favorite pattern is implementing a pool of workers by querying EC2 instances with a certain tag in a stopped state and starting them. Starting the instance can succeed only once, which means I managed to snatch the machine. If it fails, I try again and grab another one.
This is one of those things I never advertised out of professional shame, but it works, it's bulletproof and dead simple, and it doesn't require any additional infra.
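For anyone curious what that looks like, here's a minimal boto3 sketch of the claim-by-start trick; the tag key/value, region, and the `PreviousState` check are my own placeholders/additions, not necessarily how the parent does it:

```python
# Rough sketch of the "claim a worker by starting it" pattern with boto3.
# Tag key/value and region are illustrative placeholders.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def claim_worker():
    # Find stopped instances carrying the pool tag.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:pool", "Values": ["workers"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    for instance_id in candidates:
        try:
            result = ec2.start_instances(InstanceIds=[instance_id])
        except ClientError:
            # Start failed (wrong state, no capacity, ...): try the next one.
            continue
        # Only the caller that actually flipped the instance from "stopped"
        # to "pending" has claimed it; anyone else sees a different previous
        # state in the response and moves on.
        prev = result["StartingInstances"][0]["PreviousState"]["Name"]
        if prev == "stopped":
            return instance_id
    return None  # pool exhausted
```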
If you use hourly billed machines... Sounds like the world's most expensive semaphore :-)
EC2 bills by the second.
Some...
"Your Amazon EC2 usage is calculated by either the hour or the second based on the size of the instance, operating system, and the AWS Region where the instances are launched" - https://repost.aws/knowledge-center/ec2-instance-hour-billin...
What would you say would be the "clean" way to implement a pool of workers (using EC2 instances too)?
Autoscaling and task queue based workloads, if my cloud theory is still relevant.
Agreed. Scaling based on the length of the queue, up to some maximum.
Even better, based on queue latency instead of length
The single best metric I've found for scaling things like this is the percent of concurrent capacity that's in use. I wrote about this in a previous HN comment: https://news.ycombinator.com/item?id=41277046
Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.
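A toy sketch of the idea, assuming one job slot per worker; the target utilization and the bounds are made-up numbers, not from the linked comment:

```python
import math

def desired_workers(busy_slots: int, total_slots: int,
                    target_utilization: float = 0.7,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Scale so the fraction of concurrent capacity in use stays near the
    target, regardless of how long the queue happens to be."""
    if total_slots == 0:
        return min_workers
    utilization = busy_slots / total_slots  # percent of concurrent capacity in use
    # Above target -> grow the fleet; below target -> shrink it.
    desired = math.ceil(total_slots * utilization / target_utilization)
    return max(min_workers, min(max_workers, desired))
```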
Yeah, this is why I hate AWS. I did a similar task-runner thing, and what I ended up doing is just firing up a small controller instance that creates and destroys instances based on demand, schedules work on them by SSH-ing into the running instances, and pipes the logs to a DB.
I did read up on the 'proper' solution and it made my head spin.
You're supposed to use AWS Batch, create instances with Auto Scaling groups, pipe the logs to CloudWatch, serve them from there on the frontend, etc.
The number of new concepts I'd have to master is staggering, and I'd have no control over them if they went wrong, except to chase after internet erudites and spend weeks talking to AWS support.
And there are the little things, like CloudWatch Logs costing something like $0.50/GB ingested, while an EBS volume costs like $0.08/GB-month, with S3 being even cheaper than that.
If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.
Yeah, my solution is shit and I'm a filthy subhuman, but at least I know how every part of my code works, the amount of code I had to write is not more than double what it would have been with the AWS solutions, and I probably saved a lot of time not having to debug proprietary infra.
It is a shame that comment is not a blog post!
You will like the Strange Loop 2017 talk about this subject:
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk
Concurrent capacity might not be the best metric.
Not sure; probably an EKS cluster with a job-scheduler pod that creates jobs via the batch API (the scheduler pod might be replaced by a Lambda). Another possibility is something cooked up with a Lambda creating EC2 instances via CDK, with the whole thing tracked in a DynamoDB table.
The first one is probably cleaner (though I don't like it; it means the instance needs to be a Kubernetes node, and that comes with a bunch of baggage).
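If you go the DynamoDB route, the "only one claimant wins" part is just a conditional write; a rough sketch, with the table name and attributes invented for illustration:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def claim_worker(worker_id: str) -> bool:
    """Atomically flip a worker row from 'idle' to 'busy'; only one caller wins."""
    try:
        dynamodb.update_item(
            TableName="workers",                      # hypothetical table
            Key={"worker_id": {"S": worker_id}},
            UpdateExpression="SET #s = :busy",
            ConditionExpression="#s = :idle",         # fails if already claimed
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":busy": {"S": "busy"},
                ":idle": {"S": "idle"},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else got it
        raise
```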
This actually sounds interesting. Do you precreate the workers beforehand and then just keep them in a stopped state?
Yeah. One of the goals was startup time, so it made sense to precreate them. In practice we never ran out of free machines (and if we did, I have a CDK script to make more), and infinite scaling is a pain in the butt anyway due to having to manage subnets etc.
Cost-wise we're only paying for the EBS volumes of the stopped instances, which are like 4 GB each, so they cost practically nothing; we spend less than a dollar per month for the whole bunch.
Warm pools are a supported feature of AWS Auto Scaling groups. They work as you're describing (a pool of instances kept in a stopped state ready to use, paying only for the EBS volume where relevant): https://aws.amazon.com/blogs/compute/scaling-your-applicatio...
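For reference, setting one up is roughly a single call; the ASG name and sizes here are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep up to 10 pre-initialized instances parked in the Stopped state,
# ready to be moved into the ASG when it scales out.
autoscaling.put_warm_pool(
    AutoScalingGroupName="worker-asg",   # placeholder name
    MaxGroupPreparedCapacity=10,
    MinSize=2,
    PoolState="Stopped",
)
```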
> we spend less than a dollar per month for the whole bunch
This does not change the point, I'm just being pedantic, but:
4 GB of gp3 EBS is 4 × $0.08 = $0.32 per month at list price; assuming a 50% discount (not unusual), that's $0.16 per instance, so less than a dollar covers only... 6 instances.
I always thought that stopped instances would cost money as well?!
You're only paying for the hard drive (and the VPC stuff, if you want to be pedantic). The downside is that when you try to start your instance, it might not start if AWS doesn't have the capacity (rare, but I've seen it happen, particularly with larger, more exotic instances).
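For what it's worth, that failure surfaces as a distinct error code you can catch and retry around (e.g. by falling back to another instance); a rough sketch:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def start_or_give_up(instance_id: str) -> bool:
    try:
        ec2.start_instances(InstanceIds=[instance_id])
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            # AWS has no spare hardware for this instance type right now;
            # back off, try another instance/AZ, or wait.
            return False
        raise
```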