Item 42246846

Even better, based on queue latency instead of length

jcrites • 2 months ago

The single best metric I've found for scaling things like this is the percent of concurrent capacity that's in use. I wrote about this in a previous HN comment: https://news.ycombinator.com/item?id=41277046

Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.

3 replies

torginus • 2 months ago

Yeah this is why I hate AWS - I did a similar task runner thing and what I ended up doing is just firing up a small controller instance which manually creates and destroys instances based on demand, and schedules work on them by ssh-ing into the running instances, and piping the logs to a db.

I did read up on the 'proper' solution and it made my head spin.

You're supposed to use AWS batch, creating instances with autoscaling groups, pipe the logs to CloudWatch, and serve it from the on the frontend etc.

The number of new concepts I'd have to master, I have no control over if they went wrong, except to chase after internet erudites and spending weeks talking to AWS support is staggering.

And there's the little things, like CloudWatch logs costing like $0.5/GB, while an EBS block volume costs like $0.08, with S3 being even cheaper than that.

If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.

Yeah, my solution is shit and Im a filthy subhuman, but at least I know how every part of my code works, and the amount of code I'd had to write is not more than double that if I used AWS solutions, but I probably saved a lot of time debugging proprietary infra.

ndjdjddjsjj • 2 months ago

It is a shame that comment is not a blog post!

1 reply

Lanzaa • 2 months ago

You will like the Strange Loop 2017 talk about this subject:

"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk

Concurrent capacity might not be the best metric.

1 reply

ndjdjddjsjj • 2 months ago

Chefs kiss!

stolsvik • 2 months ago

Duty Time, straight out.