mcdeltat 2 days ago

You have to be extremely careful with benchmarks like these. It's very easy to measure one thing and conclude another. In this case, it really only measures the compiler's ability to optimise the `for` loop construct - nothing else. It says little about a language's general "speed", whatever that means (which is usually the implication of tests like this).

Some points to consider:

- `for` is not how you'd write an optimised loop in all languages.

- Be VERY suspicious of microbenchmarks on C, C++, and similar - the compilers are smarter than you think and you won't get the code you wrote.

- Languages do arrays differently - added memory effects on performance.

- Languages do numbers differently.

- Summing and modulo is a somewhat contrived example. Modulo is one of the slower instructions out there.

comex 1 day ago

Specifically, there are two loops, one nested inside the other, and the total number of iterations is 1 billion.

I was hoping this would test how long it takes to compile 1 billion nested loops. As in, a 2-billion-line source file where the first billion lines each open a loop and the next billion lines close them all. That would be… interesting. Pointless, but interesting.

UltraSane 1 day ago

I would expect that would break the Python interpreter. Hmm. Now you have me wanting to try it. Writing Python that writes Python is always fun.
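
A minimal sketch of what that generator might look like (everything here - the file name, the range(2) bound, the tiny depth - is made up for illustration; at the real depth of a billion, the file size becomes the problem the next comment describes):

    # Hypothetical generator: emits a Python file with `depth` nested loops.
    depth = 5  # stand-in; 1_000_000_000 would produce an absurdly large file

    with open("nested.py", "w") as f:
        for i in range(depth):
            f.write("    " * i + f"for v{i} in range(2):\n")
        f.write("    " * depth + "pass\n")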

andreareina 21 hours ago

The indentation alone would be half an exabyte. You could transform it into an equivalent representation with a billion-variable list (or numpy array) that should fit within the limits of the interpreter.
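
The estimate checks out: even at one space per indent level, the indentation alone is roughly 0 + 1 + ... + 999,999,999 ≈ 5 x 10^17 characters. A toy sketch of the flattened representation (depth and base are tiny stand-ins I picked; a numpy array scales the counter list to a billion entries):

    # Keep all the loop variables in one array and step it like an odometer,
    # instead of actually nesting a billion `for` statements.
    import numpy as np

    depth, base = 4, 3
    counters = np.zeros(depth, dtype=np.int64)

    iterations = 0
    while True:
        iterations += 1                # innermost loop body would run here
        i = depth - 1
        while i >= 0 and counters[i] == base - 1:
            counters[i] = 0            # "close" this loop and carry outward
            i -= 1
        if i < 0:
            break
        counters[i] += 1

    assert iterations == base ** depth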

maxmcd 2 days ago

Another fave of mine by this person: https://benjdd.com/az/

Waterluvian 1 day ago

Wow what a creative way to visualize the concept.

I think the curves make it tricky to maintain objective perception, but the layout contributes significantly to the idea that internal latency is of a different order than external latency.

ayaros 2 days ago

I'm working on a web app - the sort of thing I think people on here will enjoy, and I'll definitely share it when it's presentable. It's all in JS, including the graphical aspects (I'm drawing to a canvas). For per-pixel rendering I have 4 levels of nested loops. I feel like I've done everything in my power to optimize the process, and it's still not at the ideal speed. Really, it ought to be about 5-10 times faster than it is now.

I've been thinking of getting it into WASM somehow - it's really just one function that needs to be replaced by a WASM module - but one of the things that has given me pause is benchmarks like this, as well as other anecdotes, that demonstrate I'd only get a minor speedup from WASM. I guess part of this is due to modern JS engines being so optimized.

Just about all of the relevant code is adding and subtracting integers, but there's some Boolean logic. There's also some multiplication in there that I could eliminate if I really had to, although at this point I'm not certain it would make a significant difference. Most of the data that contributes to the value of each pixel is spread across a tree of objects, and there's some polymorphism involved too. The point is, it's not like I'm just dealing with a bunch of contiguous data in memory that is easily cached by the CPU, which might speed things up considerably. I guess I'm just venting now, lamenting a nightmare of my own creation...

catapart 2 days ago

For the record - I can't imagine any tangible value in writing a graphics engine (and form library) for an environment that was literally designed as a graphics engine for forms. But, that disclaimer aside, I've considered WASM for graphics for a game engine, and I have had some trial success with allocating a shared array buffer, writing to it in WASM, and reading from it at a js-friendly rate of 30-60 fps (whatever requestAnimationFrame is serving). The idea being that you treat the shared array buffer as your "screen" and write to it from the WASM as if it were an image buffer, while on the JS side you read it as an image buffer onto the canvas. Gives you a nice, finite bandwidth and an easy-to-scale maximum for how much data you can pass through, and neither JS nor WASM ever has to pass any time-critical data.

Of course, this setup leaves a very thin JS client - it's essentially just rendering the same image over and over. So if you've already got everything done in JS, it would mean porting all of your functional code over to something that can compile into WASM. Not a huge lift, but not fun either.

And then, on the other hand, if it's just rendering you're seeing as a bottleneck, JS works pretty well with webGL and webGPU for simple form rendering. A little loop unrolling might be a lot easier than porting the rest of the app.

Anyway, best of luck on it!

ayaros 1 day ago

The graphics side of my program is structured as a tree of objects. The root is the entire display. Each object contains zero or more children, and masks those children so they only appear within the rectangular bounds of the object. If an object has an image class - and there are several of these, using different data structures - then it further masks its children according to the image's alpha channel. I also have some blending modes that can be turned on. Any object can have a blending mode that affects the appearance of the object based on the image of the object, or the images below it.

Is that something I could port to WebGL? Can objects contain or be parents of other objects in that way, and mask their children to only appear directly in front of them? I have done a tiny bit of WebGL for a CS assignment but it was ages ago. Hopefully once I make this public I'll be able to get some feedback on all this. I don't want to say too much more as it would begin to spoil things!

catapart 1 day ago

It doesn't sound like you have an architecture that is particularly conducive to renderer-based programming, but that's not to say it's a difficult or complicated port. That all depends on the details.

When going to a dedicated graphics renderer, the biggest "change" from standard dev is "unrolling" your loops. So, right now, you have a bunch of parents and children and parents inform children. This is not, at all, different from a game engine which allows nesting of render objects (very common). Usually, the "transform" of a parent mesh is added to the transform of its children, for a very basic example.

If you're working in 3D, there's a bit of a complication with alpha channels because not only do you have to worry about blending, but you also have to deal with the intersection of alpha objects - simple front-to-back-then-back-to-front rendering won't cover that. But since you're just doing forms, you can be sure that all of your alpha content will be rendered after all of your opaque content, and blended according to its mode at that time, so it's a fairly straightforward version of "unrolling" that you would need.

But all of that is to say: no, WebGL (or any graphics API) won't have any kind of built in way to manage nesting objects. Those relations would have to be defined by you and uploaded to the GPU and then referenced separately, rather than looping through each parent and drilling down into the objects. It's just a different type of problem-solving. Rather than dealing with things per-object, you're dealing with them per-pixel (or pixel group). So you tell each pixel how it should render, based on what objects it is expected to describe at that pixel, and then let them all do their work at the exact same time. It's less "layer painting" and more "arrange and stamp".

lerp-io 1 day ago

have you done any benchmarks to compare wasm performance with js? i have been doing some benchmark test cases after a large refactor of my game engine written in typescript, which is built entirely on top of shared buffer arrays (using the ECS pattern, or so it's called i guess). i'm done benching for a while now and am ok with the results. i also compared my handcrafted quadtree sort/query algorithm with an off-the-shelf rust implementation that matched the test requirements and saw only a 30% gain, and figured anything lower than that would not be worth the effort, assuming rust compiled to wasm is the same or slower. i never bothered to test compiled wasm because i could not figure out how to share my buffer arrays with the rust code using wasm-bindgen fast enough to bother spending more time on more benching lol

catapart 1 day ago

Nothing rigorous at all, no. I did do a lot of practical tests, as a way to compare what I had been doing to what was planned to be done, but I've found js is plenty performant for the types of CPU work I needed done - and CPU work is where the big benefit of WASM would be.

Since I've mostly focused on rendering, there's just not much that really should be done on the CPU side. But I do have plans to extend my renderer to have a CPU pipeline which doesn't rely on WebGPU or WebGL, and I imagine I'll be doing quite a bit of benchmarking there. Forcing the CPU to render is going to make js choke - it's just a matter of how much it takes. I imagine any forward rendering will be fine, but there's no chance it does path tracing worth a damn. So somewhere between "textures" and "realtime global illumination" I expect I might have a lot of data about what WASM can do to help in that arena (if anything; though I am expecting it to help quite a bit, especially with stuff like SIMD).

On the other hand entirely: I have been using the Rapier physics engine, since it has a javascript version, and as far as I know it heavily utilizes WASM for its physics calculations, so there might be some good benchmarking of WASM benefits surrounding that library.

refulgentis 1 day ago

I had a fun 2 years where I basically had 800 critical lines of math code that had to be on every platform, one version in the native language of the platform, and one version as performance optimal as possible.

I was absolutely stunned to find that on the web WASM ~= JS, and both weren't far off from a naive C++ stdlib port. I had to make custom data structures instead of using stdlib ones to really get significantly different performance. IIRC my most optimized C++ converted to WASM was about 70% the speed of JS.

com2kid 1 day ago

Don't write every pixel by hand to the screen. At high resolutions that just won't work. No modern UI does that anymore, they all use acceleration of some type. I ran into this trap when I tried to write a simple canvas game engine.

Just caching blobs of things that don't change frame to frame got me a 5x-8x speedup, even if where I render those things changes frame to frame.

ayaros 1 day ago

Trust me, I learned this pretty quickly. Areas are only queued when the display has to change in some way. There's also a bunch of preparation before drawing a region of the screen; only the objects that appear within the specified bounds are even considered.

superjan 2 days ago

It's a billion integer modulus operations, summed. Div/mod is by far the slowest arithmetic instruction on a CPU, so it underestimates C/Rust performance. I guess the author chose that to prevent the C compiler from optimizing the loop away.

bddicken 2 days ago

Author here: Yeah, the mod was added to reduce risk of loop(s) getting optimized away.

remcob 2 days ago

Did you verify this through disassembly? These loops can still be replaced by a closed form expression and I wouldn’t be surprised if LLVM figured this out.

namibj 1 day ago

Sooooo.... anyone know a compiler that actually does the closed-form tactic on the loop(s)? If I'm seeing this correctly, in theory the program could be compiled down to something that finishes nearly instantly, with no loops at all?
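
For the inner loop at least, the closed form is elementary. A sketch in Python (the loop shape - summing j % u over 100k iterations - is taken from the ports elsewhere in this thread):

    # sum(j % u for j in range(n)) in O(1): q full 0..u-1 cycles plus a tail.
    def mod_sum(n, u):
        q, tail = divmod(n, u)
        return q * u * (u - 1) // 2 + tail * (tail - 1) // 2

    assert mod_sum(100_000, 7) == sum(j % 7 for j in range(100_000))

Per the Godbolt check downthread, Clang at -O3 doesn't find this; it just unrolls the loops.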

Dylan16807 2 days ago

If it turned the inner loop into a closed form expression, I would expect the outer loop to go through 10k iterations a lot faster than needing half a second.

zamadatix 2 days ago

Running the C code through Godbolt with the same Clang and -O3 optimizations, it looks like you're right: there are still 2 loops being performed. The loops are unrolled to perform 4 iterations at a time, but it's otherwise the same. Other than that it didn't do too much with the loops. Hilariously, there is a god-awful set of shifts and magic numbers... to save a single modulo operation from running once on the line declaring 'r'.

See here for why the Go version performs worse https://news.ycombinator.com/item?id=42251571

wanderingmind 2 days ago

How in the world is pypy faster than cpython by almost 2 orders of magnitude? Is pypy normally faster in other benchmarks? If so, why? Can some python expert pitch in?

lights0123 2 days ago

PyPy uses just-in-time compilation, so for arithmetic-heavy computations not involving the GC, it should absolutely be faster.

cuchoi 1 day ago

This is the exact code you would expect pypy to run faster.

dxbydt 1 day ago

Aside - a long time ago, I worked on a high school math problem which involved summing the thousand fractions 0/1, 1/2, 2/3, ... up to 999/1000, so I did the obvious iteration.

    (display (foldl + 0 (map (lambda (x) (/ x (+ x 1))) (range 0 1000))))

The solution matched and I was happy. Then I thought wouldn't it be neat to do this in a dozen languages and benchmark... but got nowhere with that idea. None of the languages supported fraction addition without jumping through a whole bunch of hoops. I wonder if it's still the same situation. If someone wants to give this a shot, the answer if you do this for the first ten terms is 17819/2520.
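
For what it's worth, plain Python can do this without hoops nowadays via the stdlib (a quick check against the answer given above):

    from fractions import Fraction

    total = sum(Fraction(x, x + 1) for x in range(10))
    print(total)  # 17819/2520, exact rational arithmetic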

rotifer 1 day ago

In Haskell the numerator and denominator of the Rational type are unbounded integers, so one of the (many equivalent) ways of writing it is:

    ghci> sum [i % (i+1) | i <- [0..9]]
    17819 % 2520
    ghci> 
% is the constructor for Rational numbers. I.e. one half is written (and shown as) 1 % 2.

I'll spare people the full result, but:

    ghci> sum [i % (i+1) | i <- [0..999]]
    70755...77483 % 71288...20000
    ghci>

iambvk 2 days ago

Golang folks, after looking at the code, I am surprised and don't understand why the Go version is slow compared to C/C++/Rust. Can anyone please explain? Thank you.

zamadatix 1 day ago

Go's int defaults to 64 bits on a 64 bit machine while C/Rust/Java are all defaulting to 32 bit on the author's machine.

Editing the code and running local to get a difference factor, it looks like the same Go code with int takes 1.5x the amount of time as the int32 code on my machine. That puts it ~right next to Java (assuming the perf change maps 1:1). Remove the GC and it comes in ~right next to C++.

Looking at the compiler assembly output on Godbolt it's fun to see all the extra panic info and checks Go adds before going to the loop whereas the C just segfaults from blasting forward anyways, sometimes with no detail.

Anyways, that's why I don't like these types of "benchmarks" (micro tests made into a media post). By the time someone actually looks into them to see what's up everyone has already seen the results and moved on.

iambvk 1 day ago

Thanks for this information. I wasn't aware that using 64-bit ints makes such a big difference in performance. Thank you.

zamadatix 21 hours ago

Usually less on computation performance (often identical or very close in raw cycle count for the operation) but more on cache/memory performance. The way that maps is platform-specific, of course.

vocx2tx 1 day ago

64-bit mod is much slower than 32-bit mod. Your C program uses int32_t everywhere, while your Go program uses int (which is probably 64-bit on your machine); so this is not a fair comparison. Changing the Go program to use int32 everywhere makes it 35% faster on my machine.
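
You can get a rough feel for the width effect from Python, since numpy exposes both dtypes (a sketch only - elementwise numpy mod is not Go's scalar codegen, and the ratio is machine-dependent):

    import timeit
    import numpy as np

    a32 = np.arange(1_000_000, dtype=np.int32)
    a64 = a32.astype(np.int64)

    # same values, same operation, different integer widths
    print("int32:", timeit.timeit(lambda: a32 % 7, number=100))
    print("int64:", timeit.timeit(lambda: a64 % 7, number=100))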

bddicken 2 days ago

Author here: Yeah, I was surprised that there don't seem to be many options for extra optimizations in the Go compiler. Would be curious if Go experts have more insight or other compiler recommendations.

boyter 2 days ago

I doubt it's the GC kicking in, but you could run it with the following environment variable set just to ensure that it's not doing anything.

    GOGC=-1
EDIT: Trying it out quickly shows a small improvement actually, but so small as to likely be noise as I was doing other things on the machine.

    Summary
      ./mainnogc 1000 ran
        1.01 ± 0.06 times faster than ./maingc 1000

Splizard 20 hours ago

You've set the Go code to use int64 but the other faster implementations are using int32.

catapart 2 days ago

I'm surprised to see javascript so high up on the list! I would have thought PHP was faster, honestly. But I did run the benchmark in my browser and saw that the browser JS in chrome is slightly slower than Deno at ~1.2-1.3s. Even though firefox runs closer to 3 seconds, I'm still surprised that browser js is that much faster than some of those closer to the metal.

paulbgd 1 day ago

It's impressive, but even more interesting: it could still be faster! Java, a JIT language, is at #3, so theoretically any of these languages could get there.

TiredOfLife 1 day ago

It doesn't use PHP's JIT.

davikr 2 days ago

This does not seem like a realistic benchmark target.

mrkeen 2 days ago

Perhaps it's unrealistic in the sense that this is unrealistically-optimisable code.

Real-world code has closures, allocations, and pointer-dereferencing. This code is just iterations, additions, and modulus.

behnamoh 2 days ago

> This code is just iterations...

Which is what the average programmer out there writes. Therefore, I think this is actually a good realistic benchmark.

steveBK123 2 days ago

Precisely - it's bad code, which is what most code is!

handfuloflight 2 days ago

What's a Github repository with good code?

steveBK123 1 day ago

All code is bad, some is useful

handfuloflight 1 day ago

Useful is good. Ergo transitive property.

intelVISA 1 day ago

Nobody's ever seen it!

elzbardico 2 days ago

What the average programmer out there writes is just the tip of the iceberg of what gets executed. And an incredibly small tip.

deanCommie 2 days ago

The other problem with it is that there's another variable involved besides pure speed: CONSISTENCY of speed.

Folks are always surprised to see just how fast Java is able to perform on benchmarks - it has a reputation as such a slow language, so how come it's able to execute almost as fast as C++?

But Java's problem isn't execution performance. It's:

* Startup performance. It takes time to instantiate the JVM. That matters for some, not for others.

* Consistent performance. This is the nature of Garbage Collection. It can surprise you when it executes and drop a performance blip when you least expect it.

Most developers who think they need the fastest performance actually need CONSISTENT performance more than the fastest.

vitus 1 day ago

> * Consistent performance. This is the nature of Garbage Collection. It can surprise you when it executes and drop a performance blip when you least expect it.

This has been getting much better in Java in recent years. Shenandoah has been the default since Java 15, and we've been seeing further improvements since then.

> Processing thread stacks concurrently gives us reliable sub-millisecond pauses in JDK 17.

https://developers.redhat.com/articles/2021/09/16/shenandoah...

oftenwrong 1 day ago

Shenandoah was deemed production-ready in OpenJDK 15, but G1 remains the default collector in 23.

vitus 1 day ago

I stand corrected. I probably got confused by the statement that Shenandoah is bundled by default as of JDK 12.

PaulHoule 2 days ago

I even see some video games (say Dome Keeper) that periodically go out to lunch even on a top-of-the-line gaming PC, and I understand that kind of game is often written in C#, which has a garbage collector. Thing is, I remember playing games on the much weaker PS Vita that were written in C#, but I never noticed pauses like that.

VR games are now the frontier of gaming (in terms of the industry outdoing itself) and there you're approaching hard real time requirements because it is so awful to get seasick. I appreciate it.

bob1029 2 days ago

I think it might have more to do with the specific implementation than the tool.

Most incarnations of C# offer a GC.Collect method and an ability to configure the threading model around collections. Used properly, you can keep maximum delays well-bounded. You still have to work your ass off to minimize allocations, but when some do inevitably occur due to framework internals, etc., you don't want to have to clean up 3 hours worth of it at once. Do it every frame or scene change.

MrLeap 1 day ago

These hitches (in unityland anyways) usually mean the dev is allocating a lot of short lived objects in their render loop. Even little heuristics like keeping the new keyword out of Update() and using object pools prevent the majority of things like this for us over on the unity side.

Dome Keeper's dev Bippinbits is a Godot studio. Some comments on their opinions that I can't read until I dig out my twitter login lol. https://x.com/Bippinbits/status/1702351798302851398

Godot has some perf warts still being churned out, so I don't know what all they've got to do to finesse around them. Even if GDScript was greasy gcc lightning, it's really easy to crunch yourself into some compromises when you're trying to eat your way out of the bottom of ticket mountain before Nextfest or you starve or whatever. Small team gamedev is wild.

PaulHoule 1 day ago

I'm also wondering if my big computer with 64G is part of the problem: if I allocated less RAM to the process, I wonder if the GC pauses would be shorter but more frequent.

ants_a 1 day ago

Java's problem is a culture of overly generic solutions (vendor neutral!) that then try to fix architectural performance issues by liberally adding caching. This makes both startup time and consistency way worse than they need to be.

moralestapia 2 days ago

Nor does it pretend to be.

It's literally just "1 Billion nested loop iterations", methods and results.

You make of that whatever you want.

obituary_latte 2 days ago

The animation is confusing... is one trip to the end and back the 1B iterations? I thought maybe they were counting up, and once the slowest reached the end they would all stop, but that didn't seem to be it. Maybe if it did (and there was a counter on each showing how many bounces they made), it'd be a little clearer.

Interesting nonetheless.

ayaros 2 days ago

Looks like they are just bouncing back and forth at speeds relative to each other; the speed of each ball is all that matters. I also found this diagram confusing at first.

xutopia 2 days ago

It blows my mind to see that speed difference between different languages. Even in the slowest language there, we have lots of ways we could optimize this contrived code, but that is not something we can always do in the real world. Seeing the visualization here really underscores how fast some languages are.

swatcoder 2 days ago

> Seeing the visualization here really underscores how fast some languages are.

Yes and no.

It emphasizes that there are some scenarios where language choice can have an outsized effect on performance when running on modern architectures, which is a reminder some people could use or have never really witnessed.

But it's a very particular scenario that relatively few projects will encounter at a scale that meaningfully impacts them.

In particular, C and Rust are much better positioned to exploit instruction pipelining and avoid cache busting because there are no conditionals and little-to-no out-of-loop code to execute. Even with a massive loop like this, that can be a real world scenario in things like data/signal processing and numerical computation. When you're doing those things, language choice can make a huge difference that may not be apparent without understanding how they work or at least seeing visualizations like this.

But even in C and Rust, most applications don't have to do this kind of thing. Most are dominated by smaller loops, with branching that breaks pipelining, complexity that busts cpu cache, calls and jumps that might interfere with both etc. In those much more common cases, the practical performance difference is going to be quite a lot less significant, and the fluency/ergonomics/safety/convenience of working in some other language can easily matter more.

PaulHoule 2 days ago

I like the extreme boost that PyPy gets here, which is much more than you usually get with PyPy. On one hand it shows the potential of PyPy, but it is testing just one area where PyPy is strong; other workloads depend on performance that is not as well optimized.

My take, though, is that branchy, old-AI, heavy code (say, keeping every fact in a triple store rather than as a Python object, with the benefit of having an index for every permutation of ?s-?p-?o) that cannot be handled with numpy or similar tools benefits greatly from PyPy, but not 100x.

omoikane 1 day ago

I thought the emphasis was on the visualization part, and not so much on the operations that are being benchmarked. Most benchmarks are presented as static bar charts or similar, but this particular site presented them as moving circles, which I thought was a very effective illustration of speed.

liontwist 1 day ago

Programming style is as much a determinant of pipeline- and cache-friendly code as the domain is. It's a set of techniques that can be learned.

sysread 2 days ago

I tried this with Elixir, and on my M3 Mac it ran in 6.5s. Perl took over 40s :P

  #!/usr/bin/env elixir
  u =
    case System.argv() do
      [arg] -> String.to_integer(arg)
      _ -> raise "Please provide a valid integer as the first argument."
    end

  r = :rand.uniform(10_000) - 1

  a = :array.new(10_000, default: 0)

  a =
    Enum.reduce(0..9_999, a, fn i, acc ->
      inner_sum =
        Enum.reduce(0..99_999, 0, fn j, sum ->
          sum + rem(j, u)
        end)

      updated_value = :array.get(i, acc) + inner_sum + r
      :array.set(i, updated_value, acc)
    end)

  IO.puts(:array.get(r, a))

tmtvl 2 days ago

Oh, the programs are supposed to take an integer from the command line? Then I wonder how many of them would fail when given an integer that's just a wee bit too large (like, say (2^129)-1).

hatthew 2 days ago

I don't think that's a concern. My assumption is that the point of taking user input is to prevent the compiler from precomputing the result.

JohnMakin 2 days ago

Looking at the source code, is the random value generated before the main loop iterations to stop the compiler from shortcutting all the iterations? Does that work on some of the better compilers out there? The random value is only generated once. I would like to see a similar benchmark on creating/destroying objects in memory.

neonsunset 2 days ago

If you are interested in looking into the details of microbenchmarks of this type, there is always https://benchmarksgame-team.pages.debian.net/benchmarksgame/... which has a clearly defined methodology and lets you understand wall-clock time vs CPU cost too.

The benchmark code linked in the submission unfortunately suffers from the mistakes that happen whenever you write a benchmark for the first time. The visuals are pretty but might as well have been based on something more comprehensive.

(also, please stop measuring performance with loops that perform no work and fibonacci sequences, this was bad 30 years ago, this is bad today too)

JohnMakin 2 days ago

Cool, thanks! In another lifetime I had an internship at an embedded software company, which was all about writing tests and optimizing compute or throughput for various chips and chip firmware, and reading the functions here set off some alarm bells in my head. Still a fun idea though, and cool graphics.

swatcoder 2 days ago

Yeah, the random value prevents compile-time pre-calculation and the print makes sure the loops actually get run and values actually get calculated (as modern compilers are wont to simply omit code that clearly doesn't have consequence).

o11c 2 days ago

Note that these benchmarks ignore object-related stuff, which is usually the expensive part.

ramshanker 1 day ago

Personally, I have gone at most 5 levels deep. It worked reasonably well with aggressive break-condition checking in my specific case.

How deep have people actually written useful nested loops?

fbn79 2 days ago

Would be nice to know max RAM usage by each implementation.

MrLeap 1 day ago

I would have loved to see C# on this list. Wonder if he's accepting PRs.

neonsunset 1 day ago

There are... a lot of PRs (4 of which, as of this moment, add C#).

pbw 2 days ago

I wrote this to probe the question of why “slow” languages are popular and valid when used in conjunction with fast ones: https://tobeva.com/articles/beyond/

eiathom 2 days ago

It's interesting to imagine the head scratching going on in this thread. All these clever people, stumped. 'How can this be!???'. Quite the post, bravo OP.

But seriously, other comments have this nailed. This 'benchmark' isn't comparing apples with apples. There are too many quirks, optimisations and design decisions (of a compiler) that have not been taken into account.

Also, as others have repeated, this isn't indicative of actual usage - real-world usage. In that regard, this type of 'benchmark' isn't scientific at all and, worse, is misleading to a naive reader.

Dylan16807 2 days ago

While it's not (necessarily) indicative of the languages as a whole, it looks pretty apples to apples to me.

These are very simple programs with no fancy tricks going on.

hatthew 2 days ago

It's comparing a Granny Smith apple to a McIntosh apple by hitting them with a baseball bat and inferring the texture of the apple from that. It uses them in a way that almost nobody does in real life, and it's a vague proxy for only one of several important factors. If you use this test as a basis for deciding which you would like, you're going to have a bad time, whereas you'll be fine if you merely look at what other people like.

Dylan16807 1 day ago

Is the "way" it's using things affecting performance significantly?

If not then it's a pretty good proxy to one or two dimensions of performance, and a mix of similar benchmarks could paint a good fraction of a reasonable picture.

hatthew 1 day ago

If you want to buy a bushel of apples and hit them with an array of objects at varying velocities and then try to construct a model that can use the resulting splatter patterns to infer the texture of the apple as experienced by humans, you're welcome to do so.

Personally, I'm going to bite the apple.

Dylan16807 1 day ago

I do not accept your analogy.

This program is a perfectly reasonable "bite". It's not doing weird stuff.

hatthew 1 day ago

The program is a useless toy that nobody would write and expect to be performant in real life. I can write an optimized version of the algorithm that prints the exact same output (barring edge cases) without looping. With this new algorithm (really just a single expression), the C program is 500x faster than before and the python script is 5000x faster than before. So python is only 10x slower than C in this example (and I'd imagine that this is measuring more overhead and less arithmetic).

Presumably the C compiler has a relative advantage at optimizing the looped algorithm vs the arithmetic algorithm. How much? Is C 100x faster than python, or was the compiler able to get closer to the non-looping algorithm than python? How close? Is it something about the arithmetic or the looping that's natively faster in C? What part of it was optimized by the compiler? Since nobody does large loops in python, would python be more comparable if we introduced a C-backed library like numpy?

What other objects do you think we should bash this apple with?

Dylan16807 1 day ago

Do you think a slightly more complex piece of math that can't be replaced with an equation would have significantly different performance from the benchmark?

I don't. So the simple math just makes things easier, it doesn't harm results.

Your suggestion to skip the math entirely is the apple-smash.

> Since nobody does large loops in python

Oh I doubt that.

Numpy would be fine as a different entry.

hatthew 1 day ago

I don't know whether or not a more complex piece of math would change the outcome, so I would err on the side of not making assumptions until I had done more benchmarks and investigated the inner workings more. You probably know more about C than I do and might be able to make that assumption more confidently, but I think most people know less than either of us, and will read too much into a single statistic without context.

Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.

Dylan16807 1 day ago

> Nobody who cares about performance enough to look at a benchmark like this would do large loops in pure python.

Because they already know it would be very slow, and putting specific numbers on that is helpful for them and for non-veterans too. Even with some wiggle room for "microbenchmarks are weird".

hatthew 1 day ago

I already know that python is generally slower than C. This microbenchmark tells me that python is 100x slower than C, but only in these specific circumstances that wouldn't happen in the real world, but may or may not be similar enough to the real world to infer that C could be 10^(2±1) faster, but only for a small portion of someone's program, and only if they don't use one of the libraries that they probably do use.

Personally, I find it misleading for the results of a benchmark to be posted when it's not made clear up front what the benchmark is measuring. At minimum I'd want a disclaimer that mentions at least one of the first two points mentioned in the top comment https://news.ycombinator.com/item?id=42250205

Dylan16807 1 day ago

You keep acting like this code is super far from the real world, but it isn't. It's representative of basic number crunching just fine.

The one serious caveat is that someone that already knows CPython is godawful at this will switch to numpy, but that reinforces what the benchmark says.

None of the things in the post are going to change the order of magnitude here.

fifilura 2 days ago

Doing Python or R right means using vector/matrix operations in e.g. numpy. Those will be faster.

Just think twice before using a for loop. Most of the time it is not needed.
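
For this particular benchmark, the vectorized version is a couple of lines (a sketch mirroring the loop shape from the Elixir port above; u and r are stand-ins for the argv/random inputs):

    import numpy as np

    u, r = 7, 42                                  # stand-ins for the real inputs
    inner = int((np.arange(100_000) % u).sum())   # the 100k inner loop, vectorized
    a = np.full(10_000, inner + r)                # each outer iteration adds the same sum
    print(a[r])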

nickforr 2 days ago

If you write the R code using vectorised operations it's orders of magnitude faster (148 seconds down to 1.5 milliseconds on posit.cloud).

The nested loop becomes:

    j <- sum((seq_len(100000) - 1) %% u)
    a <- a + j + r

MooseBurger 2 days ago

I disagree w.r.t. python. You shouldn't need a numerical computing library to do a for loop generally.

fifilura 2 days ago

Kick the habit. For loops are what "goto" was 20 years ago: error-prone and not possible to abstract, for example for parallel execution.

hatthew 1 day ago

I think the comparison to goto is a little extreme, but actually not too crazy. I could see a future where the accepted best practice is to use parallelized operations for everything, and only use sequential loops in cases where the underlying concept is inherently sequential (gradient descent, iterative hashing, etc.).