Wow - we're on Hacker News again! Hello all, and thanks for the link and discussion! While we're here, we recently made some stats public: https://stats.compiler-explorer.com/ (spot the Hacker News spike!)
I'm pretty sure you already know of sharplab.io. A lot of us C# devs prefer it over Compiler Explorer for a bunch of reasons, but the main one is really the speed.
My guess is that Sharplab gets its speed because it hooks directly into the JIT compiler itself, so there's no overhead from touching disk or invoking the linker.
Hi Genbox, thanks for the kind words. I've never heard of sharplab.io - the speed of C# compilation is not very high on the priorities of CE right now (though we have some issues filed, and some (stale) PRs to help). We welcome help improving it!
If you're talking about the speed of languages other than C#, then most of the time is spent in the compilers themselves, not the fairly limited pre- and post-processing we do. We have a pretty sophisticated setup (at least, I think it is...) for handling the 1,500 compiler/language combinations, hundreds of libraries etc (around 2TB of data), but there are always tradeoffs between fast start-up time, manageability and simplicity of adding new compilers, and the run-time performance of running the compilers.
A small tip when visiting godbolt: you can use the name of the language you're interested in as a subdomain, to get a page immediately set up for that language, rather than starting with the default C++. For example https://erlang.godbolt.org or https://rust.godbolt.org
It only works for languages, not compilers. _Almost_ everything at *.godbolt.org resolves to Compiler Explorer (I have some other projects on this domain), but *.compiler-explorer.com and *.godbo.lt always point at CE.
I'm not sure how many languages have alternative compilers available, but I suppose you could make that a feature request: a direct way to choose the compiler with a subdomain too. https://clang.c.godbolt.org Or maybe it would make more sense in the path: https://c.godbolt.org/clang
When I was still programming in C++, that tool was the go-to way to discuss compiler internals and language semantics with colleagues: just set up a minimal example and share it. Impressive and sad at the same time. Impressive for obvious reasons. Sad, because the language is so confabulated that there is no easy, concise way to talk about its semantics.
I also use it for this purpose, however I have come to hate the way other people use it for this. I have a colleague who has really no idea what he's talking about with respect to machine performance, and who did not have the requisite knowledge of how to peep at the assembly code of a given function with the standard tools like objdump, who now loves to send everyone godbolt links in slack, along with his suppositions about which function will be faster, based entirely on vibes (mostly, instruction count). This drives me up the wall. I wish there was some minimum height needed to ride godbolt.
I go to some pain in my talks to say "instruction count is not a good proxy for performance", but unfortunately folks do still use it. It's handy to say "hey, there's no loop in this output" or "this loop does 3 multiplies; the alternative does two and an add" or similar. It's a mixed blessing to have brought the compiler output to the masses, I can only hope it starts a useful learning process!
Nowadays any load from the L3 cache or from main memory takes much more time than any other instruction, leaving aside instructions that generate exceptions (which include many memory accesses, slowing them down further) and deprecated instructions kept for backwards compatibility that are executed as long microcode sequences.
Assuming that "caring about performance" = "you are in a tight loop": Here's a tool that simulates/visualizes instruction flow and data dependencies over multiple loop iterations.
Paste in assembly code, check "Trace Table" and run, then "Open Trace". Not sure if it will help with your annoying colleague, but it gives a much more concrete idea about how a processor will execute any given code.
Or, if you want to channel their energy into something slightly more direct, there's also https://quick-bench.com/ which allows easy micro-benchmarking. Still not guaranteed to be relevant to any real-world scenario, but more data-driven than "vibes".
Using Compiler Explorer to see how different compilers interpret the same code, or to understand the generated ABI, or to check whether various pragmas are working, etc., is a very good use of it. I suspect most compiler developers more or less have a Compiler Explorer tab permanently open at this point.
> I have a colleague who has really no idea what he's talking about with respect to machine performance, and who did not have the requisite knowledge of how to peep at the assembly code of a given function with the standard tools like objdump, who now loves to send everyone godbolt links in slack, along with his suppositions about which function will be faster, based entirely on vibes (mostly, instruction count).
It's hard to measure performance in realistic situations; it's easy to measure code size. I recently found myself wasting time doing micro-optimizations, encouraged by the feedback loop of measuring and reducing code size (in my case, not with Compiler Explorer, but with "cargo bloat", since I was working on a Rust project.)
I know you know this, but instructions are basically free compared to memory access, and random memory access is the worst. Linear scans over contiguous memory (per thread) generally optimize performance.
Instruction counts are only useful if everything is guaranteed to be in registers.
It can be tough even without the impact of the memory hierarchy. I've seen code where adding an extra instruction to the calculation made it faster. The extra instruction implicitly eliminated denormals, thus resulting in faster execution with some workloads on systems where operations on denormal values were slower than operations on other values.
It was a completely unnecessary instruction from a correctness perspective, because it had no effect on the answer. However, it was important for performance; removing the instruction made the calculation slower.
How $lang maps to assembly is half the picture: how assembly maps to CPU is the other half. We shouldn't blame ignorance of the latter on a tool for exploring the former.
I do totally get how some people learn just enough to be annoying. Generally I still think that's not a good reason to gatekeep them.
objdump sucks; source annotations via coloring make it at least 5x faster to read assembly, I don’t care how smart you are or how fluent you are in assembly. If your colleague is wrong for other reasons, that’s orthogonal.
Yes, but there's real value in exploration. I haven't touched C or assembly for a long time. Here's a cold read.
push rbp
this is going to take the contents of rbp and push it onto the top of the stack - this will probably also change the stack pointer
mov rbp, rsp
move goes left <-, like a = 5, not 5 = a.
so, copy the updated stack pointer into rbp
mov DWORD PTR [rbp-4], edi
now, I'm not 100% sure, but I believe this guy puts edi just under the value we pushed to the top of the stack
mov eax, DWORD PTR [rbp-4]
Take that value and put it into eax. I'm not 100% sure why it's not just mov eax, edi.
imul eax, eax
integer multiply, this is the part that does the actual work (eax = eax * eax).
pop rbp
restore rbp (which we messed with)
ret
and we're done.
there are at least three holes in my understanding - but those three are not _that_ hard to track down.
1, does the stack pointer actually auto increment? (I think it does)
2, imul overflow and setting sign flags and such. - that shouldn't be hard to run down.
3, what is the c calling convention? it looks like the argument is top of stack, but also in edi - is that shuffling really needed? I think there's a bucket of implicit behavior there that's kinda scary.
I would _hope_ that, unless linking to a library, whatever called this just did the imul eax, eax itself.
My understanding may be deeply flawed, but explaining my assumptions and my understanding does two things.
1, it helps me learn.
2, it helps others re-evaluate their assumptions and possibly see from a different viewpoint.
I'm not saying spam compiler lists. But a clear and well thought out question can certainly advance discussion. It forces people to formalize their assumptions.
The default godbolt page runs the compiler with no flags, which means without any optimizations. This explains why the code unnecessarily shuffles values to the stack and back: unoptimized clang/llvm output spills everything to the stack, since register allocation is an optimization.
With -O3, the code is:
imul edi, edi
mov eax, edi
ret
Yep, the calling convention for x86-64 on Linux and macOS (System V) passes the first six integer arguments in rdi, rsi, rdx, rcx, r8, and r9, with further arguments passed on the stack.
Having originally learned the basics of assembly on the chronically register-deprived x86, it took me a while to get used to the fact that standard CCs now pass things in registers (and rsi and rdi in particular, retaining their ancient names while being completely general-purpose these days).
And user netch on Stack Overflow wrote this, which explains more:
notice also there is a 128-byte space (the "red zone") below %rsp that is not guaranteed to survive function calls, but is preserved by the OS during interrupts and signal handlers. So, very temporary values (between function calls) can be kept at negative offsets from %rsp. Not all compilers utilize this.
About this code (note the opposite operand order: there are two main styles of displaying x86 assembly, and this is AT&T syntax, where the source operand comes first):
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
the comment was:
Compiler allocates some space for local values on function entry. That's why it subtracts a value from %rsp on entry. This doesn't depend on whether %rbp is used as a frame pointer. After that, this space is used with positive offsets from %rsp. Also, if this function calls another one, %rsp must be aligned on a 16-byte boundary at each call, so in that case the compiler shall subtract 8 from %rsp on entry.
1. You can figure out things about the assembly even without understanding assembly (e.g. lines of source translating into 0 lines of output vs many lines of output).
2. You have labels.
3. You can figure out some of the assembly on your own. Say: `mov %r1, %r2` - it probably moves what's in entity %r1 into a similar entity %r2, or vice versa. etc.
4. You can see what the executable outputs
5. and most importantly: You can read compiler warnings and errors...
Most compiled languages have long offered switches to generate assembly.
Even on the JVM and .NET there are ways to dump it. On the various JVM implementations it requires a plugin if you're not using a debug build; on the .NET side, you can use "show assembly" in Visual Studio, or use WinDbg with the SOS plugin.
I think it has a few other languages supported now too. And/or there's equivalents for other languages.
I think most of the confabulation of C++ is necessary to get the semantics needed for it to work right. Especially with all of the optimizations that compilers are expected to make. I found the reasoning behind switching from just rvalues/lvalues to the 5+ types they have now to be fascinating, for example.
I've used Compiler Explorer for many years as a C++ developer. When I started working in HIP, I really missed having Compiler Explorer in my toolbox. I've been on leave for the past couple of months and took the opportunity to make some contributions outside my normal work. Consequently, full support for compiling HIP to AMDGPU assembly was merged last week.
I hadn't realized it until I started working with him back in the day and saw his name on Slack.
Wrote him a message and told him how much we appreciated his work at my previous workplace and how we used it all the time to settle debates about C++ code.
Wonderful guy, back then he was very pleasantly surprised that people actually used his website.
It's great that you can now get RISC-V output. IMO RISC-V is the most pleasant way to learn assembly-level programming. For anyone interested here's a nice resource:
A bit unrelated, I've been learning ARM64 assembly out of the convenience of having an ARM machine.
It's also been pleasant, I've been planning on learning RISC-V next. The only device I have access to though is the ESP32C3, so I don't know how far I'll go with it.
Outside of decompiling some code on Godbolt, peeking into the assembly on VS Code, I've also been practising with Rust's inline assembly, quite pleasant.
Do you have a good tutorial or intro for ARM64 or Apple Silicon assembly? I'd like to learn but the books I have are all for MIPS, and the online resources are hit or miss.
I'm delighted to say we get a decent income from Patreon supporters and GitHub sponsors, as well as a little from commercial sponsors. It covers our running costs, and leaves some left over to save for contingency, and reward some contributors (myself included!).
I don't know what an Optiver salary would be (though I can guess: I work in finance too), but no, it's absolutely not a living wage :), lucky though I am to have anything for an Open Source project.
I do it as it's something I feel strongly about; it's got my name (and reputation!) staked on it (not as I'd planned it); and it's given me enormous opportunities too!
There's a CPPCon talk - which I'm sure somebody else can remember and link - where Matt explains how this started. IIRC he was initially wondering whether C++ iterators actually cost the same as the C-style for loops you'd once have used instead, with an index counting up or a pointer increasing. If iterators were slower then in Matt's industry they'd be useless, but if they're the same speed then the improved clarity of what you mean is valuable.
Obviously the results will be identical but because the iterators look fancy (and are easier to think about) maybe there's some object getting constructed, it might be a lot heavier - right? Nope, same machine code. Matt initially did this much more manually, the Godbolt web site is just that same idea getting further enriched over time.
This is even more striking for something like Rust's iterators, which don't look at all like a C-style for loop: there's a call to make the iterator, IntoIterator::into_iter(), and then repeated calls to its next() method. Sounds expensive - but nope, once again the optimiser can see what's going on and emit the same machine code. Having a tool like Godbolt to confirm (or sometimes refute) the belief that these things are actually the same is really useful. Even after confirming that optimisation is needed, if the proposed "optimisation" doesn't change the machine code, it wasn't an optimisation, just a way of making the program needlessly harder to understand.
> This is even more striking for something like Rust's iterators that don't look at all like it's just the C-style for loop, there's a call to make the iterator IntoIterator::into_iter() and then repeated calls to its next() method, sounds expensive - but nope, once again the optimiser can see what's going on here and emit the same machine code.
> Having a tool like Godbolt to confirm (or sometimes refute) the belief that these things are actually the same is really useful, as even after confirming that optimisation is needed if the proposed "optimisation" doesn't change the machine code it wasn't an optimisation, just making the program needlessly harder to understand.
I'm not sure I understand this. Wouldn't you expect higher-order code to be easier to optimise, since it comes closer to telling the compiler what you want to do, so that the compiler can figure it out, rather than forcing the compiler to divine the big-picture intention of a bunch of low-level instructions?
And, if high-level code generates exactly the same machine code as low-level code, isn't that an argument in favour of high-level code—it lets you code declaratively, saying what you mean—rather than against high-level code? An optimisation might be optimising for intelligibility, not just run-time … and, while an experienced low-level C programmer might find the low-level code more readable, surely the non-expert maintenance programmer who comes afterwards will be grateful not to have to recognise the low-level patterns but rather have them spelled out in high-level code?
> I'm not sure I understand this. Wouldn't you expect higher-order code to be easier to optimise, since it comes closer to telling the compiler what you want to do, so that the compiler can figure it out, rather than forcing the compiler to divine the big-picture intention of a bunch of low-level instructions?
This assumes a "sufficiently smart compiler", whereas if you write the exact low level basic for loop (or unroll it manually or whatever) you want you can be confident that even a pretty dumb compiler is still going to output something close to optimal.
It just seems like most compilers are "sufficiently smart" for this sort of thing nowadays (and have been for quite some time). But it's not always the case, and the code you think would obviously be optimized by the compiler isn't always. So it pays to check these things if you're writing code for something where potentially eking out extra performance really matters.
> Wouldn't you expect higher-order code to be easier to optimise, since it comes closer to telling the compiler what you want to do, so that the compiler can figure it out, rather than forcing the compiler to divine the big-picture intention of a bunch of low-level instructions?
Optimization is an NP-hard problem. What compiler backends mostly do these days is pattern-match known optimizable code blocks. Some of the other optimizations are approximations of the actual solution. The order in which the optimization passes run also affects the result.
So in a perfect world where we could solve NP-hard problems, higher-level code (with more constraints put on it -- as in Rust traits, not C++ templates) would be easier to optimize. But since we don't live in that utopia, nope.
> > Wouldn't you expect higher-order code to be easier to optimise, since it comes closer to telling the compiler what you want to do, so that the compiler can figure it out, rather than forcing the compiler to divine the big-picture intention of a bunch of low-level instructions?
> Optimization is an NP-hard problem. What compiler backends do these days is mostly to pattern match known optimizable code blocks. Some of the other optimizations are an approximation of the actual solution. The order of optimization type being made also affects the result.
Right, and that's what I meant—although I certainly see why it sounded like I was referring to some infallible and perfect optimisation process.
To be precise—and sticking with the theme of iterators from my parent, though there's nothing particularly special except that it's a familiar pattern—if there's one high-level iterator construct, isn't it more likely that the average programmer will write each invocation of an iterator in the way that the compiler expects; whereas, if each user has to roll their own iterator, then different average programmers will roll different iterators, and it's more likely that a programmer will write something so baroque that the compiler doesn't realise it can apply a known optimisation?
What a nested rust iterator is doing, semantically, is creating a struct containing a struct containing a struct... that calls a method that calls a method that calls a method. The compiler has to do a lot of work to make that into anything approaching a C for loop. Just try running unoptimized rust code and you'll see.
I recently learned that it's not too difficult to run Compiler Explorer locally on your machine: clone it from GitHub and run the appropriate makefile and npm commands. Recently I've been using Compiler Explorer a lot like this. It's easier to use your own header files, and it's nice not to have the extra latency of compiling remotely, or to worry that you're using so many of their CPU cycles. The main downside is that locally you can only test the versions of gcc and clang you have installed, while on the website you can test other versions and also other CPU architectures.
Awesome tool. I do mostly c++ all day and I use this almost daily, and certainly weekly. I think it’s really improved my feel for what the compiler will do with different constructs.
I'd like to share a command-line tool to interact with Compiler Explorer that I made: https://github.com/xfgusta/cexpl. It's written in Python and it's available on PyPI.
Seriously asking, how is it even possible to submit this to hn at this point? When one submits a previously submitted link, doesn't it just alias to the previous submission?
Compiler Explorer is such a wonderful tool. It made examining and comparing compiler outputs so much easier and now pretty much everyone interested in optimizations is using it.
Because it's working with arrays of double, which are 8 bytes wide, and the optimized loop doesn't use any scaled indexing; basically the loop index is pre-scaled to the element size. Perhaps those MMX instructions don't support addressing modes with index scaling (wild guess).
The unoptimized code increments by 1 to 65535, but the memory accesses use scaling. Well, not exactly. We see this:
This LEA here, though it stands for "load effective address", is not actually doing an effective-address calculation. The base address is zero, so this is just LEA being exploited to multiply RAX by 8 and get the result into RDX. RAX is then clobbered with the base address of an array, to which the scaled displacement is added before finally being used to make an access.
Is there a trick to make the typescript compiler run? All the other examples seem fine out of the box. No matter what I write, the ts compilation fails.
So! The typescript compiler doesn't leave us (currently) with any asm. We can only execute what it produces: https://godbolt.org/z/YvKe8ojvT for example.
If you click "show binary" that's pretty much what we do. Here's a link doing so, and also running `elfdump -a` on the linked output: https://godbolt.org/z/9EcxhedK4
It's better than the others, but also very unprofessional, like the other web stuff. It DOES link, but it's absolutely impossible to view the generated map. People are just hyping codegen-related micro-optimizations but don't care about layout at all.
Very typical of today's world, where a javascript coder counts as a software engineer.
General stuff: we're always looking for help; everything's open source on GH: https://github.com/compiler-explorer/ (the base project, our cloud setup, all our build scripts, etc). The most valuable way to help us is with issues and PRs, or hang out on our Discord (https://discord.gg/zNNgyRKh). Then spread the word, and last we welcome sponsors on GH (https://github.com/sponsors/mattgodbolt) or Patreon (https://www.patreon.com/mattgodbolt).