How is Ultrassembler so fast?

(jghuff.com)

124 points · netr0ute · 8mo ago · 51 comments

vlovich1238mo ago

> Additionally, in C++, requesting that heap memory also requires a syscall every time the container geometrically changes size

That is not true - no allocator I know of (and certainly not the default glibc allocator) allocates memory in this way. It only does a syscall when it doesn’t have free userspace memory to hand out but it overallocates that memory and also reuses memory you’ve already freed.
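
As a rough illustration of the distinction being drawn here, the sketch below counts how many times a `std::vector` moves to a larger buffer while it grows. The function name `count_reallocations` is made up for this example, and the growth factor is implementation-defined. Note that this counts calls into the allocator, not syscalls: whether any given call reaches the kernel is up to the allocator, as the comment above says.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how often push_back had to move the vector
// to a bigger buffer. With reserve_first set, capacity is requested once
// up front, so the loop itself should never reallocate.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> values;
    if (reserve_first) {
        values.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = values.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        values.push_back(static_cast<int>(i));
        if (values.capacity() != last_capacity) {
            // Capacity changed, so the elements were moved to new storage.
            ++reallocations;
            last_capacity = values.capacity();
        }
    }
    return reallocations;
}
```

Because growth is geometric, the count for 100,000 elements is in the tens rather than the thousands, and it drops to zero when the size is reserved in advance.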

idiomat90008mo ago

Wasn't there also a scheme that overallocates for the first geometric expansion and marks the second as space for likely short-lived objects?

aidenn08mo ago

Exceptions in C++ are never zero-overhead. There is a time-space tradeoff for performance of uncaught exceptions, and G++ picks space over time.

mpyne8mo ago

There's a time-space tradeoff to basically any means of error checking.

Including checking return codes instead of exceptions. It's even possible for exceptions as implemented by g++ in the Itanium ABI to be cheaper than the code that would be used for consistently checking return codes.

Joker_vD8mo ago

Actually, there has been some research into building exceptions on the "basically, passing std::exception* into every function and checking what's inside it on every return" idea, and it was about as fast as the traditional table-based unwinding, took way less space in the executable, and re-throwing exceptions was actually faster [0][1]

[0] https://news.ycombinator.com/item?id=22483028

[1] https://www.research.ed.ac.uk/portal/files/78829292/low_cost...

mpyne8mo ago

Yeah, it comes down to modeling assumptions on how many different types of exceptions can be thrown, how many actually are thrown, and the shape of the control flow graph of a program at runtime.

You can find one style outperforms the other based on the circumstances of the program, and programmers worried about optimization may someday be able to choose between approaches to meet their performance goals instead of pretending that tables are inherently slow and return codes that they won't even fully implement are inherently fast.

My point was simply that you can't just say "oh but exceptions are not zero-cost" without actually comparing to the alternative of laboriously carting return codes all through the call graph, as done in the research you show here and as also done by Khalil Estell elsewhere for ARM embedded.
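
To make the comparison concrete, here is a minimal sketch of the two code shapes under discussion: one that carries failures back through return values, and one that throws. The function names and the toy "x0 to x31" register parser are assumptions made for illustration; nothing here is measured, and it says nothing about which style is faster in a given program.

```cpp
#include <optional>
#include <stdexcept>
#include <string_view>

// Style 1: return codes. Every level checks and forwards the failure itself.
std::optional<int> parse_digit_rc(char c) {
    if (c < '0' || c > '9') return std::nullopt;
    return c - '0';
}

std::optional<int> parse_register_rc(std::string_view text) {
    if (text.size() < 2 || text.size() > 3 || text[0] != 'x') return std::nullopt;
    int value = 0;
    for (char c : text.substr(1)) {
        const std::optional<int> digit = parse_digit_rc(c);
        if (!digit) return std::nullopt;  // the check a caller must not forget
        value = value * 10 + *digit;
    }
    if (value > 31) return std::nullopt;
    return value;
}

// Style 2: exceptions. The happy path has no explicit checks between levels.
int parse_digit_ex(char c) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    return c - '0';
}

int parse_register_ex(std::string_view text) {
    if (text.size() < 2 || text.size() > 3 || text[0] != 'x') {
        throw std::invalid_argument("not a register name");
    }
    int value = 0;
    for (char c : text.substr(1)) {
        value = value * 10 + parse_digit_ex(c);  // failure unwinds past this line
    }
    if (value > 31) throw std::out_of_range("register index above 31");
    return value;
}

// The single place where the exception style pays for its error handling.
bool accepts_ex(std::string_view text) {
    try {
        parse_register_ex(text);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

In the first style the cost is a branch at every call site, whether or not anything fails; in the second it is the unwind tables plus whatever a throw costs when one actually happens.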

netr0uteOP8mo ago

> G++ picks space over time

By definition, that's zero-overhead because Ultrassembler doesn't care about space.

aidenn08mo ago

Okay, then a traditional setjmp/longjmp implementation is zero-overhead because I don't care about space or time!

netr0uteOP8mo ago

Hi everyone, I'm the author of this article.

Feel free to ask me any questions to break the radio silence!

benreesman8mo ago

Nice work and good writeup. I think most of that is very sound practice.

The codegen switch with the offsets is in everything, first time I saw it was in the Rhino JS bytecode compiler in maybe 2006, written it a dozen times since. Still clever you worked it out from first principles.

There are some modern C++ libraries that do frightening things with SIMD that might give your bytestring stuff a lift on modern stupid-wide high mispredict penalty stuff. Anything by lemire, stringzilla, take a look at zpp_bits for inspiration about theoretical minimum data structure pack/unpack.

But I think you got damn close to what can be done, niiicccee work.

Sesse__8mo ago

FWIW, this is basically an implementation of perfect hashing, and there's a myriad of different strategies. Sometimes “switch on length + well-chosen characters” are good, sometimes you can do better (e.g. just looking up in a table instead of a long if chain).
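
For readers who have not seen the "switch on length + well-chosen characters" strategy, a hand-written sketch follows. It covers only seven RV32I mnemonics, and the `Op` enum and `lookup` function are invented for this example; it is not the code Ultrassembler generates.

```cpp
#include <string_view>

enum class Op { Unknown, Add, And, Or, Sll, Srl, Sub, Xor };

// Hypothetical lookup: branch on the length first, then on the characters
// that best split the remaining candidates, and confirm the rest.
Op lookup(std::string_view m) {
    switch (m.size()) {
    case 2:
        return m == "or" ? Op::Or : Op::Unknown;
    case 3:
        switch (m[0]) {
        case 'a':
            if (m == "add") return Op::Add;
            if (m == "and") return Op::And;
            break;
        case 's':
            // Three candidates share the first letter, so split on the second.
            switch (m[1]) {
            case 'l': if (m[2] == 'l') return Op::Sll; break;
            case 'r': if (m[2] == 'l') return Op::Srl; break;
            case 'u': if (m[2] == 'b') return Op::Sub; break;
            }
            break;
        case 'x':
            if (m == "xor") return Op::Xor;
            break;
        }
        break;
    }
    return Op::Unknown;
}
```

A table-driven alternative would replace the nested branches with a single indexed load, which is the trade-off the comment above alludes to.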

The “value speculation” thing looks completely weird to me, especially with the “volatile” that doesn't do anything at all (volatile is generally a pointer qualifier in C++). If it works, I'm not really convinced it works for the reason the author thinks it works (especially since it refers to an article talking about a CPU from the relative stone age).

inetknght8mo ago

Overall, this is a fantastic dive into some of RISC-V's architecture and how to use it. But I do have some comments:

> However, in Chata's case, it needs to access a RISC-V assembler from within its C++ code. The alternative is to use some ugly C function like system() to run external software as if it were a human or script running a command in a terminal.

Have you tried LLVM's C++ API [0]?

To be fair, I do think there's merit in writing your own assembler with your own API. But you don't necessarily have to.

I'm not likely to go back to assembly unless my employer needs that extra level of optimization. But if/when I do, and the target platform is RISC-V, then I'll definitely consider Ultrassembler.

> It's not clear when exactly exceptions are slow. I had to do some research here.

There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah. There's also other C++ conferences that have similar presentations (or even, almost identical presentations because the presenters go to multiple conferences), though I don't have a link handy because I pretty much only attend cppcon.

[0]: https://stackoverflow.com/questions/10675661/what-exactly-is...

[1]: https://www.youtube.com/results?search_query=cppcon+exceptio...

netr0uteOP8mo ago

> LLVM's C++ API

I think I read something about this but couldn't figure out how to use it because the documentation is horrible. So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing) probably because nobody is using the C++ API.

> There are plenty of cppcon presentations [1] about exceptions, performance, caveats, blah blah.

I don't have enough time to watch these kinds of presentations.

mpyne8mo ago

A specific presentation I'd point to is Khalil Estell's presentation on reducing exception code size on embedded platforms at https://www.youtube.com/watch?v=bY2FlayomlE

But honestly you'd get the vast majority of the benefit just by skimming through the slides at https://github.com/CppCon/CppCon2024/blob/main/Presentations...

With a couple of symbols you define yourself a lot of the associated g++ code size is sharply reduced while still allowing exceptions to work. (Slide 60 on)

0x988mo ago

> I think I read something about this but couldn't figure out how to use it because the documentation is horrible.

Fair enough.

> So, I found it easier to implement my own, and as it turns out, there are a few HORRIBLE bugs in the LLVM assembler (from cross reference testing)

Interesting claim, do you have any examples?

inetknght8mo ago

> I don't have enough time to watch these kinds of presentations.

Then let me pick and share some of my favorites that I found enlightening, and summarize with some information that I found useful.

By far, the most useful one is Khalil Estell's presentation last year [0]. It's a fairly fast paced but relatively deep dive into exception mechanics. At the end, he advocates for a new tool that would audit a program to determine what exceptions could be thrown. I think that's a flipping fantastic idea for a tool. Unfortunately I haven't seen any progress toward it -- if someone here knows where his tool is, or a similar tool, please reply! I did send him an email a few months ago inquiring about it, but haven't received a reply. Nonetheless, the whole presentation was excellent in my opinion. I did see that he had another related presentation at ACCU this year [4] with a topic of "C++ Exceptions are Code Compression" (which I totally can believe -- I've seen it myself in binary sizes), but I haven't seen his presentation yet. I'll watch it later today.

Just about anything from Herb Sutter is good. I don't like that he works for Microsoft, but he does great stuff for C++, including the old Guru of the Week series [1]. In particular, his 2019 presentation [2] describes different error handling techniques, some difficulties and pitfalls in combining libraries with different error handling techniques, and leads up to explaining why std::expected came about. He does pontificate a lot though, so the presentation is fairly high level and slow paced.

Dave Watson's 2017 presentation [3] dives into a few different implementations of stack unwinding. It's good to understand how different compilers implement exceptions with low- or zero-cost overhead and what that "overhead" is really measuring.

So, there's about a half of a day of presentations to watch here. I hope that's not too much for you.

[0]: https://www.youtube.com/watch?v=bY2FlayomlE

[1]: https://herbsutter.com/gotw/

[2]: https://www.youtube.com/watch?v=ARYP83yNAWk

[3]: https://www.youtube.com/watch?v=_Ivd3qzgT7U

[4]: https://www.youtube.com/watch?v=LorcxyJ9zr4

NooneAtAll38mo ago

isn't your MemoryBank already somewhere in std::pmr?

If I'm honest, I've never looked into pmr, but I always thought that that's where std has arena allocators and stuff

https://en.cppreference.com/w/cpp/header/memory_resource.htm...
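
For reference, the arena-style resource in `std::pmr` is `std::pmr::monotonic_buffer_resource`. The sketch below shows the general idea with a fixed 4 KiB buffer; the function name `fits_in_arena` is made up, and how closely this matches the article's MemoryBank is an assumption, since its design is not described in this thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <new>
#include <vector>

// Hypothetical helper: tries to build a vector of `count` words inside a
// fixed 4 KiB arena. The upstream is null_memory_resource(), so running
// out of arena space throws std::bad_alloc instead of touching the heap.
bool fits_in_arena(std::size_t count) {
    alignas(std::max_align_t) std::array<std::byte, 4096> storage{};
    std::pmr::monotonic_buffer_resource arena(
        storage.data(), storage.size(), std::pmr::null_memory_resource());
    try {
        std::pmr::vector<std::uint32_t> words(&arena);
        words.reserve(count);  // one bump allocation out of the arena
        for (std::size_t i = 0; i < count; ++i) {
            words.push_back(static_cast<std::uint32_t>(i));
        }
        return words.size() == count;
    } catch (const std::bad_alloc&) {
        return false;  // the request did not fit in the fixed buffer
    }
}
```

A monotonic resource never reuses freed blocks; everything is released at once when the resource is destroyed, which is what makes allocation a simple pointer bump.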

msla8mo ago

What's the difference between a Programming Furu and a Programming Guru? Is there a joke I'm missing?

netr0uteOP8mo ago

Furus are "fake gurus." It comes from the Fintwit space where "furus" share their +1000% option trades as if they're geniuses in order to get you to sign up for their expensive Substack.

jclarkcom8mo ago

You might look into using memory mapped IO for reading input and writing your output files. This can save some memory allocations and file read and write times. I did this with a project where I got more than 10x speed up. For many cases file IO is going to be your bottleneck.

Sesse__8mo ago

mmap-based I/O still needs to go through the kernel, including memory allocation (in the page cache) and all. If you've got 10x speedup from mmap, it is usually because your explicit I/O was very inefficient; there are situations where mmap is useful, but it's rarely a high-performance strategy, as it's really hard for it to guess what your intended I/O patterns are just from the page faults it's seeing.

jclarkcom8mo ago

Windows uses memory mapped IO for loading all executables because it allows a process to start executing after loading a few pages, even if the exe is megabytes in size. You can use the same approach to reduce latency by starting to assemble data before the rest of the file loads; the rest can be loaded using more efficient asynchronous mechanisms. Using it for output also means your process doesn't wait on flushes, since those are also async. And in memory constrained environments the OS doesn't have to write your data to swap; it can just reload it from the memory mapped file.
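
For anyone unfamiliar with the mechanism being debated in this sub-thread, a minimal POSIX sketch of reading a file through `mmap` follows (the Windows equivalent uses a different API and is not shown). The function name `count_lines_mmap` is made up for illustration, and the sketch makes no claim about whether this is faster than `read` for any particular workload.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: counts '\n' bytes in a file by mapping it read-only.
// Returns -1 if the file cannot be opened, inspected, or mapped.
long count_lines_mmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat info{};
    if (fstat(fd, &info) != 0) {
        close(fd);
        return -1;
    }
    if (info.st_size == 0) {
        close(fd);
        return 0;  // mmap rejects zero-length mappings, so handle this first
    }

    const std::size_t length = static_cast<std::size_t>(info.st_size);
    void* mapping = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (mapping == MAP_FAILED) return -1;

    // Pages are faulted in by the kernel as this loop touches them.
    const char* bytes = static_cast<const char*>(mapping);
    long lines = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (bytes[i] == '\n') ++lines;
    }
    munmap(mapping, length);
    return lines;
}
```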

IshKebab8mo ago

Neat, but it's not like assembly is really a bottleneck in any but the most extreme cases. LLVM and GAS are already very fast.

I feel like this might mostly be useful as a reference, because currently RISC-V assembly's specification is mostly "what do GCC/Clang do?"

drob5188mo ago

Exactly. I don’t know too many assembly language programmers who are griping about slow tools, particularly on today’s hardware. Yeah, Orca/M on my old Apple II with 64k RAM and floppy drives was pretty slow, but since then not so much. But sure, as a fun challenge to see how fast you can make it run, go for it.

CyberDildonics8mo ago

ASM should compile at hundreds of MB/s. All the ASM you could write in your entire life will compile instantly. There is no one in decades that has thought their assembler is too slow.

benreesman8mo ago

ptxas comes to mind.

gdiamos8mo ago

ptxas is a bit of a misnomer - it actually wraps the entire NVIDIA driver backend compiler

PTX isn’t the assembly language, it is a virtual ISA, so you need a full backend compiler with 10s to 100s of passes to get to machine code

benreesman8mo ago

I appreciate that hitting sm_70 through sm_120 in one call isn't the same as hitting RISC-V in one call, but I do a lot of builds just for sm_120 which is closer to a fair comparison.

It's imperfect, but I take any excuse to point out how bad monopolies are for customers. All you have to do is build the driver to see that "low priority" is a pretty broad term on the allegedly elite trillion dollar toolchain.

I'm not saying CUDA is unimpressive, it's a very, very, very hard problem. But if they were in an uncorrupted market ptxas would be fast instead of devastating znver5 workstations with 6400MT DDR5.

throwaway815238mo ago

I wonder if you thought about perfect hashing instead of that comparison tree. Also, flex (as in flex and bison) can generate what amounts to trees like that, I believe. I haven't benchmarked it compared to a really careful explicit tree though.

netr0uteOP8mo ago

I thought about hashing, but found that hashing would be enormously slow to compute compared to a perfectly crafted tree.

dafelst8mo ago

But did you think about using a perfect hash function and table? Based on my prior research, it seems like they are almost universally faster on small strings than trees and tries due to lower cache miss rates.

dist1ll8mo ago

Ditto. Perfect hashing strings smaller than 8 bytes has been the fastest lookup method in my experience.
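
A small sketch of what a perfect-hash lookup for short strings can look like: each key is packed into a 64-bit integer, multiplied by a constant, and the top bits select a slot. Everything here is an assumption for illustration, including the key set, the 16-slot table, and the brute-force search for a collision-free multiplier at startup. It is not how gperf or any particular library builds its tables.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr std::array<std::string_view, 7> kMnemonics = {
    "add", "and", "or", "sll", "srl", "sub", "xor"};
constexpr std::size_t kSlots = 16;   // power of two, larger than the key count
constexpr unsigned kShift = 64 - 4;  // keep the top log2(kSlots) bits

// Pack up to 8 bytes into one integer so a key hashes with a single multiply.
constexpr std::uint64_t pack(std::string_view s) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < s.size() && i < 8; ++i) {
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i])) << (8 * i);
    }
    return v;
}

class MnemonicTable {
public:
    MnemonicTable() {
        // Try pseudo-random multipliers until every key lands in its own slot.
        std::uint64_t candidate = 0x9E3779B97F4A7C15ull;
        while (!try_build(candidate)) {
            candidate = candidate * 6364136223846793005ull + 1442695040888963407ull;
        }
    }

    // Returns the index into kMnemonics, or -1 if s is not a known mnemonic.
    int find(std::string_view s) const {
        if (s.size() > 8) return -1;
        const int index = slots_[(pack(s) * multiplier_) >> kShift];
        // One final comparison rejects strings that merely hash to a used slot.
        return (index >= 0 && kMnemonics[index] == s) ? index : -1;
    }

private:
    bool try_build(std::uint64_t multiplier) {
        slots_.fill(-1);
        for (std::size_t i = 0; i < kMnemonics.size(); ++i) {
            int& slot = slots_[(pack(kMnemonics[i]) * multiplier) >> kShift];
            if (slot != -1) return false;  // collision, this multiplier is no good
            slot = static_cast<int>(i);
        }
        multiplier_ = multiplier;
        return true;
    }

    std::array<int, kSlots> slots_{};
    std::uint64_t multiplier_ = 0;
};

int find_mnemonic(std::string_view s) {
    static const MnemonicTable table;  // built once on first use
    return table.find(s);
}
```

The lookup itself is one multiply, one shift, one table load, and one comparison, with no data-dependent branching over the key set.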

Sesse__8mo ago

You're probably thinking of gperf, not flex and bison.

sylware8mo ago

Oh, I remember I did a plain and simple C port of an old gperf, cgperf https://www.rocketgit.com/user/sylware/cgperf

Ofc, I did add my own bugs.

throwaway815238mo ago

I meant flex, for generating a switch table for that type of lexer. gperf is for hashing which is different. But, there may be better methods by now since the field has changed a lot.

StilesCrisis8mo ago

“Here's one weird trick I haven't seen anywhere else.” … describes a simplistic lexer. Hmm.

throwaway815238mo ago

I also have to ask where all this assembly code is coming from, that has to be compiled fast. If it's compiler output, maybe you could hack the compiler back end to generate tokenized assembly code, eliminating a lot of scanning and stuff. It would still be human readable through a simple program that converted the tokens back to mnemonics. The tokens could be 4 digit hex numbers or something like that, so it would still be an easily handled text file.

Lots of simple compilers generate object code directly instead of assembly code, so the above is not so bad by comparison.

stuaxo8mo ago

Nice to have something really fast, maybe we can have something as good as TurboPascal and other early Borland tools again.
