If you implemented your own parser for the schema and the JSON, and used only an AST for validation plus span information (which can be just a pair of u16s for the start/end of a token), then you could collect your errors very, very quickly and generate the error messages once validation is finished.
Heavily optimized compilers do this for semantic analysis and type checking: the IR they use can be constructed quickly and then combined with the input text to produce helpful messages, while the guts of the algorithm work only with semantic information in a structure that's compact and easy to access/manipulate.
All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.
I had a thought in my reply [0] on this that actually might let him eat his cake and have it too in this regard. I think you can heavily abuse serde::de::Visitor to schema validate without actually parsing (or with less parsing, at any rate). I went into more detail in my comment but I wanted to ping you (@duped).
Or, use sonic_rs::LazyValue if you want to parse json on demand as you traverse it.
struct Token {
    token_type: u8,
    type_info: u8,
    token_start: u32, // offset into the source file
    token_len: u16,
}
It's blazing fast. The lexer/parser can process millions of lines per second. The textual information is included, and the location information is included. It's been developed for embedded systems (it was written originally for a NATS implementation in the Zephyr RTOS), so it's a bit limited and there's no easy way to know where some parsing/type validation error happened, but the information is there if one wants to obtain it: https://github.com/lpereira/lwan/blob/master/src/samples/tec...
The fewer bytes you can pack data into, the more data you can process per second. Computers may be the most complex machines ever built, but the simple fact remains: having fewer things to touch means you can touch more things in the same amount of time.
- almost no cost in the non-erroring path
- no extra data structures to manage
- a lot less code
I think the post would benefit from a better analysis of why this doesn't work for them.
This idea was added after I wrote the post and wasn't taken from my own optimization efforts in `jsonschema`. Originally, the output type in `jsonschema` was actually a badly composed iterator, and I intended to simplify it to just a `Result<(), ValidationError>` for the article, but with this output there are actually far better optimizations than I originally implemented.
If I'd discovered this idea earlier, I'd probably have spent more time investigating it.
Two suggestions: it’s not immediately obvious whether subsequent benchmark result tables/rows show deltas from the original approach or from the preceding one (it’s the latter, which is more impressive). Maybe call that out the first couple of times?
Second, the “using the single Vec but mutating it” option would presumably benefit from a reserve() or with_capacity() call. Since in that approach you push to the vector in both erroring and non-erroring branches, it doesn’t have to be exact (though you could do a BFS to find maximum depth, that doesn’t strike me as a great idea) and could be up to some constant value, since a single block allocation is cheap in this context. (Additionally, the schema you are validating against defines a minimum depth whereas the default vec has a capacity of zero, so you’re guaranteed that it’s a bad choice unless you’re validating an empty, invalid object.)
But I agree with the sibling comment from @duped that actually parsing to JSON is the biggest bottleneck and simply parsing to the minimum requirements for validation would be far cheaper, although it depends on if you’ll be parsing immediately after in case it isn’t invalid (make the common case fast, assuming the common case here is absence of errors rather than presence of them) or if you really do just want to validate the schema (which isn’t that rare of a requirement, in and of itself).
(Edit: I do wonder if you can still use serde and serde_json but use the deserialize module’s `Visitor` trait/impl to “deserialize” to an `enum { Success, ValidationError(..) }` so you don’t have to write your own parser, get to use the already crazy-optimized serde code, and still avoid actually fully parsing the JSON in order to merely validate it.)
If this were in the real world, I would use a custom slab allocator, possibly from storage on the stack rather than the heap, to back the Vec (and go with the last design with no linked lists whatsoever). But a compromise would be to give something like the mimalloc crate a try!
> Two suggestions: it’s not immediately obvious whether subsequent benchmark result tables/rows show deltas from the original approach or from the preceding one (it’s the latter, which is more impressive). Maybe call that out the first couple of times?
Agree!
> Second, the “using the single Vec but mutating it” option would presumably benefit from a reserve() or with_capacity() call. Since in that approach you push to the vector in both erroring and non-erroring branches, it doesn’t have to be exact (though you could do a BFS to find maximum depth, that doesn’t strike me as a great idea) and could be up to some constant value, since a single block allocation is cheap in this context. (Additionally, the schema you are validating against defines a minimum depth whereas the default vec has a capacity of zero, so you’re guaranteed that it’s a bad choice unless you’re validating an empty, invalid object.)
Oh, this is a cool observation! Indeed, it feels like `with_capacity` would help here.
> But I agree with the sibling comment from @duped that actually parsing to JSON is the biggest bottleneck and simply parsing to the minimum requirements for validation would be far cheaper, although it depends on if you’ll be parsing immediately after in case it isn’t invalid (make the common case fast, assuming the common case here is absence of errors rather than presence of them) or if you really do just want to validate the schema (which isn’t that rare of a requirement, in and of itself).
My initial assumption was that usually the input is already parsed, e.g. validating incoming data inside an API endpoint which is then passed somewhere else in the same representation. But I think that is a fair use case too, and I was actually thinking of implementing it at some point via a generic `Json` trait which does not imply a certain representation.
> (Edit: I do wonder if you can still use serde and serde_json but use the deserialize module’s `Visitor` trait/impl to “deserialize” to an `enum { Success, ValidationError(..) }` so you don’t have to write your own parser, get to use the already crazy-optimized serde code, and still avoid actually fully parsing the JSON in order to merely validate it.)
Now that I've read the details, it feels like a really, really cool idea!
> If this were in the real world, I would use a custom slab allocator, possibly from storage on the stack rather than the heap, to back the Vec (and go with the last design with no linked lists whatsoever). But a compromise would be to give something like the mimalloc crate a try!
Nice! In the original `jsonschema` implementation the `validate` function returns `Result<(), ErrorIter>` which makes it more complex to apply that approach, but I think it still should be possible.
      node 0          node 1                    node M/N
    [N elem][.] -> [N elem][.] -> .... -> [M mod N elem][null]
And so if you want to index the `i`th element you chase the pointers from the head node:
    acc = 0
    while acc + N <= i
        acc += N
        node = node.next
    return node.array[i - acc]
Or something like that? You can do better using a B-tree and exploit the fact that you don't need keys to just store pointers in the branch nodes. This reduces the number of pointer dereferences for large arrays. For example, say you have N = 8 and there are 1024 elements in the list: you would need 127 pointer dereferences to reach the end of the array, but in a B-tree you only need about 3. (Double-check my math, I might be off by one.)
This is the typical implementation for immutable/persistent vector types. If you keep the links between leaf nodes at the bottom of the tree, you've got a B+ tree and fast iteration through it as well.
But you size the nodes dynamically, rather than using a fixed size.
I beg to disagree.
In kernels, drivers, and embedded systems they are very common.
Seems like the glaring exception to the rule!
Out of all the programmers in the world, what percentage of them do you think work in the kernel/driver/embedded spaces?
Second, linked lists are useful in a lot more places than that. Probably a better proxy would be low-level coders. You almost always want a linked list somewhere when you're dealing with memory addresses and pointers. Maybe not for the primary collections, but there are always secondary ones that need to maintain a collection with a different membership or ordering, and vectors of pointers don't have many clear advantages over intrusive linked lists for those secondary collections.
I did start to encounter some fresh grads with degrees that said "computer science" on them who couldn't answer some basic linked list questions. I was beginning to think it was a bad set of questions until I hit those kids. If you claim to know "computer science" and don't know what a linked list is, especially beyond some textbook stuff, I'm probably not interested.
In embedded world, memory often needs to be exactly controlled, and allocation failures are fatal without a more complex MMU. In kernel world, I believe the main reason is that allocations can block.
In addition, a lot of data structures might be shared across multiple cores. Linked lists can be traversed and mutated concurrently (although with a bit of care).
E.g. in a DMA-based Ethernet driver, the Ethernet MAC receives packets asynchronously from the processor, perhaps faster than the processor can ingest them. So the MAC interrupts the processor to hand it new packets, and the processor can't sit processing the packets in the interrupt context, so it needs to put them into some ordered list for processing later when it has downtime. In a true embedded system, the memory for this list is going to be fixed or statically allocated, but you still don't really want an array-style list with fixed indexing, as you'd have to manage what happens when the index wraps back around to 0, etc., so instead you just construct a linked list in that pre-allocated memory.
I wouldn't say linked lists aren't used in high-level applications; as I said, they're used all over the place whenever you have external asynchronous communication. It's just that modern high-level frameworks/libs totally abstract this away from most people writing high-level code.
From my understanding this is what led to the recent 'unpatchable' exploit in Apple's M1, but rather than trying to guess it by some heuristic, why not just give compilers the option to make that optimization?
It turns out programmers, or rather their employers, don't really care about using hardware efficiently. They care about shipping things yesterday, because that's how business deals get closed, and making software efficient is secondary to money changing hands. Performance never really matters to business people after the checks are cashed.
Multicore computers have been ubiquitous for more than a decade, yet the overwhelming majority of software built today is single-threaded microservices, wherein they spend most of their time serializing and deserializing message payloads.
This is all really to say that most performance is already being left on the table for the majority of what computers are used for today.
Benchmarking is hard, and very often is done in ideal conditions, which never materialize in real use cases.
The major slowness of linked lists comes from poor cache locality. New nodes can be allocated all over the memory space. However, if you can make sure all or part of the list lives in contiguous blocks of memory, then there's a good chance that when the CPU loads up the next node it will also grab the next 3 nodes in a cache line.
The closer these node addresses are in memory, the faster things will be.
Yes, if you have a simple choice between a vector and a linked list, then the vector is vastly superior due to locality and (for non-intrusive linked lists) allocations. So much so that vectors often win even when you're doing lots of O(n) deletions that would be O(1) with a linked list.
But that doesn't mean that linked lists are useless! A vector gives you a single ordering. What if you need multiple? What if you need a free list, which you're never going to be iterating over but will just grab off an item at a time? I find it quite common to have one "natural" order, for which I will use a vector (or equivalently, a bump allocator of fixed-size items), and then one or several auxiliary orders like the entries belonging to a particular owner or the ones that will need to be visited for some sort of cleanup or an undo list or some sort of stack or queue of pending items to process. Your common iteration will be in memory order, but that doesn't mean you won't ever want to do different iterations.
It annoys me that this is always omitted, with the attitude that linked lists are obsolete and useless because vectors be moar better faster gooder, to the point that it's a waste of time to learn how to manipulate linked lists anymore.
I guess a lot of this is probably due to the popularity of high level languages, where you're just dealing with references to everything in the first place. But in those, the arguments in favor of vectors are often not valid because that blessed golden vector of Objects is giving you no more locality than giving your Object a `next` field: the vector of Objects is represented internally as a vector of pointers to the actual Object data, so your oh so nicely cached lookups are doing memory reads that are 100% pure overhead compared to following a `next` link from an Object that fits into a cache line. In both cases, your performance is going to be limited by the locality of your Object data, which is the same whether you have a vector of pointers or Object data with an intrusive pointer.
Also, if you have an existing setup and need to introduce a new list, it is sometimes far easier to drop in a `next` (and maybe `prev`) field than to refactor everything to accommodate a new vector. Especially since the vector will move all of your data when resizing the vector, which invalidates any pointers you might be using for other purposes. If you'll be iterating that list frequently, then the vector may very well be a good idea. If it's just for error cases or slow paths, then linked lists really aren't bad at all.
I'm not trying to argue for linked lists here so much as arguing against the blanket arguments against them.
</rant>
You can very much have a single vector owning the memory, and do all other ordering over auxiliary vectors of indices. Should be cheaper and faster than holding more linked lists.
If you want to remove elements, you can very much use tombstones to flag deleted elements and then clean up after some threshold.
Cheaper in space depends on the percentage of elements in the auxiliary list. An intrusive list has space for every element to be in the list, which is wasteful if few are. A vector that grows by doubling could waste nearly half of its elements. Indexes can often be smaller than pointers, though, which favors the vector approach.
Faster is debatable. Iterating the vector of indexes is quite fast, but indirecting into the data in a random order is still slow. An intrusive linked list doesn't need to do the former, only the latter. (Then again, it also bloats up the data slightly, which matters for small items since fewer fit in cache.)
The main reason why linked lists could still be at an advantage is if you don't want allocations in your auxiliary list manipulation path. Maybe you don't want the unpredictable timing, or maybe you can't handle failure.
I agree on tombstoning, but note that you're giving up some of the vector advantages by doing so. Your processing loop now has a much less predictable branch in it. (Though really, the vector probably still wins here, since the linked list is built out of data dependent accesses!)
Sometimes these auxiliary lists need to contain things that are no longer in the main list, too, as in the case of a free list. (And if you swap such things to the end, then you've just invalidated all of your indexes.) And non-intrusive linked lists can point to variable-size elements (though they lose most of the other benefits of an intrusive linked list.)
Anyway, it's the usual "use the right tool for the right job", I just claim that linked lists are sometimes the right tool.
(Random side note: I use "indexes" instead of "indices" these days, for a silly reason—"indices" is a fine word, I have no issue with it, but it seems to encourage some people to use the non-word "indice" which is an abomination.)
In general, use vecs until you absolutely can't.
Perhaps there are a few uses for LL's in limited circumstances.
It would be interesting to see the performance of a `Vec<&str>` where you reuse the vector, but also a `Vec<u8>` where you copy the path bytes directly into the vector and don't bother doing any pointer traversals. The example path sections are all very small ('inner', 'another': 5 bytes, 7 bytes), less than the length of a pointer! Storing a whole `&str` is 16 bytes per element, and then you have to rebuild it again anyway in the invalid case.
---
This whole article is kinda bad, it's titled 'blazingly fast linked lists' which gives it some authority but the approach is all wrong. Man, be responsible if you're choosing titles like this. Someone's going to read this and assume it's a reasonable approach, but the entire section with Vec is bonkers.
Why are we designing 'blazingly fast' algorithms with rust primitives rather than thinking about where the data needs to go first? Why are we even considering vector clones or other crazy stuff? The thought process behind the naive approach and step 1 is insane to me:
1. i need to track some data that will grow and shrink like a stack, so my solution is to copy around an immutable Vec (???)
2. this is really slow for obvious reasons, how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)
You also mention:
> In some scenarios, where modifications occur way less often than clones, you can consider using Arc as explained in this video
I understand you're trying to be complete, but 'some scenarios' is doing a lot of work here. An Arc<[T]> approach is literally just the same as the naive approach but with extra atomic refcounts! Why mention it in this context?
You finally get around to mutating the vector + using it like a stack, but then comment:
> However, this approach requires more bookkeeping and somewhat more lifetime annotations, which can increase code complexity.
I have no idea why you mention 'code complexity' here (complexity introduced by rust and its lifetimes), but fail to mention how adding a dependency on 'imbl' is a negative.
To me this is a self-answering question.
The idea goes back to [0], which is similar to one of the steps in the article; before adding `push` & `pop` I just cloned it to make things work. That's what Rust beginners do.
> This feels like you had a cool conclusion for your article, 'linked lists faster than vec', but you had to engineer the vec example to be worse. Maybe I'm being cynical.
Maybe from today's point in time, I'd think the same.
> It would be interesting to see the performance of a `Vec<&str>` where you reuse the vector, but also a `Vec<u8>` where you copy the path bytes directly into the vector and don't bother doing any pointer traversals. The example path sections are all very small ('inner', 'another': 5 bytes, 7 bytes), less than the length of a pointer! Storing a whole `&str` is 16 bytes per element, and then you have to rebuild it again anyway in the invalid case.
Yeah, that makes sense to try!
> This whole article is kinda bad, it's titled 'blazingly fast linked lists' which gives it some authority but the approach is all wrong. Man, be responsible if you're choosing titles like this. Someone's going to read this and assume it's a reasonable approach, but the entire section with Vec is bonkers.
> Why are we designing 'blazingly fast' algorithms with rust primitives rather than thinking about where the data needs to go first? Why are we even considering vector clones or other crazy stuff? The thought process behind the naive approach and step 1 is insane to me:
> 1. i need to track some data that will grow and shrink like a stack, so my solution is to copy around an immutable Vec (???)
> 2. this is really slow for obvious reasons, how about we: pull in a whole new dependency ('imbl') that attempts to optimize for the general case using complex trees (???????????????)
That's clickbait-y, though none of the article's ideas claim to be a silver bullet. I mean, there are admittedly dumb ideas in the article, but I don't believe somebody would come up with a reasonable solution without trying something stupid first. However, I might have used better wording to highlight that, and mentioned that I came up with some of these ideas when I was working on `jsonschema` in the past.
> I understand you're trying to be complete, but 'some scenarios' is doing a lot of work here. An Arc<[T]> approach is literally just the same as the naive approach but with extra atomic refcounts! Why mention it in this context?
If you don't need to mutate the data and need to store it in some other struct, it might be useful, i.e. just to have cheap clones. But dang, that indeed is a whole different story.
> I have no idea why you mention 'code complexity' here (complexity introduced by rust and its lifetimes), but fail to mention how adding a dependency on 'imbl' is a negative.
Fair. Adding `imbl` wasn't a really good idea for this context at all.
Overall I think what you say is kind of fair, but I think that our perspectives on the goals of the article are quite different (which does not disregard the criticism).
Thank you for taking the time to answer!
- [0] - https://github.com/Stranger6667/jsonschema-rs/commit/1a1c6c3...