Fast, simple, hard real time allocator for Rust (opens in new tab)

(github.com)

207 pointspracticalrs2y ago58 comments

58 comments

I wrote almost the exact same thing after seeing Sebastian Aaltonen's original offset allocator repo which inspired me to write my own clone in Rust.

It's possible to further improve the fragmentation characteristics if the maximum size of the memory to be allocated is known ahead of time. The original used a 5.3 "floating point" scheme for the free bins, but given the maximum size you can opt for 4.4 or 6.2 or whatever can cover the whole region, giving more dense (or sparse) allocation granularity as needed.

This same idea can also be extended to allocating texture atlas regions when the regions have power of two size. 256 bins can cover all possible texture sizes that GPUs can handle, so you can get hard O(1) texture packing.

I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README). For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be 0(1). If anyone has ideas on this front, I would be happy to hear them.

This algorithm is so clever and kind of a game changer. Allocating and freeing are fast real time operations (only dozens of CPU cycles and a few memory writes). It kind of makes "bump allocators" obsolete because this allocator is almost as fast but supports trimming and freeing allocations as well. This algorithm requires more memory but the amount is quite low.

scott_s2y ago

> For that it would be necessary to make a reverse lookup from pointer address to allocation handle, which would require something like a red-black tree covering the address space, which would no longer be 0(1). If anyone has ideas on this front, I would be happy to hear them.

A radix tree can solve this: https://en.wikipedia.org/wiki/Radix_tree

I used one way, way back to do exactly the same thing: upon a free, I needed to look up all of the metadata for an address. For a 32-bit address space, you can just allocate a giant array up front, and use the address as an index. For a 64-bit address space, it's obviously way too big to statically allocate it up front. A radix tree neatly solves the problem. An arbitrarily sized radix tree is not constant lookup, but for reasons I honestly cannot remember, for a 64-bit address space, it's guaranteed to only be 3 levels deep.

See my implementation, which (I believe) I borrowed from tcmalloc: https://github.com/scotts/streamflow/blob/master/streamflow....

ot2y ago

Radix trees are not O(1), they're O(log n). Data structures that support constant-time predecessor lookup do not exist, there is a super-constant lower bound, even in the RAM model [1].

Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

Asymptotic bounds must be presented and understood in context.

[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...

scott_s2y ago

This function is O(1): https://github.com/scotts/streamflow/blob/master/streamflow..... All other tree operations are also constant, because the fact that it is a 3-level tree is hardcoded.

Asymptotic bounds are useful when your input can grow arbitrarily large. When your input is fixed, the bounds become less useful and you should focus on the particulars of your case. In the case I'm presenting, the work done traversing the tree will be the same every time; it is O(1). That doesn't necessarily mean it's fast enough! It still may do too much work for the use case. For instance, I can imagine someone saying that the function above does too much pointer chasing for their use case.

1 more reply

skybrian2y ago

Given that pointer size is fixed, maybe asymptotic bounds aren’t what we should care about? Maybe it would be better to ask how the cost increases just going up to 2^64 bytes of address space. The graph after that point is irrelevant.

An O(n^2) algorithm will blow up well before that.

Dylan168072y ago

> Often I hear people say "well pointer size is constant, so log(64) = O(1)", but that is misleading: if you assume constant pointer size, any function of that is O(1) as well, so any bounded algorithm is O(1). The notation just becomes meaningless.

It's not meaningless. We're designing for a real machine here, and the actual goal is to bound the runtime to a small known constant.

You're right that big O is not great notation for this. Hell, even O(1) isn't good enough because the constant factor is unspecified.

But the misuse of notation does not invalidate the problem. Taking the worst case for a log leaves you with a feasible runtime. Taking the worst case for n or n^2 or "any bounded algorithm" overwhelmingly does not.

> Asymptotic bounds must be presented and understood in context.

Yes, context is exactly what keeps it meaningful!

yvt2y ago

I did this too! Except it was based on the original TLSF paper (on which this work is based) as I wasn't aware of that offset allocator.

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1). The original TLSF algorithm (as proposed by the paper) was designed this way.

Months later, I also wrote a general purpose implementation of TLSF for Rust `GlobalAlloc`, which, with some rigorous optimizations, turned out to be faster and smaller than every allocator for WebAssembly and embedded environments I tested, and it's still my go-to choice. It didn't fare too bad even when compared against modern thread-caching allocators in normal use cases. Internal fragmentation is probably worse, though.

scott_s2y ago

> In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1).

Yes, but the downside is that now allocator metadata pollutes the cache. It's super efficient for the allocator, but it may harm efficiency of the actual use of that memory. I believe most production allocators don't use object headers for this reason.

Someone2y ago

> I believe most production allocators don't use object headers for this reason.

Isn’t the original reason hardening against buffer overflows? Overwriting allocator metadata can be used to attack a system.

1 more reply

ComputerGuru2y ago

Have you published your code to GH, by any chance?

yvt2y ago

Offset allocator: https://github.com/yvt/xalloc-rs/blob/master/src/tlsf.rs Don't mind the mess, it's one of my earliest works in Rust.

General-purpose allocator: https://github.com/yvt/rlsf

pcwalton2y ago

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

Besides the reverse mapping, you'd also have to get rid of the Vec use in the allocator, but that's fairly straightforward. Additionally, you'd need some sort of logic about requesting slabs of memory from the OS (and returning them when free), which is some extra work.

pcwalton2y ago

Author here. Didn't expect this to make HN :)

I don't deserve any credit for this; @SebAaltonen did all the work to write the C++ version. I just did a line-by-line translation.

widdershins2y ago

What's the difference between this and a slab allocator? Is it just that the bin size distribution wastes less memory (assuming the slab allocator used pow2 bin sizes)?

hinkley2y ago

Slab allocators don’t provide real time guarantees. That is arena allocators. The distinction may seem trivial but the requirements are distinct.

In tiny systems all allocations of one type may come from a single operation, but in larger systems what fits in a slab will come from distinct concerns with different priorities. You want a different arena for those allocations.

widdershins2y ago

I'm confused. To me it seems that real time guarantees are a function of three things:

- Is it constant time in algorithmic complexity?

- Is it single threaded (i.e. free from locks)?

- Does it need to fetch memory from the system (not realtime safe)?

There are slab allocators that can satisfy all of these, and are therefore realtime safe.

So I'm still wondering what the difference is.

Ono-Sendai2y ago

I have a best-fit allocator for this problem (managing GPU resources). It has higher CPU overhead due to using a std::map, but should waste less space due to not using fixed bucket sizes.

https://github.com/glaretechnologies/glare-core/blob/master/... https://github.com/glaretechnologies/glare-core/blob/master/...

lionkor2y ago

> Please note that offset-allocator isn't a Rust allocator conforming to the GlobalAlloc trait

Aw. Why not?

> find the next available bin using 2x LZCNT instructions to make all operations O(1)

Isn't that sort of implementation defined? The LZCNT operation itself is a loop over all bits -- the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64. But if it was 16 or 32 bit, it may be faster, which means its not O(1), or rather, big o notation really breaks down here.

orlp2y ago

> It just makes it O(n) where n := 64

That's O(1).

> The LZCNT operation itself is a loop over all bits

That's not how circuits work. You can easily make a circuit of O(log n) depth that returns base-2 encoded index of the first 1 bit in a n-bit register.

But since n is just 64 here, you're looking at a circuit of AND/OR depth ~8, so it's no surprise that modern CPUs can compute the LZCNT / TZCNT of a 64-bit integer in a single CPU cycle.

sapiogram2y ago

Note that rustc will not emit this instruction by default, to support older CPUs. Without `target-cpu=haswell` or similar, the operation will be quite slow, but still O(1) of course.

jsheard2y ago

It's not that bad, all x86 processors have the BSR instruction which can be mapped to LZCNT semantics with just a few extra instructions.

Left to right - modern x86-64, baseline x86-64, baseline i686:

https://rust.godbolt.org/z/nGsM35TEo

Maybe you're thinking of POPCNT, which hurts a lot more if it doesn't compile down to the native instruction:

https://rust.godbolt.org/z/xcxG3v4Mn

SkiFire132y ago

> That's O(1).

A loop over 2^64 elements is also O(1), but I don't think people are ok to label every program that can run on a 64 bit pc as O(1).

xboxnolifes2y ago

You should really refresh yourself on Big-O notation.

eru2y ago

Whether you call that O(1) or O(n) depends on the level of abstraction you want to talk about, and what you want to analyse.

For example, in some sense, anything you can do on a finite computer is O(1) (or alternatively, it's an infinite loop). But that is a very boring sense, and seldom useful for analysis.

There's a formal definition for big-O notation, but most of the parameters to that formal definition are only implied in casual discussions.

exDM692y ago

> Aw. Why not?

Because the global allocator API is defined as free(pointer address) and this used free(buffer handle).

It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

> the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64

O(n) when n is constant is equal to O(1).

But lzcnt uses a fixed number of clock cycles so it's not really O(n) either.

jsheard2y ago

GlobalAlloc is also required to be thread safe, and TLSF doesn't even attempt to handle that. I suppose you could get away with it on single-threaded platforms like microcontrollers or (vanilla) WebAssembly but it wouldn't generalize unless you wrapped it in a mutex, and then the performance would be terrible.

exDM692y ago

Correct me if I'm wrong but can't you put a Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator implementation for interior mutability?

This make a non-thread safe allocator, with the (desired) side effect of making anything allocating from it (e.g. Vec) a !Send.

Or is there a requirement that Allocators must be Sync and Send?

1 more reply

couchand2y ago

Any uc worth programming on has interrupts!

1 more reply

orlp2y ago

> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Well, another solution is enlarging each allocation by 8 bytes and storing the buffer handle in front of the returned allocation. So `free(ptr)` would get the handle as `*((ptr as *const u64) - 1)`.

exDM692y ago

That works but...

This kind of allocators are usually used for suballocating GPU buffers, so hiding a few bytes of metadata "in-band" can mess up your alignment requirements and not all kinds of GPU memory are even accessible by CPU. Due to false sharing cache problems you would probably want a full cache line (64 bytes) to store the metadata.

For a CPU-only memory allocator your idea could work quite well. It can also be implemented on top of this code without any modifications.

Sharlin2y ago

OTOH, I think this would be a good fit for the "Store" proposal [1] which uses handles rather than addresses to refer to allocations.

[1] https://github.com/matthieu-m/storage/blob/main/etc/rfc.md

pbalcer2y ago

> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Not necessarily. If you are able to map a block of normal memory in a known location relative to the GPU memory that you are managing with an "offset allocator", it should be possible to directly calculate the metadata location for each "offset". This is how most allocators find arenas/buckets (whatever you want to call them) for an allocation by a pointer.

Something like this:

  +-------------------------+ 0x000000 (start of managed memory)
  | Metadata                | 
  |                         |
  |                         |
  | ...                     |
  +-------------------------+ 0x001000 
  | Padding ...             |
  +-------------------------+ 0x010000
  | GPU Memory Block        |
  |                         |
  |                         |
  |                         | ~2MB block
  |                         |
  |                         |
  |                         |
  +-------------------------+ 0x210000

With this layout, to get to a metadata for an allocation, all you need to do is to align down the allocation pointer and calculate the appropriate location in the metadata page.

This obviously won't work in all scenarios, but it's a simple and practical way around a map lookup.

exDM692y ago

A good idea but not O(1).

If you have a large allocation you have to mark every "page" in metadata table which makes allocation O(size in bytes / page size).

The overhead might still be practically acceptable, even if it is not constant time.

jameshart2y ago

Any big O discussion usually breaks down (or at least, changes) if you start taking into account the data size of the operands.

You generally assume for big O purposes, when analyzing sorts for example, that comparisons of elements are constant time, and that swapping elements is constant time.

On an n-bit computer, when dealing with m-bit elements, those assumptions are broadly sound. Comparing two ints doesn’t depend on the ints. Reading or writing the int from memory doesn’t take more or less time either.

But the same algorithm, while it has the same big O relationship to the size of a collection, might have a different constant factor on different CPUs and for different values of m. And you might find that some algorithms that constant factor’s relationship to m has its own big O.

One common place this can bite you is when you try to apply big O analysis that has ‘constant time comparison’ assumption built in to, say, sorting arbitrarily long strings, or inserting into a hash table using arbitrary string keys. Comparing strings or calculating hashes of them is not constant time.

empath-nirvana2y ago

In a practical sense, how you define big O depends on what you consider to be the inputs to the function. If the runtime doesn't change depending on the size of the inputs that you care about, it's O(1).

Like, you might have a function with 2 inputs, N and M, and the run time is like O(m2^n), but n is fixed at a low number every time you'll run it in practice, so it's really just O(m) for your purposes.

jameshart2y ago

Right. O(f(n)) is literally only defined for situations where n 1: varies between different runs of the algorithm, and 2: can grow arbitrarily large. Even though in practice ‘arbitrarily large’ is always limited by memory, storage, etc.

Talking about algorithms being O(n) in the number of bits in a value is only reasonable if the number of bits in the value actually varies between runs.

slashdev2y ago

In all x86 this is an instruction that typically takes 3 clock cycles (latency) with 1 clock cycle throughput. That’s as constant time as you get.

Even if that were false, it’s O(k) where k is the number of bits, which is constant. However, in that case it might be slower than the alternatives.

dzaima2y ago

Integer addition, and indeed pretty much all operations, are also a loops over all bits, LZCNT is in no way at all unique here.

hinkley2y ago

Why is this hard realtime in the face of arbitrary allocations and deallocations? These dots aren't connected anywhere within two links distance from the given URL.

joshsyn2y ago

How is this exactly going to be used in bevy?

CaptainOfCoit2y ago

At least one mention of it here https://github.com/bevyengine/bevy/issues/12590

> Use one set of large mesh buffers per vertex attribute layout

> Problem: Index/vertex buffers have to be re-bound when the mesh changes. This adds overhead when encoding draws and when drawing. It also prevents some optimisations like being able to draw all objects for shadow mapping for a light in one draw.

> Solution(s): [...] Use an appropriate allocator like a port of Sebastian Aaltonen's offset allocator [https://github.com/sebbbi/OffsetAllocator] to manage allocation

Where the "port of Sebastian Aaltonen's offset allocator" is what got linked in this HN submission.

pcwalton2y ago

The goal is to start storing multiple meshes in the same vertex buffer/index buffer. (I already have a decent chunk of this code written in a branch.) This reduces binding overhead, but moreover it allows us to start using multidraw indirect where supported, which should reduce drawcall counts dramatically. I measured a 97% drawcall count reduction for the opaque pass on the Bistro test scene, for example. The benefits are even higher for shadow maps, because they use a single shader and so the drawcall count should drop to the single digits in most cases.

I've said it before, but if drawcall count is a problem in your game, you should complain to the engine vendor. On modern GPUs, there's no reason people should be manually optimizing for drawcall count in 2024. Game engines have the tools to make it a non-issue; they just need to use them.

mgrithm2y ago

whats a offset allocator?

jjtheblunt2y ago

https://github.com/sebbbi/OffsetAllocator

which has links to a pertinent paper at the end of that page.

hinkley2y ago

That doesn't explain anything.

And the linked paper doesn't even contain the word 'offset'.

jjtheblunt2y ago

I mean you could presumably read the code.

It looks like the general idea of TLSF is an array of second level arrays, where the second level arrays hold blocks of fixed size, and in TLSF those fixed sizes for the secondary arrays increase as powers of 2, and in this so called offset allocator the secondary level arrays use a more complex size statistically chosen. Allocation requests seemingly return an offset into a second level array.

Kinda uncommented code, to your point, makes it not obvious.

j / k navigate · click thread line to collapse

58 comments

exDM692y ago

I wrote almost the exact same thing after seeing Sebastian Aaltonen's original offset allocator repo which inspired me to write my own clone in Rust.

scott_s2y ago

A radix tree can solve this: https://en.wikipedia.org/wiki/Radix_tree

See my implementation, which (I believe) I borrowed from tcmalloc: https://github.com/scotts/streamflow/blob/master/streamflow....

ot2y ago

Radix trees are not O(1), they're O(log n). Data structures that support constant-time predecessor lookup do not exist, there is a super-constant lower bound, even in the RAM model [1].

Asymptotic bounds must be presented and understood in context.

[1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathematic...

scott_s2y ago

This function is O(1): https://github.com/scotts/streamflow/blob/master/streamflow..... All other tree operations are also constant, because the fact that it is a 3-level tree is hardcoded.

1 more reply

skybrian2y ago

An O(n^2) algorithm will blow up well before that.

Dylan168072y ago

It's not meaningless. We're designing for a real machine here, and the actual goal is to bound the runtime to a small known constant.

You're right that big O is not great notation for this. Hell, even O(1) isn't good enough because the constant factor is unspecified.

> Asymptotic bounds must be presented and understood in context.

Yes, context is exactly what keeps it meaningful!

yvt2y ago

I did this too! Except it was based on the original TLSF paper (on which this work is based) as I wasn't aware of that offset allocator.

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

scott_s2y ago

> In a general purpose allocator, you can just store allocation metadata next to a payload so the lookup from a pointer address becomes O(1).

Someone2y ago

> I believe most production allocators don't use object headers for this reason.

Isn’t the original reason hardening against buffer overflows? Overwriting allocator metadata can be used to attack a system.

1 more reply

ComputerGuru2y ago

Have you published your code to GH, by any chance?

yvt2y ago

Offset allocator: https://github.com/yvt/xalloc-rs/blob/master/src/tlsf.rs Don't mind the mess, it's one of my earliest works in Rust.

General-purpose allocator: https://github.com/yvt/rlsf

pcwalton2y ago

> I'm also curious about making something like this work as a general purpose allocator (as alluded to by the README).

pcwalton2y ago

Author here. Didn't expect this to make HN :)

I don't deserve any credit for this; @SebAaltonen did all the work to write the C++ version. I just did a line-by-line translation.

widdershins2y ago

What's the difference between this and a slab allocator? Is it just that the bin size distribution wastes less memory (assuming the slab allocator used pow2 bin sizes)?

hinkley2y ago

Slab allocators don’t provide real time guarantees. That is arena allocators. The distinction may seem trivial but the requirements are distinct.

widdershins2y ago

I'm confused. To me it seems that real time guarantees are a function of three things:

- Is it constant time in algorithmic complexity?

- Is it single threaded (i.e. free from locks)?

- Does it need to fetch memory from the system (not realtime safe)?

There are slab allocators that can satisfy all of these, and are therefore realtime safe.

So I'm still wondering what the difference is.

Ono-Sendai2y ago

I have a best-fit allocator for this problem (managing GPU resources). It has higher CPU overhead due to using a std::map, but should waste less space due to not using fixed bucket sizes.

https://github.com/glaretechnologies/glare-core/blob/master/... https://github.com/glaretechnologies/glare-core/blob/master/...

lionkor2y ago

> Please note that offset-allocator isn't a Rust allocator conforming to the GlobalAlloc trait

Aw. Why not?

> find the next available bin using 2x LZCNT instructions to make all operations O(1)

orlp2y ago

> It just makes it O(n) where n := 64

That's O(1).

> The LZCNT operation itself is a loop over all bits

That's not how circuits work. You can easily make a circuit of O(log n) depth that returns base-2 encoded index of the first 1 bit in a n-bit register.

But since n is just 64 here, you're looking at a circuit of AND/OR depth ~8, so it's no surprise that modern CPUs can compute the LZCNT / TZCNT of a 64-bit integer in a single CPU cycle.

sapiogram2y ago

Note that rustc will not emit this instruction by default, to support older CPUs. Without `target-cpu=haswell` or similar, the operation will be quite slow, but still O(1) of course.

jsheard2y ago

It's not that bad, all x86 processors have the BSR instruction which can be mapped to LZCNT semantics with just a few extra instructions.

Left to right - modern x86-64, baseline x86-64, baseline i686:

https://rust.godbolt.org/z/nGsM35TEo

Maybe you're thinking of POPCNT, which hurts a lot more if it doesn't compile down to the native instruction:

https://rust.godbolt.org/z/xcxG3v4Mn

SkiFire132y ago

> That's O(1).

A loop over 2^64 elements is also O(1), but I don't think people are ok to label every program that can run on a 64 bit pc as O(1).

xboxnolifes2y ago

You should really refresh yourself on Big-O notation.

eru2y ago

Whether you call that O(1) or O(n) depends on the level of abstraction you want to talk about, and what you want to analyse.

For example, in some sense, anything you can do on a finite computer is O(1) (or alternatively, it's an infinite loop). But that is a very boring sense, and seldom useful for analysis.

There's a formal definition for big-O notation, but most of the parameters to that formal definition are only implied in casual discussions.

exDM692y ago

> Aw. Why not?

Because the global allocator API is defined as free(pointer address) and this used free(buffer handle).

It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

> the fact that the number of bits is constant doesn't make this O(1), it just makes it O(n) where n := 64

O(n) when n is constant is equal to O(1).

But lzcnt uses a fixed number of clock cycles so it's not really O(n) either.

jsheard2y ago

exDM692y ago

Correct me if I'm wrong but can't you put a Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator implementation for interior mutability?

This make a non-thread safe allocator, with the (desired) side effect of making anything allocating from it (e.g. Vec) a !Send.

Or is there a requirement that Allocators must be Sync and Send?

1 more reply

couchand2y ago

Any uc worth programming on has interrupts!

1 more reply

orlp2y ago

> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Well, another solution is enlarging each allocation by 8 bytes and storing the buffer handle in front of the returned allocation. So `free(ptr)` would get the handle as `*((ptr as *const u64) - 1)`.

exDM692y ago

That works but...

For a CPU-only memory allocator your idea could work quite well. It can also be implemented on top of this code without any modifications.

Sharlin2y ago

OTOH, I think this would be a good fit for the "Store" proposal [1] which uses handles rather than addresses to refer to allocations.

[1] https://github.com/matthieu-m/storage/blob/main/etc/rfc.md

pbalcer2y ago

> It would require a reverse lookup structure from address to buffer handle, e.g. red-black tree. Maintaining it would no longer be O(1).

Something like this:

  +-------------------------+ 0x000000 (start of managed memory)
  | Metadata                | 
  |                         |
  |                         |
  | ...                     |
  +-------------------------+ 0x001000 
  | Padding ...             |
  +-------------------------+ 0x010000
  | GPU Memory Block        |
  |                         |
  |                         |
  |                         | ~2MB block
  |                         |
  |                         |
  |                         |
  +-------------------------+ 0x210000

With this layout, to get to a metadata for an allocation, all you need to do is to align down the allocation pointer and calculate the appropriate location in the metadata page.

This obviously won't work in all scenarios, but it's a simple and practical way around a map lookup.

exDM692y ago

A good idea but not O(1).

If you have a large allocation you have to mark every "page" in metadata table which makes allocation O(size in bytes / page size).

The overhead might still be practically acceptable, even if it is not constant time.

jameshart2y ago

Any big O discussion usually breaks down (or at least, changes) if you start taking into account the data size of the operands.

You generally assume for big O purposes, when analyzing sorts for example, that comparisons of elements are constant time, and that swapping elements is constant time.

empath-nirvana2y ago

jameshart2y ago

Talking about algorithms being O(n) in the number of bits in a value is only reasonable if the number of bits in the value actually varies between runs.

slashdev2y ago

In all x86 this is an instruction that typically takes 3 clock cycles (latency) with 1 clock cycle throughput. That’s as constant time as you get.

Even if that were false, it’s O(k) where k is the number of bits, which is constant. However, in that case it might be slower than the alternatives.

dzaima2y ago

Integer addition, and indeed pretty much all operations, are also a loops over all bits, LZCNT is in no way at all unique here.

hinkley2y ago

Why is this hard realtime in the face of arbitrary allocations and deallocations? These dots aren't connected anywhere within two links distance from the given URL.

joshsyn2y ago

How is this exactly going to be used in bevy?

CaptainOfCoit2y ago

At least one mention of it here https://github.com/bevyengine/bevy/issues/12590

> Use one set of large mesh buffers per vertex attribute layout

> Solution(s): [...] Use an appropriate allocator like a port of Sebastian Aaltonen's offset allocator [https://github.com/sebbbi/OffsetAllocator] to manage allocation

Where the "port of Sebastian Aaltonen's offset allocator" is what got linked in this HN submission.

pcwalton2y ago

mgrithm2y ago

whats a offset allocator?

jjtheblunt2y ago

https://github.com/sebbbi/OffsetAllocator

which has links to a pertinent paper at the end of that page.

hinkley2y ago

That doesn't explain anything.

And the linked paper doesn't even contain the word 'offset'.

jjtheblunt2y ago

I mean you could presumably read the code.

Kinda uncommented code, to your point, makes it not obvious.

j / k navigate · click thread line to collapse