Tiny Pointers (opens in new tab)

(arxiv.org)

163 pointslevzettelin1y ago35 comments

35 comments

I looked a bit into this a few years back and found it quite interesting. Despite them calling them "Tiny Pointers" I would say it's closer to a open addressing hash map. You have a specific key, and then you can "allocate" an entry in the hash map. This gives you back a "pointer". You can then later use the original key and the pointer together to determine the index of the entry. There's also a slight chance that the allocation will fail (similar to a hash collision in a hash map). The "trick" here is that two different keys can end up having the exact same pointer (because we're always dereferencing with the key). This makes them more compact.

I was struggling a bit to come up with good use cases for it. Their examples are all around combining them with existing data structures and they show that the space complexity is smaller, but it wasn't completely clear to me how feasible this would actually be in practice.

qazxcvbnm1y ago

I read the paper a few years ago and I agree that for such an incredible algorithmic improvement, it’s not trivial to find a use case, as you still need to maintain a separate (albeit algorithmically insignificant) lookup table. When I read the paper I (mistakenly) hoped it could be used for indexing on-disk structures without ever hitting the disk for things like B tree internal nodes. To get its sweet sweet algorithmic complexity, up its sleeve is once again (as I recall) the age old trick of rebuilding the structure when the size doubles, which makes it much less efficient than it sounds for most practical use cases. I suppose one good use case for this might be compressing database indexes, where you need to maintain a separate structure anyway and the space savings can be worth it.

judofyr1y ago

I didn't invest a lot of time in it because when these papers show no practical applications I always assume that the constant factors are too high. Especially when there's also no specific values of the constant factors presented. It's still somewhere back in my mind to implement it one day to figure out the nitty gritty details.

1 more reply

Izikiel431y ago

This guy used that paper to create a new hash table which is much faster at high load that current implementations:

https://www.quantamagazine.org/undergraduate-upends-a-40-yea...

nsteel1y ago

That paper was front page 2 days ago. It's the reason this paper is front page today. And while it is fantastic work, it has no obvious break-through application, as was discussed: https://news.ycombinator.com/item?id=43004955

1 more reply

taeric1y ago

It isn't hard to make a datastructure that indexes into itself. BDDs, for example, are often coded in this way. I did an admittedly poor job of one at https://taeric.github.io/trading-with-bdds.html, but I think it is enough to see the idea well enough.

CyberDildonics1y ago

That's just called a hash map.

mont_tag1y ago

Didn't Python's compact dictionary implementation do this a decade ago?

"The dict type now uses a “compact” representation based on a proposal by Raymond Hettinger which was first implemented by PyPy. The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5." -- https://docs.python.org/3.6/whatsnew/3.6.html#whatsnew36-com...

"Note, the sizeof(index) can be as small as a single byte for small dicts, two bytes for bigger dicts and up to sizeof(Py_ssize_t) for huge dict." -- https://mail.python.org/pipermail/python-dev/2012-December/1...

The "tiny pointers" are in the _make_index method in the proof of concept code. -- https://code.activestate.com/recipes/578375-proof-of-concept...

      @staticmethod
      def _make_index(n):
          'New sequence of indices using the smallest possible datatype'
          if n <= 2**7: return array.array('b', [FREE]) * n       # signed char
          if n <= 2**15: return array.array('h', [FREE]) * n      # signed short
          if n <= 2**31: return array.array('l', [FREE]) * n      # signed long
          return [FREE] * n

The logic is still present today in CPython. -- https://raw.githubusercontent.com/python/cpython/3e222e3a159...

  dk_indices is actual hashtable.  It holds index in entries, or DKIX_EMPTY(-1)
  or DKIX_DUMMY(-2).
  Size of indices is dk_size.  Type of each index in indices varies with dk_size:

  * int8  for          dk_size <= 128
  * int16 for 256   <= dk_size <= 2**15
  * int32 for 2**16 <= dk_size <= 2**31
  * int64 for 2**32 <= dk_size

judofyr1y ago

The "compact dictionary" representation you're talking about not always using 64-bit numbers, but rather using log(n)-bit values when you have n entries (i.e. using 8-bit values when there's only <= 128 entries). The paper linked here talks about pointers which uses less than log(n)-bits for n entries.

yorwba1y ago

These are the "traditional log n-bit pointers" that tiny pointers improve on.

kragen1y ago

No, although it's not completely unrelated.

kragen1y ago

Note that this is not the paper by Krapivin that https://news.ycombinator.com/item?id=43002511 is about.

vlovich1231y ago

It’s the paper mentioned in the same article and one of the author’s is his professors.

kragen1y ago

Yes.

kazinator1y ago

> How large do the pointers need to be? The natural answer is that each pointer uses log nbits. However, the fact that each pointer has a distinct owner makes it possible to compress the pointers to o(log n) bits.

What if you have to debug the whole situation, such that you don't always know who is the owner of a pointer you are looking at?

> A user k can call Allocate(k) in order to get a tiny pointer p; they can dereference the tiny pointer pby computing a function Dereference(k,p) whose value depends only on k, p, and random bits; and they can free a tiny pointer p by calling a function Free(k,p).

That is tantamount ot saying that the pointer is not actually p but the tuple <k, p>, and so its size consists of the number of bits in k, the number of bits in p plus an indication of where the division between these bits lie: where k ends and p begins.

We can abbreviate <k, p> to p in contexts where k can be implicitly understood.

BenoitEssiambre1y ago

Here's a somewhat related post where I argue that logic that _can_ have small pointers has lower entropy and is more optimal as bayesian model of the domain: https://benoitessiambre.com/entropy.html

levzettelinOP1y ago

Can someone ELI5?

taeric1y ago

I could be wrong, but I think the easiest way to think of this is to consider how much extra memory programs took when compiled with 64 bit pointers over 32 bit ones. Suddenly every pointer takes double the memory. Which, sure, isn't a huge deal if you don't have a lot of memory allocations. But, if you do, it can add up.

Places it would likely impact more than you'd realize is in higher level language arrays where each item in the array is a pointer. For similar reasons, many datastructures can be coded such that each "pointer" is an array index instead.

So, extending all of that, what if you could make your pointers even smaller than 32 bits? If you know the addressable need for where the pointer is used, there is no reason you can't go smaller.

card_zero1y ago

Yet these aren't offsets from an address, that is, array indexes?

1 more reply

trymas1y ago

Most likely related with this post: https://news.ycombinator.com/item?id=43002511

chews1y ago

Correct, this is about Yao's conjecture and the implications for other hash implementations.

ceeam1y ago

Zoomers invented indexes.

ramon1561y ago

Programmer made program. What's new! When will they make magic orbs?!

nimih1y ago

Every author on this paper is a millennial or gen x.

kittikitti1y ago

Is there a peer reviewed version of this or does Hacker News exclusively post non peer reviewed works?

wbolt1y ago

https://dl.acm.org/doi/10.1145/3700594

kittikitti1y ago

Why wasn't this one originally linked?

j / k navigate · click thread line to collapse

35 comments

judofyr1y ago

qazxcvbnm1y ago

judofyr1y ago

1 more reply

Izikiel431y ago

This guy used that paper to create a new hash table which is much faster at high load that current implementations:

https://www.quantamagazine.org/undergraduate-upends-a-40-yea...

nsteel1y ago

1 more reply

taeric1y ago

CyberDildonics1y ago

That's just called a hash map.

mont_tag1y ago

Didn't Python's compact dictionary implementation do this a decade ago?

The "tiny pointers" are in the _make_index method in the proof of concept code. -- https://code.activestate.com/recipes/578375-proof-of-concept...

      @staticmethod
      def _make_index(n):
          'New sequence of indices using the smallest possible datatype'
          if n <= 2**7: return array.array('b', [FREE]) * n       # signed char
          if n <= 2**15: return array.array('h', [FREE]) * n      # signed short
          if n <= 2**31: return array.array('l', [FREE]) * n      # signed long
          return [FREE] * n

The logic is still present today in CPython. -- https://raw.githubusercontent.com/python/cpython/3e222e3a159...

  dk_indices is actual hashtable.  It holds index in entries, or DKIX_EMPTY(-1)
  or DKIX_DUMMY(-2).
  Size of indices is dk_size.  Type of each index in indices varies with dk_size:

  * int8  for          dk_size <= 128
  * int16 for 256   <= dk_size <= 2**15
  * int32 for 2**16 <= dk_size <= 2**31
  * int64 for 2**32 <= dk_size

judofyr1y ago

yorwba1y ago

These are the "traditional log n-bit pointers" that tiny pointers improve on.

kragen1y ago

No, although it's not completely unrelated.

kragen1y ago

Note that this is not the paper by Krapivin that https://news.ycombinator.com/item?id=43002511 is about.

vlovich1231y ago

It’s the paper mentioned in the same article and one of the author’s is his professors.

kragen1y ago

Yes.

kazinator1y ago

What if you have to debug the whole situation, such that you don't always know who is the owner of a pointer you are looking at?

We can abbreviate <k, p> to p in contexts where k can be implicitly understood.

BenoitEssiambre1y ago

Here's a somewhat related post where I argue that logic that _can_ have small pointers has lower entropy and is more optimal as bayesian model of the domain: https://benoitessiambre.com/entropy.html

levzettelinOP1y ago

Can someone ELI5?

taeric1y ago

So, extending all of that, what if you could make your pointers even smaller than 32 bits? If you know the addressable need for where the pointer is used, there is no reason you can't go smaller.

card_zero1y ago

Yet these aren't offsets from an address, that is, array indexes?

1 more reply

trymas1y ago

Most likely related with this post: https://news.ycombinator.com/item?id=43002511

chews1y ago

Correct, this is about Yao's conjecture and the implications for other hash implementations.

ceeam1y ago

Zoomers invented indexes.

ramon1561y ago

Programmer made program. What's new! When will they make magic orbs?!

nimih1y ago