- We have lots of data in different databases and just need a unified view (ETL / data warehousing) - it's where most data in most businesses is trapped. Next steps: common data definitions across the company, plus a top-level mandate to get a grip on it.
- We can pull data together but need it to undergo what-if analysis or aggregation for reporting. This is usually regulatory or data warehousing?
All the above are at the "size of enterprise Oracle / other RDBMS" scale. You could have billions of records here, but usually the billions come from dozens of databases with millions each ...
Big Data seems to be at the point of trying to do the ETL / data warehousing for those dozens of different databases - put it all into a map-reduce-friendly structure (Spark, Hadoop) and then run general queries - data provenance becomes a huge issue then.
Then we have the data science approach of data in sets / key-value stores, which I would classify as predictive - K-nearest neighbour etc. (see the sketch below).
I suspect I am wildly wrong in many areas, but I'm just trying to get it straight.
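Since K-nearest neighbour got name-checked above, here is a minimal, purely illustrative sketch of that predictive style of query - brute-force plain Python, with made-up toy data and a made-up knn_predict helper, nothing from any particular library:

from collections import Counter
import math

def knn_predict(points, query, k=3):
    # points: list of (feature_vector, label) pairs; query: a feature vector.
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
    # Majority vote among the k nearest labelled points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]
print(knn_predict(data, (1.1, 1.0)))  # -> "a"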
At the core of the author’s Show HN is an exact algorithm implementation / port for all-pair similarity search. One of the steps in all-pair similarity search, metric k-center, is an NP-complete problem. [1]
So we’ve got an exact algorithm that needs to solve an NP-complete problem to produce a result, making the overall problem at least as hard.
Any speed increase to such an algorithm at millions of data points is awesome! If you’ve got billions of data points, chances are you can distill them down to millions, and if that’s possible you’d still get an exact result. Or you could use a heuristic algorithm - some sort of polynomial-time approximation - which can scale to billions and still get you a good-enough result.
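As one example of the polynomial-time approximation route: Gonzalez's farthest-first traversal is the classic greedy 2-approximation for metric k-center. A rough sketch (the dist argument is assumed to be a proper metric, and points is whatever data you feed it - this isn't the linked project's code):

def k_centers(points, k, dist):
    # Greedy farthest-first traversal: pick an arbitrary first center, then
    # repeatedly add the point farthest from all centers chosen so far.
    # O(n*k) distance evaluations; a 2-approximation when dist is a metric.
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center so far
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers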
1 - https://static.googleusercontent.com/media/research.google.c...
This is not correct. It's very obvious that all-pair similarity search can be solved in O(n^2) calls to the similarity metric, as stated in the readme. So unless the metric itself falls outside P, this problem is easy (but still hard to scale up in practice, of course)
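For reference, the O(n^2) brute force being referred to is just a double loop over the items; sim and threshold here are placeholders for illustration, not anything from the linked project:

def all_pairs(items, sim, threshold):
    # Naive all-pair similarity search: exactly n*(n-1)/2 calls to sim().
    hits = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            s = sim(items[i], items[j])
            if s >= threshold:
                hits.append((i, j, s))
    return hits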
Data science is about what you do with the data, not about how big the data is.
Update:
In IPython, using the pyhash library (C++):
import pyhash
h = pyhash.murmur3_32()
timeit h(b"test")
703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import hashlib
timeit hashlib.sha1(b"test")
217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Re-running, this time calling .digest() so the sha1 is actually finalized:
import pyhash
h = pyhash.murmur3_32()
timeit h(b"test")
576 ns ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import hashlib
timeit hashlib.sha1(b"test").digest()
518 ns ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import mmh3
timeit mmh3.hash(b"test")
156 ns ± 0.704 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Worth keeping in mind that pyhash and mmh3 finalize the hash immediately, whereas hashlib needs digest() to call sha1_done(). That said, mmh3 seems to give a respectable ~70% speedup! (assuming 32-bit hashes are acceptable) ... actually, let's compare apples to apples:
timeit mmh3.hash128(b"test")
180 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Still pretty awesome for the little test!...
And for something more realistic:
timeit hashlib.sha1(b"test"*4096).digest()
21.9 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
timeit h(b"test"*4096)
6.81 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit mmh3.hash128(b"test"*4096)
3.03 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An order of magnitude! (well, close: roughly 7x)
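If anyone wants to reproduce this outside an IPython session, something along these lines should work (assumes pip-installed mmh3 and pyhash; the absolute numbers will of course vary by machine and build):

import timeit

setup = """
import hashlib, mmh3, pyhash
h = pyhash.murmur3_32()
data = b"test" * 4096
"""

# Time each hash call over the same 16 KiB payload used above.
for label, stmt in [
    ("sha1 + digest", "hashlib.sha1(data).digest()"),
    ("pyhash murmur3_32", "h(data)"),
    ("mmh3 128-bit", "mmh3.hash128(data)"),
]:
    n = 10000
    secs = timeit.timeit(stmt, setup=setup, number=n)
    print(f"{label:20s} {secs / n * 1e6:8.2f} µs per call")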