- We have lots of data in different databases and just need a unified view (ETL / data warehousing) - it's where most data in most businesses is trapped. Next steps: common data definitions across the company, plus a top-level mandate to get a grip on it.
- We can pull data together but need it to undergo what-if analysis or aggregation for reporting. This is usually regulatory or data warehousing?
All the above are at the "size of enterprise Oracle / other RDBMS" scale. You could have billions of records here, but usually the billions come from dozens of databases with millions each ...
Big Data seems to be at the point of trying to do the ETL / data warehousing for those dozens of different databases - put it all into a map-reduce-friendly structure (Spark, Hadoop) and then run general queries - data provenance becomes a huge issue then.
Then we have the data science approach of data in sets / key-value stores, which I would classify as predictive - K-nearest neighbour etc. (see the sketch below).
I suspect I am wildly wrong in many areas, but I'm just trying to get it straight.
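Since K-nearest neighbour got name-checked above, here is a minimal, purely illustrative sketch of that predictive style of query - brute-force plain Python, with made-up toy data and a made-up knn_predict helper, nothing from any particular library:

from collections import Counter
import math

def knn_predict(points, query, k=3):
    # points: list of (feature_vector, label) pairs; query: a feature vector.
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
    # Majority vote among the k nearest labelled points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]
print(knn_predict(data, (1.1, 1.0)))  # -> "a"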
At the core of the author’s Show HN is an exact algorithm implementation / port for all-pair similarity search. One of the steps in all-pair similarity search, metric k-center, is an NP-complete problem. [1]
So we’ve got an exact algorithm that needs to solve an NP-complete problem to produce a result, making the overall problem at least as hard.
Any speed increase to such an algorithm at millions of data points is awesome! If you’ve got billions of data points, chances are you can distill them down to millions, and if that’s possible you’d still get an exact result. Or you could use a heuristic algorithm - some sort of polynomial-time approximation - which can scale to billions and still get you a good-enough result.
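As one example of the polynomial-time approximation route: Gonzalez's farthest-first traversal is the classic greedy 2-approximation for metric k-center. A rough sketch (the dist argument is assumed to be a proper metric, and points is whatever data you feed it - this isn't the linked project's code):

def k_centers(points, k, dist):
    # Greedy farthest-first traversal: pick an arbitrary first center, then
    # repeatedly add the point farthest from all centers chosen so far.
    # O(n*k) distance evaluations; a 2-approximation when dist is a metric.
    centers = [points[0]]
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center so far
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers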
1 - https://static.googleusercontent.com/media/research.google.c...
This is not correct. It's very obvious that all-pair similarity search can be solved in O(n^2) calls to the similarity metric, as stated in the readme. So unless the metric itself falls outside P, this problem is easy (but still hard to scale up in practice, of course)
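For reference, the O(n^2) brute force being referred to is just a double loop over the items; sim and threshold here are placeholders for illustration, not anything from the linked project:

def all_pairs(items, sim, threshold):
    # Naive all-pair similarity search: exactly n*(n-1)/2 calls to sim().
    hits = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            s = sim(items[i], items[j])
            if s >= threshold:
                hits.append((i, j, s))
    return hits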
Data science is about what you do with the data, not about how big the data is.
Update:
In IPython, using the pyhash library (C++):
import pyhash
h = pyhash.murmur3_32()
timeit h(b"test")
703 ns ± 4.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import hashlib
timeit hashlib.sha1(b"test")
217 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Re-running, this time calling .digest() so the sha1 is actually finalized:
import pyhash
h = pyhash.murmur3_32()
timeit h(b"test")
576 ns ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import hashlib
timeit hashlib.sha1(b"test").digest()
518 ns ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
import mmh3
timeit mmh3.hash(b"test")
156 ns ± 0.704 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Worth keeping in mind that pyhash and mmh3 finalize the hash immediately, whereas hashlib needs digest() to call sha1_done(). That said, mmh3 seems to give a respectable ~70% speedup! (assuming 32-bit hashes are acceptable) ... actually, let's compare apples to apples:
timeit mmh3.hash128(b"test")
180 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Still pretty awesome for the little test!...
And for something more realistic:
timeit hashlib.sha1(b"test"*4096).digest()
21.9 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
timeit h(b"test"*4096)
6.81 µs ± 38 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
timeit mmh3.hash128(b"test"*4096)
3.03 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An order of magnitude! (well, close: roughly 7x)
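If anyone wants to reproduce this outside an IPython session, something along these lines should work (assumes pip-installed mmh3 and pyhash; the absolute numbers will of course vary by machine and build):

import timeit

setup = """
import hashlib, mmh3, pyhash
h = pyhash.murmur3_32()
data = b"test" * 4096
"""

# Time each hash call over the same 16 KiB payload used above.
for label, stmt in [
    ("sha1 + digest", "hashlib.sha1(data).digest()"),
    ("pyhash murmur3_32", "h(data)"),
    ("mmh3 128-bit", "mmh3.hash128(data)"),
]:
    n = 10000
    secs = timeit.timeit(stmt, setup=setup, number=n)
    print(f"{label:20s} {secs / n * 1e6:8.2f} µs per call")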