If you know you have one more result, that is necessarily because your data pipeline actually has that result available to count; i.e. the record/tuple has been loaded into the database’s memory, and the database has determined that it is valid and fresh. (And at that point, rather than just counting it, the DB may as well send you the record itself. Counting it has already required almost all of the same work!)
Remember that MVCC exists. You can’t know how much of something you have as of a given instant without doing version deduplication and applying tombstone records. This is the reason that COUNT() in Postgres takes minutes/hours on large (>1bn-record) partitioned tables: you have to actually visit records to see whether they’re still part of the current MVCC transaction-version, and therefore whether they should contribute to the current count.
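To make the mechanism concrete, here’s a toy sketch (illustrative only, not Postgres internals) of MVCC visibility: each heap tuple carries the transaction IDs that created and deleted it, and whether it counts depends on the snapshot doing the counting.

```python
# Toy MVCC heap: each tuple version records which transaction inserted
# it (xmin) and, if it was deleted/updated, which one did so (xmax).
# Names and layout are simplified for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    value: str
    xmin: int                    # txid that inserted this version
    xmax: Optional[int] = None   # txid that deleted it (a "tombstone"), if any

def visible(t: HeapTuple, snapshot_txid: int) -> bool:
    """A tuple counts only if it was inserted as of our snapshot
    and not yet deleted as of it."""
    if t.xmin > snapshot_txid:
        return False  # inserted by a transaction we can't see yet
    if t.xmax is not None and t.xmax <= snapshot_txid:
        return False  # already deleted as of our snapshot
    return True

heap = [
    HeapTuple("a", xmin=100),
    HeapTuple("b", xmin=100, xmax=150),  # dead version of "b"
    HeapTuple("b", xmin=150),            # its live replacement
    HeapTuple("c", xmin=300),            # too new for snapshot 200
]

# COUNT(*) at snapshot 200 has to touch every tuple to decide visibility:
count = sum(1 for t in heap if visible(t, snapshot_txid=200))
print(count)  # → 2 ("a" plus the live version of "b")
```

Note that there is no shortcut in that loop: the per-tuple visibility check *is* the work, which is why counting costs nearly as much as fetching.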
That applies whether you’re “counting” or actually streaming results. Given the architecture of both traditional data warehouses, and of the map-reduce systems like Hadoop that are used to do reporting on data-lake data, you can’t know whether anything in the rest of the data set is going to actually exist when you get to it. Your data warehouse might have a 100GB heap of data in a table where everything after the first 1GB is dead tuples, such that after you’ve streamed the first 1GB of results, the rest of the streaming consists of the data warehouse sitting there silently for a minute or two (as it checks the liveness of those tuples) before saying “okay, nothing more, we’re done.”
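That “silent tail” behavior can be sketched as a generator over a mostly-dead heap (a hypothetical layout, assuming the same simplified xmin/xmax tombstone scheme as MVCC systems use): the scanner must still visit every remaining tuple to prove there are no more live ones.

```python
# Streaming over a heap where only the head is live: dead tuples still
# cost a visit each, but yield nothing. Layout is illustrative, not any
# real warehouse's on-disk format.

def stream_live(heap, snapshot_txid):
    """Yield live tuples in heap order; track how many we had to visit."""
    visited = 0
    for value, xmin, xmax in heap:
        visited += 1
        if xmin <= snapshot_txid and (xmax is None or xmax > snapshot_txid):
            yield value
    # Only after visiting *everything* can we declare the stream done.
    print(f"visited {visited} tuples")

# First 3 tuples live; the long tail was bulk-deleted (not yet vacuumed):
heap = [(f"row{i}", 10, None) for i in range(3)] + \
       [(f"row{i}", 10, 20) for i in range(3, 1000)]

results = list(stream_live(heap, snapshot_txid=50))
print(results)  # → ['row0', 'row1', 'row2'] — but only after all 1000 visits
```

The consumer sees its three results almost immediately; the remaining 997 visits are the silent minute-or-two before “we’re done.”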
And because of this, it’s not about precision. You can’t even guess. You can’t know whether you have one more result, or a billion more. Until you actually check them.
Yes, OLAP systems are different. OLAP systems operate in terms of infrequent batch inserts, giving the system time in between to build indices, generate counts, etc., all of which stay valid until the next batch insert. An index is, in a sense, a pre-baked answer to the question “what is the set of live tuple-versions this data warehouse holds?” Count the table? Just return the size of the index. If you’ve only ever built OLAP-oriented systems, maybe it feels like these are the “simple, obvious” solutions to this problem.
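A minimal sketch of why batch ingest makes this cheap (illustrative, not any real OLAP engine): between batches the data is frozen, so the count can be materialized once at insert time and served from memory until the next batch arrives.

```python
# Toy OLAP-style table: counts (and, by extension, indices) are rebuilt
# at batch-insert time, when the system has slack — so queries never
# visit tuples at all.

class OlapTable:
    def __init__(self):
        self._rows = []
        self._count = 0  # pre-baked answer, valid until the next batch

    def batch_insert(self, rows):
        # Infrequent, so we can afford to rebuild indices/counts here...
        self._rows.extend(rows)
        self._count = len(self._rows)

    def count(self):
        # ...making this O(1): no tuple visits, no visibility checks.
        return self._count

t = OlapTable()
t.batch_insert(["a", "b", "c"])
t.batch_insert(["d"])
print(t.count())  # → 4
```

The whole trick is in where the work lands: at ingest time, which an OLTP pipeline under constant write pressure can’t afford.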
But none of the things we’re talking about — global-web search engines, social-network timelines, marketplace listings — are OLAP systems. They’re OLTP. They constantly get new results in, and people expect to be able to see freshly-inserted data in the results as soon as it’s inserted. Data comes in at too high a rate to generate “dataset snapshots” à la ElasticSearch. The pipeline has to deal with data as it comes, doing as little to it as possible so that it can ingest it all at the ridiculous rates required, pushing all the work of validating tuple liveness/freshness off to query time.
And given that, OLAP properties don’t obtain in such systems. It’s basically the CAP theorem at work: Consistency (and cross-shard index-building) requires time for the system to investigate itself; Availability (and/or cross-shard freshness) requires running the system too fast to ever allow for that time. You can have one or the other, not both.