But once you're up in that "I can't even fit this in a 4-8U sled" territory (whatever that is in a given decade), you're probably doing some kind of map/reduce thing, so there's a strong incentive to have a column-major layout. If you can periodically sort by some important column, so much the better (O(log n) binary search), but mostly you've got a bunch of mappers (which you work hard to place for locality relative to the DFS replicas where the disks live, maybe on the same machine, maybe under the same top-of-rack switch or whatever) zipping through different columns or column sets and producing candidate conceptual "rows" to feed into your shuffle/sort/reduce pipeline, which handles the joins and sorts and stuff like that.
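To make that concrete, here's a toy sketch (not any real system; all names are made up for illustration): each column is stored as its own array, a mapper scans only the columns it needs and emits keyed candidate "rows", and a shuffle/sort/reduce stage groups and aggregates them. The sorted-column point lookup at the end is the O(log n) binary-search case mentioned above.

```python
from bisect import bisect_left
from collections import defaultdict

# Column-major "table": each column is its own array (in a real DFS these
# would be separate files/blocks living near the mapper).
user_id = [3, 1, 4, 1, 5]
country = ["us", "de", "us", "fr", "us"]
spend   = [10.0, 5.0, 7.5, 2.0, 1.0]

def mapper(row_range):
    """Scan just the columns we need; emit (key, value) pairs."""
    for i in row_range:
        if country[i] == "us":            # predicate touches one column
            yield user_id[i], spend[i]    # project only what's needed

def shuffle_sort_reduce(pairs):
    """Group by key (the 'shuffle'), sort, then reduce each group."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in sorted(groups.items())}

# Two "mappers" covering disjoint row ranges, one reduce over their output.
pairs = list(mapper(range(0, 3))) + list(mapper(range(3, 5)))
print(shuffle_sort_reduce(pairs))         # {3: 10.0, 4: 7.5, 5: 1.0}

# If an important column is kept sorted, a point lookup is O(log n):
sorted_ids = sorted(user_id)
print(bisect_left(sorted_ids, 4))         # index of 4 in the sorted column
```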
I don't know how Google does it, but I think almost everyone else started with something like the Hadoop ecosystem, and many with something like Hive/HQL to give a SQL-like way to express that job, especially for ad-hoc queries (long-lived, rarely-changing overnight jobs might get optimized into some lower-level representation).
Around the time I was getting out of that game, Spark was starting to get really big, due to some combination of RAM getting really abundant and a re-think of what was by then a pretty old cost model. I have no idea what people are doing now.
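Roughly the cost-model shift in question: intermediate data can live in RAM and be reused across jobs instead of being re-read from disk on every pass. A minimal PySpark sketch, with the path and column names made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

events = spark.read.parquet("/data/events")            # hypothetical columnar input
us = events.filter(F.col("country") == "us").cache()   # pin the filtered set in memory

# Both aggregations reuse the cached data rather than rescanning disk.
us.groupBy("user_id").agg(F.sum("spend").alias("total")).show()
us.groupBy("user_id").count().show()

spark.stop()
```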
I'd love it if someone with up-to-date knowledge about how this stuff works these days chimed in.
That said, they've kind of introduced it with the Search Optimization Service, which is like an index across the whole table for fast lookups, but even that is automatically maintained on your behalf.
The indices are nice, but the bigger selling feature for me is that if you have many services, and each service's data is in the warehouse, you can join across all of them.