undefined | Better HN

0 pointsmseebach9y ago0 comments

Someone once suggested to me that 'big data' begins when it doesn't fit in RAM in a single rack any more.

0 comments

Yup, that's essentially what looking that byte-size means. However, just because it doesn't fit in memory might not make it big data if it's just poorly engineered.

But many times this happens because of wasted or bloated indexes that aren't useful. Or it happens when data types are picked incorrectly.

For example, I once worked on a database where the original developer used Decimal(23, 0) as a primary key. This was on MySql and that ended up taking up 11 bytes per row, versus a Long which would have just been 8. In one table, maybe not so bad but when you start putting those primary keys into foreign key relationships... we ended up with a 1 billion row table in MySql that had 4 of these columns in it. That might make it "big data" by that definition but it's also just bad design.

Another example in that same database was using text fields in mysql for storing JSON. Since text fields in mysql are stored as separate files, this meant that every table that had one (and we had several tables that housed multiple) ran in large IO and disk access issues.

"big" data is probably a bad term to use these days because of easy it is to accidentally create a large volume of data but not need a big data solution outside of the fact that it's not the business that needs it, it's the poorly implemented system that does.

But the real reason we talk about fitting in memory comes from the core of the issue: IO. Even a super large memory set could end up being slow if it's postgres and single threaded reader that's scanning a 500 GB index. AWS offers up to 60 GB/s memory bandwidth and we'd need it for this index, since that would still take almost 10 seconds to warm up the indexes in the first place.

voltagex_9y ago

>Since text fields in mysql are stored as separate files

Bwuh? Over in MS SQL you just go for an NVARCHAR and forget about it. What is the right way to store this data (if you really do need to store the JSON rather than just serializing it again when you get it out of the DB)

chewbacha9y ago

varchar is different than a text field in mysql: http://dev.mysql.com/doc/refman/5.7/en/blob.html

It stores text fields as blobs.

I suppose now the right way would be the json data type. It didn't exist when I was working with these servers though (or they were on a much older version of MySql) https://dev.mysql.com/doc/refman/5.7/en/json.html

jules9y ago

That's soon going to be on the order of 100 terabytes, so there will be only a handful of companies doing big data ;-)

imtringued9y ago

I'm only aware of servers up to 12TB. Care to elaborate?

jules9y ago

He/she said a whole rack of servers. I actually took 30 servers of 2TB each and rounded up to 100. With 12TB per server it will already be over that.

j / k navigate · click thread line to collapse

0 comments

chewbacha9y ago

Yup, that's essentially what looking that byte-size means. However, just because it doesn't fit in memory might not make it big data if it's just poorly engineered.

But many times this happens because of wasted or bloated indexes that aren't useful. Or it happens when data types are picked incorrectly.

voltagex_9y ago

>Since text fields in mysql are stored as separate files

chewbacha9y ago

varchar is different than a text field in mysql: http://dev.mysql.com/doc/refman/5.7/en/blob.html

It stores text fields as blobs.

jules9y ago

That's soon going to be on the order of 100 terabytes, so there will be only a handful of companies doing big data ;-)

imtringued9y ago

I'm only aware of servers up to 12TB. Care to elaborate?

jules9y ago

He/she said a whole rack of servers. I actually took 30 servers of 2TB each and rounded up to 100. With 12TB per server it will already be over that.

j / k navigate · click thread line to collapse