undefined | Better HN

0 pointsjacquesm11y ago0 comments

Take a graph, you could use an SQL database to store it and do your graph analysis using SQL, or, alternatively, you could convert your graph to an extremely compact in-memory format and then do your analysis on that. Much better efficiency for the same size problem, bonus: you can now analyze much larger graphs with the same hardware.

0 comments

e12e11y ago

Or maybe a bit of both: http://stackoverflow.com/questions/27967093/how-to-aggregate...

I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".

jacquesmOP11y ago

Ok, one more example: A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.

It's very much dependent on how frequently you update the data and whether or not (re)loading the data or updating your structure in memory can be done efficient or not to determine whether or not such an approach is useful or not but going from 'too long to wait for' to 'near instant' for the result of a query is a nice gain.

In the end 'programmer efficiency' versus 'program efficiency' is one trade-off and cost of the hardware to operate the solution on is another. Making those trade-offs and determining the optimum can be hard.

But a rule of thumb is that a solution built up out of generic building blocks will usually be slower, easier to set up, will use more power and will be more expensive to operate but cheaper to build initially than a custom solution that is more optimal over the longer term.

So for a one-off analysis such a custom solution would never fly, but if you need to run your queries many 100's of times per second and the power bill is something that worries you then a more optimal solution might be worth investing in.

e12e11y ago

> A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.

Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could probably get a magnitude or two of improvement by diving down outside of SQL -- but there's a big difference between using the schema you have, and setting up a dedicated system.

I was perhaps not clear, but I was wondering if it's not often the case with real-world data, that you could create a new, tailored schema in SQL, and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.

I suppose one could argue when something stops being SQL, and start becomming SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead.". The implication being (as I understood it, at least) -- that you not only have to change how the date is stored and queried, but that SQL databases are unsuitable to that that task -- and that by going away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.

I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Are using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".

1 more reply

j / k navigate · click thread line to collapse

0 comments

e12e11y ago

Or maybe a bit of both: http://stackoverflow.com/questions/27967093/how-to-aggregate...

jacquesmOP11y ago

e12e11y ago

1 more reply

j / k navigate · click thread line to collapse