He talked about how the database world is about to change. ACID is really expensive in terms of resources, and so are the more demanding parts of a relational schema (foreign keys, checks, etc). And the architecture of classic RDBMSes is pretty wasteful -- they use an on-disk format but cache it in memory.
He talked about how there are basically three new paths for DBMSes to follow. 1) Some drop the restrictions to become faster. This is the NoSQL stuff -- you don't really need ACID for writing to a Facebook wall.
This is called a NoSQL database.
2) OLAP. In data warehousing, the usual way to do things is that you load a ridiculous amount of data into the database and then run analytical queries, which tend to use aggregation heavily and often touch just a few dimensions, while the DWH data tend to be pretty wide.
For this, a column store makes perfect sense. It is not very quick on writes, but it can do very fast aggregation and selection of just a few columns.
This is called a column store.
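To make the aggregation point concrete, here's a toy sketch (plain Python, nothing to do with any particular product) contrasting a row layout with a column layout for a wide table -- the columnar sum only touches the one array it needs:

    # Toy illustration: summing one column of a "wide" table.
    # Row store: every row carries all columns, so a scan drags them all along.
    rows = [{"id": i, "region": "EU", "amount": i * 1.5, "note": "x" * 100}
            for i in range(100_000)]
    total_row_store = sum(r["amount"] for r in rows)

    # Column store: each column is its own array; the aggregate reads only
    # the 'amount' column and never touches 'note', 'region', etc.
    amount_column = [i * 1.5 for i in range(100_000)]
    total_column_store = sum(amount_column)

    assert total_row_store == total_column_store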
3) In OLTP, you need throughput, but the question is: how big is your data, and how fast does it grow? RAM sizes tend to grow exponentially, while the number of customers you have will probably grow linearly, or maybe a bit faster, but not by much. So your data could fit into memory, now or in the future.
This allows you to build a very fast database. All you need to do is switch the architecture to memory-based: store the data in a memory-oriented format, both in memory and on disk. You don't read from the disk during operation; you just use it to persist the data on shutdown.
This is called a main memory database.
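And a minimal sketch of that idea (my own simplification with hypothetical names, not any real product's API): keep the working set in an in-memory structure, and touch the disk only to write a snapshot at shutdown and read it back at startup:

    import os
    import pickle

    # Hypothetical main-memory store: all reads and writes hit the dict in RAM;
    # the disk is only used to save a snapshot on shutdown and reload it on startup.
    class MainMemoryStore:
        def __init__(self, snapshot_path="data.snapshot"):
            self.snapshot_path = snapshot_path
            self.data = {}
            if os.path.exists(snapshot_path):
                with open(snapshot_path, "rb") as f:
                    self.data = pickle.load(f)   # reload the last snapshot

        def put(self, key, value):
            self.data[key] = value               # pure in-memory write, no disk I/O

        def get(self, key):
            return self.data.get(key)

        def shutdown(self):
            with open(self.snapshot_path, "wb") as f:
                pickle.dump(self.data, f)        # persist only at shutdown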
So, that was the presentation. It was awesome, and if someone can find it, please give us a link! My search-fu was not strong enough.
...
What interests me is that we have had NoSQL databases for some time already, and we have at least one huge (and very expensive) column store: Teradata. But this seems to be the first actual main memory database.
My dream would be to switch Postgres to main memory or column store mode, but I guess that's not happening very soon :)
Eh, not really...
This is exactly what SAP has been doing for several years via Hasso Plattner and the Potsdam Institute: https://epic.hpi.uni-potsdam.de/Home/HassoPlattner
If you've ever worked with large scale "enterprise" data warehouses, they tend to be slow and clunky. Back in 2006ish SAP took the whole Data Warehouse (well, mainly just the data cubes) and chucked it into a columnar database (at the time it was called TREX, then became BW Accelerator) - http://en.wikipedia.org/wiki/TREX_search_engine
TREX existed well before 2006. SAP also bought a Korean company called P* (IIRC) which did a non-columnar (traditional relational) database and threw it into memory. SAP also had a product called APO LiveCache - http://scn.sap.com/community/scm/apo/livecache - which lived around the same time.
This has now all evolved to a standard offering called SAP HANA - http://www.saphana.com/welcome - In its second year I believe SAP did roughly $360m in sales on HANA alone.
Also, IIRC, isn't InnoDB basically the open source version of exactly what you're talking about with "Postgres to main memory"?
edit- correction in TimesTen
In particular, his criticism of traditional databases seems based more on philosophy rather than evidence.
I'd advise reading both sides of the story:
http://lemire.me/blog/archives/2009/09/16/relational-databas...
http://lemire.me/blog/archives/2009/07/03/column-stores-and-...
http://architects.dzone.com/articles/stonebraker-talk-trigge...
http://gigaom.com/2011/07/11/amazons-werner-vogels-on-the-st...
http://dom.as/2011/07/08/stonebraker-trapped/
The date on some of those posts is interesting. 2009 is quite a while ago now, and I'd suggest that columnar datastores haven't exactly taken over. Some implementations have made some progress (eg Cassandra), but OTOH many non-traditional datastores have added traditional-database-like features (eg, Facebook's SQL front end on their NoSQL system), and traditional databases have added NoSQL features too.
If it can be done alongside the traditional architecture, be it in a fork or without touching existing code, and if you can at least start the work, it could happen soon.
What you're talking about was the HAppS-State component of the HAppS application server, a project which is indeed not active anymore. Happstack is the active fork of HAppS and had a "happstack-state" component for a while, but this was eventually rewritten from scratch and made independent of Happstack and is now known as acid-state [1]. It's even used for the new Hackage server that powers the Haskell package ecosystem.
- a bunch of the novel components (the UPS aware persistence layer, for example) aren't actually built yet
- they're pushing for people to build businesses on it already. I would characterize it as "bleeding-edge with bits of glass glued on", so this doesn't seem entirely honest.
- there's mostly a lot of breathless talk about how great and fast and scalable it is, but no mention of CAP theorem. To boil down their feature set, it's an in-memory RDBMS using the Actor model.
Regarding CAP, I'm not addressing multi-site availability at this stage--I want to get single site fully operational, redundant, and so on.
And, yes, to boil it down, it's an in-memory RDBMS using the Actor model.
The most important feature is that it performs transactions involving records on multiple nodes better than anything. This is the workload that key-value stores functionally cannot do, and which other distributed RDBMSes suffer under. It's also open source, has over 100 pages of technical docs, and is functional enough for people to pound on with some workloads--but not something to put into production yet.
edit: There's a bit of discussion further down about the SQL implementation. That's something I was very curious about as well. The projects linked below spend a lot of time working on supporting full ANSI SQL, and reducing latency by pushing down as many operations as possible. The Overview page doesn't appear to mention how filtering, aggregation, windowing, etc. work in your system.
Also, I noticed on your website that you compare InfiniSQL to Hadoop. How do you feel it compares to Impala (http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...) and Shark (https://amplab.cs.berkeley.edu/projects/shark-making-apache-...)?
"In fact, InfiniSQL ought to be the database that companies start with--to avoid migration costs down the road."
This is far from looking for hackers and early adopters. I understand that you're enthusiastic about something you've created, but let's be reasonable for a moment. I'm more than happy to try this out, but starting my business on something that isn't proven yet and has a fair number of durability features yet to be implemented is a no-no.
Edit: My posts on this thread may be coming off as negative and that's not what I intended. I'm cautious of new technology that purports to deliver the world to me on a silver platter. That said, I'd be happy to throw everything I've got at it to see what shakes loose so you can get this to production quality sooner.
If you haven't answered that question yourself, it's likely you have a partition-intolerant system, and that in real-life scenarios people will lose data.
How does a UPS ensure durability against system or program crashes, disk corruption in large clusters, and other failures that can affect a simple write()?
> The real killer for database performance is synchronous transaction log writes. Even with the fastest underlying storage, this activity is the limiting factor for database write performance. InfiniSQL avoids this limiting factor while still retaining durability
How do you plan to implement this (since it appears it hasn't been implemented)? What is your fundamental insight about synchronous transaction logs that makes InifiSQL capable of being durable while (presumably) not having a synchronously written transaction log? If your answer is the UPS, please see my first question.
Edit: I don't see any mention of Paxos anywhere. Could you explain what you're using for consensus?
The fundamental insight about not needing transaction logs is pretty simple actually: if the power is guaranteed to either stay on, or to allow the system to quiesce gracefully, then the cluster will not suddenly crash. That's the motivator for transaction logs--to make sure that the data will still be there if the system suddenly crashes. Get rid of the need for transaction logs, get rid of the transaction logs.
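To illustrate the claim (my own hedged sketch, not InfiniSQL code): the expensive part of a conventional commit is the synchronous fsync of the log record; if you assume the power either stays on or lets the system quiesce, that step -- and the log itself -- can be dropped:

    import os

    # Conventional commit path: append to a write-ahead log and fsync before
    # acknowledging, so the change survives a sudden crash.
    def commit_with_wal(log_file, table, record):
        log_file.write(repr(record).encode() + b"\n")
        log_file.flush()
        os.fsync(log_file.fileno())   # the synchronous write that caps throughput
        table.append(record)

    # "Guaranteed power" commit path: if the cluster is assumed never to lose
    # power uncleanly, the log and the fsync are skipped entirely.
    def commit_without_wal(table, record):
        table.append(record)          # memory-speed commit; durability rests on the UPS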
Regarding consensus, I expect that there will be a quorum protocol in use amongst an odd number greater than 2 of manager processes, each with redundant network and power. But the specific protocol I haven't ironed out. If there's something I can grab off the shelf then it may be preferable to implementing from scratch, but I haven't gotten there yet.
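Just to show the shape of it, a toy majority-vote check is what such a quorum boils down to (this is only an illustration, not the protocol that will actually ship):

    # Toy majority-quorum check among an odd number (> 2) of manager processes.
    # A decision (e.g. "node X is down, fail over") only takes effect if more
    # than half of the managers agree, so a split cluster can't make two
    # conflicting decisions at once.
    def has_quorum(votes_for, total_managers):
        assert total_managers > 2 and total_managers % 2 == 1
        return votes_for > total_managers // 2

    print(has_quorum(2, 3))   # True: 2 of 3 managers agree
    print(has_quorum(2, 5))   # False: a minority can't decide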
This stuff hasn't been implemented yet, but the core around which it can be implemented, has been.
Do I sense a volunteer? ;-)
Kudos for engaging the community though; please do keep us posted as you progress.
You may get more volunteers by publishing a paper outlining the core concept. I clearly remember reading things like the H-Store paper, or the Dremel paper, and saying "damn, this makes sense and is really cool". Implementation details can be worked out and engineering approaches tried. But the underlying concept should be clear.
Clause 13 is a real pain to deal with when exposing this over the network.
I guess the developer wants to sell a license (like the mysql java client GPL'ing).
Can't blame him, he needs to get paid.
I thankfully don't have to - this means I don't need to talk to lawyers about this.
Because AGPL took away the most important bit of unassailable ground I had to argue with when it came to deploying GPL - "Using this code implies no criteria we have to comply to, only if we distribute it".
Clause 12 and 13 - basically took that away from me completely.
Look, I'm not going to tell you what license to use.
But leave me enough room to complain that I have had trouble convincing people that we can use AGPL code in a critical function without obtaining a previous commercial license by paying the developer.
On their site though, it says no sharding and that it can do these 500Ktx/sec even when each transaction involves data on multiple nodes. Does this performance degrade directly in relation to the number of nodes a tx needs to touch?
A simple, straightforward, wire-level description of how things work when coordinating and performing transactions across nodes would be very useful. There's a lot of excited talk about actors, but nothing that really examines why this is faster, or any sort of technical analysis.
If you want more details about how things work when performing transactions, I think that the overview I created would be a good starting point. It probably doesn't have everything you'd ask for, but I hope that it answers some of your questions: http://www.infinisql.org/docs/overview/
And to answer your question about performance degradation pertaining to number of nodes each transaction touches, I have not done comprehensive benchmarks of InfiniSQL measuring that particular item. However, I do believe that as multi-node communication increases, throughput will tend to decrease--I expect that the degradation would be graceful, but further testing is required. The benchmark I've performed and referenced has 3 updates and a select in each transaction, all very likely to be on 3 different nodes.
I'd like to invite you to benchmark InfiniSQL in your own environment. I've included the scripts in the source distribution, as well as a guide on how I benchmarked, as well as details on the specific benchmarking I've done so far. All at http://www.infinisql.org
I'd be glad to assist in any way, give pointers, and so on, if there are tests that you'd like to do. I also plan to do further benchmarking over time, and I'll update the site's blog (and twitter, etc) as I do so.
Please communicate further with me if you're curious.
Thanks, Mark
InfiniSQL says don't worry, we'll just use 2PC. But not just yet, we're still working on the lock manager.
I look forward to your exegesis of how you plan to overcome the well-documented scaling problems with 2PC. Preferably after you have working code. :)
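For anyone who hasn't hit those problems first hand, here's roughly why 2PC hurts (a generic sketch in Python, not anything from InfiniSQL): the coordinator makes two synchronous round trips per transaction, and every participant that votes "yes" sits on its locks until it hears the outcome.

    # Bare-bones two-phase commit coordinator. The pain points: two synchronous
    # round trips per transaction, and participants hold their locks while
    # waiting for the global decision.
    def two_phase_commit(participants, txn):
        # Phase 1: ask everyone to prepare (each participant locks its rows
        # and promises it can commit).
        votes = [p.prepare(txn) for p in participants]

        # Phase 2: commit only if every vote was "yes", otherwise abort.
        if all(votes):
            for p in participants:
                p.commit(txn)    # locks are released only now
            return "committed"
        for p in participants:
            p.abort(txn)
        return "aborted"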
I'm not much of an expert at all, but I like reading papers on databases. It seems to me that if you really did discover a breakthrough like this, you should be able to distill it to some basic algorithms and math. And a breakthrough of this scale would be quite notable.
If I'm reading correctly, there's no replica code even involved ATM. So 500Ktx/s really boils down to ~83Ktx/sec per node, on an in-memory database. Is it possible on modern hardware that this is just what to expect?
I am curious, and I'm not trying to be dismissive, but the copy sounds overly promising, without explaining how, even in theory, this will actually work. I'd suggest to explain that part first, then let the engineering come second.
I can't seem to find the word "Reliable" or any variation thereof anywhere in there.
In fact, that word is nowhere to be found in the blog post or on the entire InfiniSQL page (not in the Overview, Guides, Reference or even the FAQ). I find this quite remarkable, since reliability is the true virtue of an RDBMS, not speed or even capacity. At least that's what PostgreSQL aims for, and this being another RDBMS that is also open source, I see it as InfiniSQL's only direct competitor.
It's nice that this is scalable, apparently, to ridiculous levels, but if I can't retrieve what I store in exactly the same shape as I stored it, then that's a bit of a buzz kill for me.
Can we have some assurance that this is the case?
There's a note on "Durability" and a shot at log file writing for transactions, and presumably InfiniSQL uses concurrency and replicas to provide it. In the Data Storage section, it mentions that InfiniSQL is still an in-memory database for the most part http://www.infinisql.org/docs/overview/#idp37053600
What they're describing is a massively redundant, UPS backed, in-memory cache.
Am I wrong?
I promise that I have every intention of making InfiniSQL a platform that does not lose data. I have a long career working in environments that demand 100% data integrity. If I de-emphasized it, it was not intentional.
PostgreSQL doesn't scale for OLTP workloads past a single node. There are a handful of products similar to InfiniSQL (google for the term NewSQL for a survey of them).
And yes, a redundant UPS-backed in-memory cache. I have some ideas on how to do regular disk backing as well (which I'm sure you've read).
And if a more traditional log-based storage layer is added, InfiniSQL will still scale nearly linearly across nodes horizontally. Multi-node scale and in-memory are not dependent on one another. Though I believe that redundant UPS systems, managed by a quorum of administrative agents, can provide durability just like writing to disk.
Are you familiar with high end storage arrays, such as from HDS or EMC? They write to redundant memory, battery backed and managed by logic in the arrays. I'm just moving that type of design to protect the database application itself, up from the block layer.
And some people trust their datacenter power--they use pure in-memory databases without UPS already, or they do things like asynchronously write transaction log, which also sacrifices durability. For those groups, InfiniSQL ought to be just fine, without UPS systems.
But I agree that other write (and read) activity going on in the background and foreground also limits performance--and in fact, I've seen the index write bottleneck that you describe in real life, more so than simple transaction log writes. So, you're correct.
I've read about Toku, but I really doubt that it writes faster to disk than writing to memory. Are you really trying to say that?
I think it would be great for InfiniSQL to be adapted to disk-backed storage, in addition to memory. The horizontal scalability will also apply, making for a very large group of fast disk-backed nodes.
I think your input is good.
If you're planning to write data faster than disk bandwidth, then you have no hope of being durable and we're talking about problems too different to be worth comparing, and in that case I retract my comment.
I don't understand what distinction you're trying to make between the "array itself" and the "log file has a limit to how many blocks can be appended". Can you clarify what limit you're talking about?
Yes, the issue usually isn't the transaction log append speed. Instead, it happens too frequently that the log is configured to be too small. A log file switch causes a flush of accumulated modified data blocks of tables and indexes [buffer cache flush in Oracle parlance] from RAM to disk. With a small log file size, the flush happens too frequently for too small an amount of modified data--this is where the random IO that GP mentioned bites you in the neck.
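A quick back-of-the-envelope run of that effect (made-up numbers, purely illustrative): with small redo logs the switch interval collapses and the checkpoint flush is running almost all the time.

    # Illustrative arithmetic: how often a log switch (and the checkpoint flush
    # it triggers) happens for a given redo generation rate and log file size.
    redo_rate_mb_per_sec = 50                 # made-up sustained redo rate
    for log_size_mb in (64, 512, 4096):
        seconds_between_switches = log_size_mb / redo_rate_mb_per_sec
        print(f"{log_size_mb:5d} MB log -> switch every {seconds_between_switches:6.1f} s")
    # 64 MB   -> a switch (and flush) every ~1.3 s: random IO all the time
    # 4096 MB -> a switch every ~82 s: flushes are batched, far less random IO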
If you actually have transaction processing at this scale and need that performance, the RAM cost is not a major issue.
It scales as long as throughput increases while new nodes are added. I've done benchmarking up to 12 nodes, and it continued to scale nearly linearly. (http://www.infinisql.org/blog). I'd like to push it further, but need $$$ for bigger benchmark environments.
I hope there's room for competition in this space still.
I use #line statements because I get compiler messages from time to time that put things on the wrong line after having imported headers.