Many businesses (including startups) have moved to using document stores for high-read environments and scraping nightly drops into their backend analytics systems. This is smart: you don't want to run summing/aggregation on a live transactional system, for (hopefully) obvious reasons.
EDIT: it's also worth noting that map/reduce is typically much more powerful when aggregating large datasets. When trying to run analytics on top of a transactional system, developers like Ray here would end up with multiple joins and groupings - all of which slow everything down. Map/reduce certainly isn't perfect, but the author dismisses it as difficult witchcraft when, in practice, parallel execution of MR queries can greatly decrease resources and time to information.
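To make the contrast concrete, here is a toy sketch of the map/shuffle/reduce shape of an aggregation, with hypothetical order records and plain Python standing in for a real MR framework (which would run the map and reduce phases in parallel across workers):

```python
from collections import defaultdict
from functools import reduce

# Hypothetical order records; in a real MR job these would be
# partitioned across many workers.
orders = [
    {"customer": "a", "total": 10.0},
    {"customer": "b", "total": 5.0},
    {"customer": "a", "total": 7.5},
]

# "Map" phase: emit (key, value) pairs.
mapped = [(o["customer"], o["total"]) for o in orders]

# "Shuffle" phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# "Reduce" phase: combine each group's values independently.
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
print(totals)  # {'a': 17.5, 'b': 5.0}
```

Because each group reduces independently, the reduce phase parallelises trivially, which is where the resource and time savings come from.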
I sort of think we've moved beyond this discussion.
I think this article is targeted at the latter: pre-MVP and just post-MVP. For those startups, having two databases with one dedicated to a backend analytics system reeks of premature optimization.
If you grow to more than two people, then taking a mirror dump of the prod DB to run queries against is pretty trivial effort.
Set up a slave of your MySQL or Postgres database (no, the slave is not a backup, but dumping it regularly is an easy first step toward a very basic backup setup), hosted in a different data centre; if you're a two-person company, you now also have somewhere to run analytics whenever you feel ready.
It can take less than an hour of effort and a few tens of dollars a month in extra costs for a small system, yet it makes a tremendous difference in resilience and gives you that DB to run analytics against "for free" whenever you do need it.
It's free software; you don't have to pay for two instances of Oracle.
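The "dump the replica, query the copy" idea in miniature, using sqlite as a runnable stand-in for the real MySQL/Postgres setup (the table and data are hypothetical):

```python
import sqlite3

# Stand-in for the replica (in a real setup this would be the
# MySQL/Postgres slave, not prod itself).
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
prod.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (5.0,)])
prod.commit()

# "Dump" into a separate analytics database, then run the heavy
# aggregation there instead of against the live system.
analytics = sqlite3.connect(":memory:")
prod.backup(analytics)
(total,) = analytics.execute("SELECT SUM(total) FROM orders").fetchone()
print(total)  # 15.0
```

The point is the separation: the aggregation never touches the connection serving live traffic.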
One thing that will quickly kill a biz is combining the functions of PROD and DEV/TEST. Making the DEV/TEST box the DEV/TEST/REPORTS box is not a big deal, and you can't run a (real) biz without a DEV/TEST box.
You can use a SQL query for this, or a simple map/reduce. Either way, my argument is to focus on a transactional design that works best for your system, and don't conflate reporting needs (i.e. BA queries) with your live system.
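For the SQL route, the reporting query is usually just a GROUP BY; a minimal sqlite sketch with hypothetical sales data (run it against your replica or dump, not the live system):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("north", 100.0), ("south", 50.0), ("north", 25.0)])

# The reporting query: aggregation the database does for you.
rows = db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 125.0), ('south', 50.0)]
```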
Also, those mistakes in modelling the domain and in enforcing the constraints are often there to stay and slowly become impossible to fix. Once you have 10,000 records that do not fall into a few well-specified states, it's hard to go through all of them, find some common denominators, and migrate the database. Not to mention that, with the mess people can make in the code and the messy stacks in use today, it's easy to introduce bugs that are hard for anyone to notice but seriously harm your business.
The amount of fashionable nonsense in software engineering seems to be higher than ever, unfortunately.
This is a startup. There is no problem domain. There is no spec. There's none for a startup. You may not even have code.
If you're doing a "startup" but have this ironed out, great; everything you said then applies. But it's not a startup: since you've found your business model, it's a small company.
And I would love to enforce constraints in the DB, but unfortunately I already had "primary keys" that repeat, unbeknownst to the project's customers.
Reinventing the wheel is great fun. Without constraints you can appear to be amazingly productive. But it is all a fantasy because without constraints and a rigorous process you are probably piling up technical debt faster than anything else.
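A primary-key constraint is exactly the guardrail that prevents the repeated-"primary key" situation described above; a minimal sqlite sketch (hypothetical table):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")

# A second row with the same key is rejected at write time, instead of
# becoming a data-quality problem you discover months later.
try:
    db.execute("INSERT INTO users VALUES (1, 'bob')")
    duplicated = True
except sqlite3.IntegrityError:
    duplicated = False
print(duplicated)  # False
```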
Relational DBs may even be the wrong choice for some specific problems.
And about this quote from the article: "At some point, you will need to ask your primary database questions. If you chose the wrong database, this is where things get tricky. "
Yes, this is correct. However, I know how to read the manual of whatever DB I'm using, and maybe code something simple to process the output into the format of my liking.
It is the same thing as choosing a Linux distribution or a JDK. See the references here:
http://bingobo.info/blog/table-of-contents.jsp
BTW, your title should have "relational" instead of "relation".
This is not to say you couldn't turn Postgres into something similar: put the data on a ramfs, relax various write guarantees (there are lots of knobs in PG), etc.
It depends on which NoSQL database. Depending on your problem, you can probably find a NoSQL DB optimised for it. It will often be unclear whether that NoSQL DB is actually better than a relational database until you try it.
Examples:
High write throughput: Cassandra
Simple key-value: Redis
Text search: Solr/Elastic Search
etc..
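The key-value niche in that list is essentially this access pattern; a plain Python dict stands in here for the Redis server (with redis-py the calls would have the same shape: `r.set(key, value)` / `r.get(key)`):

```python
from typing import Optional

# Stand-in for a Redis connection; no server needed for the sketch.
store = {}

def cache_set(key: str, value: str) -> None:
    store[key] = value

def cache_get(key: str) -> Optional[str]:
    return store.get(key)

cache_set("session:42", "user=alice")
print(cache_get("session:42"))  # user=alice
```

Everything is a lookup by exact key; there are no joins, no range scans over arbitrary columns, which is precisely why the specialised store can be so fast at it.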
That said, I would hesitate to describe Redis as a database at all. A key characteristic of databases is that they store every write durably. Redis can checkpoint its state periodically, but as I understand it, it either can't, or typically isn't used to, safely keep every write. Redis is something in between a database and memcached. I doubt there's ever a situation where you have to choose between PostgreSQL/Cassandra/CouchDB and Redis; Redis is something you would use in addition to a database.
As for text search: RDBMSs have full-text search, and at least in the case of PostgreSQL it seems pretty good - see slide 49 in http://es.slideshare.net/billkarwin/full-text-search-in-post... from 2009. You might not want to be leaning on your database for text search once you're at scale (for operational reasons more than performance ones), but it's a plausible way to start.
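The shape of an in-database full-text query, using SQLite's FTS5 as a runnable stand-in (in Postgres the equivalent pieces would be a `tsvector` column queried with `to_tsquery`); the documents are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 is bundled with most SQLite builds shipped with CPython.
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
db.executemany("INSERT INTO docs (body) VALUES (?)",
               [("relational databases scale fine",),
                ("graph traversal with gremlin",)])

# Full-text MATCH inside the same database that holds the data.
rows = db.execute(
    "SELECT body FROM docs WHERE docs MATCH 'relational'"
).fetchall()
print(rows)  # [('relational databases scale fine',)]
```

For a startup, this means search can live next to the rest of the data until operational needs justify a dedicated engine like Solr or Elasticsearch.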
For areas other than structure, the difference can be more like the difference between a general purpose programming language and a domain-specific language. SQL databases are general purpose, so they offer a large, nicely integrated package of features, while NoSQL databases often sacrifice some of those features to optimize for some niche. From this perspective, the SQL database can be the best choice for a startup trying to find its niche with the NoSQL chosen as an optimization once that niche has been identified.
The fact that a new business can't be sure what its business will end up being is an argument both for and against using a SQL database.
Now, if you are doing anything more mundane (like, for example, logging every bank operation in a country), that's overkill, and non-local constraints are a great thing to have. Thus, I'd advise not using it unless you have a clear necessity.
And always keep in mind that changing your data is hard, whatever technology you are using. Relational databases make the change itself hard, while schemaless databases make maintaining the software that reads the data hard. Schemaless gives you a small upfront benefit, for a huge cost later.
RDBMS = Swiss Army chainsaw: it can do anything, although maybe painfully and slowly.
NoSQL = 5/16th-inch torque-preset wrench for tightening SMA connectors to 10 inch-lbs. It can't do much else, but it's near perfect at its tiny little area of expertise.
On a meta level it's a battle between hardware and applications, where app demands have (momentarily, of course) fallen behind commodity hardware ability.
But again, different NoSQL DBs have different purposes.
As far as I know, most graph databases support transactions and offer great scalability. Such databases are also schema-less and can be queried with Gremlin, a powerful graph traversal language (see www.tinkerpop.com).
With respect to scalability and transactions, Titan (http://thinkaurelius.com/) looks very promising: it supports various backends for storage (Cassandra, HBase, etc.) and indexing (currently Elastic Search and Lucene). Graph analytics can be done via Faunus (http://thinkaurelius.github.io/faunus/), backed by Hadoop.
There are other vendors out there (Neo4J, OrientDB, etc.) which offer interesting solutions worth looking at - I'm just a bit less familiar with them.
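What a Gremlin-style traversal expresses can be seen in miniature: a "who is within N hops of me" query over a hypothetical social graph, sketched with a plain adjacency list (the kind of hop-by-hop query graph databases index for natively):

```python
from collections import deque

# Hypothetical social graph as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_hops(start: str, hops: int) -> set:
    """Everyone reachable from `start` in at most `hops` edges (BFS)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen - {start}

print(sorted(within_hops("alice", 2)))  # ['bob', 'carol', 'dave', 'erin']
```

In a relational database, each extra hop is another self-join; in a graph database it's one more step of the same traversal.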
The major downside I see with graph databases is that most of them are fairly recent and their ecosystem is tiny (though growing). Should a startup venture onto such young technologies, or stick to mature and battle-tested solutions (i.e. relational databases)?
Could startups use this kind of graph "NoSQL" database? I don't see why not. If your startup is some kind of social network, graph databases are certainly an option worth considering. If I were to create a startup, I'd hardly use a document database like MongoDB, but I would seriously consider using a graph database. In the end, it's all about having the right tool in hand, and knowing how to assess what is "right" for you.
So you properly normalized your entire system, customer billing transaction records all the way up to article tags. Then article tags gets too huge. So next version looks at RDBMS and Redis, and the next version after that only looks at Redis. Customer billing transactions remains on a "real" DB and the tag cloud lives on redis. And the problem with that is... what exactly?
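The Redis-side tag cloud in that story is just set operations; a sketch with plain Python sets standing in for Redis sets (with redis-py this would be `r.sadd(f"tag:{tag}", article_id)` and `r.scard(...)` for the counts):

```python
# Stand-in for Redis sets keyed by tag name; no server required.
tags = {}

def tag_article(tag: str, article_id: int) -> None:
    tags.setdefault(tag, set()).add(article_id)

tag_article("python", 1)
tag_article("python", 2)
tag_article("databases", 1)

# Tag-cloud counts: one set-size lookup per tag, no joins required.
cloud = {tag: len(ids) for tag, ids in tags.items()}
print(cloud)  # {'python': 2, 'databases': 1}
```

Meanwhile the billing transactions, which actually need constraints and durability, stay in the relational database.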
It's obsolete thinking: I can't have two databases because we're a poor startup, and the only databases that exist are DB2 and Oracle, and everyone knows they're super expensive, so super expensive times two is unaffordable. Dude, it's almost 2014, not 1980. Postgres/MySQL/Redis: it's all free.
Thing is, most SQL databases at scale are a bit of a horror too. Have you seen real-life production relational databases? Gawd. Hacks on hacks. Then you add another database. And another analytics database. And a bunch of point-to-point data feeds. Argh.
But hey. That's data.
If you think choosing SQL will solve your analytics woes down the line -- it's just not true. You're in for some pain no matter what you do.
... That's unless you get a pristine schema the first time. Which, if you're in a startup, probably means you're working on the wrong problem.
That's not an argument for using NoSQL (I use MongoDB daily, but I've got plenty of love for PostgreSQL). It's a rebuttal of the claim that SQL magically solves a different problem.
A relational database might be just as wrong as a NoSQL one if all you need is Redis.
Then you get a little closer to the truth.
"All the while moving work onto the developers to standardize how they handle different migration cases."
I know a startup is fast and bla bla... BUT your team should know the tools that you are using... For me, SQL DBs force me to add a new field and some kind of value, and I don't like being forced into a solution.
"In document stores, you have two choices: store related data as sub-documents, or store related data as separate documents with references. It is up to the developers to understand the trade-offs of both approaches. Selecting one over the other can lead to performance gains or issues, scalability issues and above all, make asking certain questions of the data a lot harder."
Again: know the tools you are using. And, for example, MongoDB has good ORMs too.
"But that takes much more forethought and is dependent on a particular problem."
If your startup is doing something new and shiny, you don't have the knowledge and forethought, and you often don't know what particular problems will come at you.
Most of the points read like: you learned SQL at university, so now you know it (but in real life you don't), and now you use it because you know how to normalize a database. This argument is often used to say why Java is so great or why JavaScript is bad.
I personally started with PHP, then moved to Rails, and now to Meteor (which uses MongoDB), and never before Meteor could we make a good prototype so fast, which for a startup is very important.
So yeah: if you are comfy with SQL, use it; if you're comfy with NoSQL, use it.
However, relational databases will have a hard time with big data: once your dataset is bigger than a single database can hold, you have to spread it out, and then you have no relational integrity.
Just because you're storing a huge amount of one specific type of data, that doesn't prevent you from taking advantage of the features of a relational database.
Why not?
1) Relational databases typically aren't optimised for write throughput. It's quite possible to do it, but you'll need fast and large disks (e.g. FusionIO, in a SAN or something).
2) Location-tracking applications typically don't require interactive queries; generally it's more of a batch-based system that can be run offline.
Saying you are not going to be able to use a relational database is overstating it a bit in my view.
Clearly you can make it work, but something like Cassandra will give you better write throughput, won't force you to rely on a SAN/NAS for data storage, and will let you use Map/Reduce to batch-process the data.
If you have a billion customers each with a million records, you are the Google Maps location thingy and are not a startup anymore, and a relational solution may or may not work.
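To show the batch-oriented shape of that workload: a per-device aggregation over a day's location pings, the kind of job you'd express as a Map/Reduce pass over stored partitions rather than as an interactive query (the data is hypothetical):

```python
from collections import Counter

# Hypothetical day's worth of location pings: (device_id, lat, lon).
pings = [
    ("dev1", 51.5, -0.1),
    ("dev2", 48.8, 2.3),
    ("dev1", 51.6, -0.1),
]

# Offline batch job: count pings per device. In a real system this
# would run in parallel over each storage partition's slice of pings.
per_device = Counter(device for device, _lat, _lon in pings)
print(dict(per_device))  # {'dev1': 2, 'dev2': 1}
```

Nothing here needs to answer in milliseconds, which is why the write-optimised store plus a nightly batch job is a reasonable fit.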
If you have ten users the ideal database is probably ten interns and some whiteboards. I'm not kidding.
I observe there is a common claim brought up every week on HN about fake it till you make it. You don't automate (insert menial task here) until you have ten customers.
I observe there is another common claim brought up every week on HN about how you need a massively scalable design which can never change from day one, because your schema, either formal or informal, is perfect and unchanging, LOL.
Those two stylistic outlooks are not compatible.
Start with something that has too many features, too many abilities, too much room for expansion, like a relational DB, and then later on if you need to, put some stuff on another platform. IF you need to. IF your company survives. Lots of IF.
But one can argue that being careless with schema design can break your API and make the codebase messy.
I guess I will stick with the hard work now... I guess being careless with the schema will definitely bite me.
I find that a poor argument. One can use an ORM that automatically creates a schema based on classes. E.g. I like Ebean with DDL generation. You just write classes and add @Entity annotations. Ebean automatically creates the schema. Combine this with an embedded database, such as h2, and there is virtually nothing to set up.
Once you are out of the rapid iteration phase, you can take the latest Ebean-generated schema and use it as the basis for proper migrations for later changes.
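The Ebean comment is Java-specific, but the pattern ("write classes, derive the schema") can be sketched with the Python stdlib; this is a toy DDL generator, not a real ORM, with a hypothetical `User` class and a small type map:

```python
import sqlite3
from dataclasses import dataclass, fields

# Toy mapping from annotation types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "str": "TEXT", "float": "REAL"}

@dataclass
class User:
    id: int
    name: str

def ddl_for(cls) -> str:
    """Derive a CREATE TABLE statement from a dataclass's fields."""
    cols = ", ".join(
        f"{f.name} {SQL_TYPES[f.type if isinstance(f.type, str) else f.type.__name__]}"
        for f in fields(cls)
    )
    return f"CREATE TABLE {cls.__name__.lower()} ({cols})"

schema = ddl_for(User)
print(schema)  # CREATE TABLE user (id INTEGER, name TEXT)

# An embedded database (h2's role in the Java setup) means zero setup:
db = sqlite3.connect(":memory:")
db.execute(schema)
```

You only write the classes; the schema follows, which is what makes iteration cheap before you switch to hand-written migrations.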
So you don't have to create bullshit m2m tables with tedious joins for a fucking tagging system.
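For contrast, here is the m2m version being complained about, in miniature (sqlite, hypothetical table names): three tables and a double join for every tag lookup.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_tags (
        article_id INTEGER REFERENCES articles(id),
        tag_id INTEGER REFERENCES tags(id)
    );
""")
db.execute("INSERT INTO articles VALUES (1, 'Choosing a database')")
db.execute("INSERT INTO tags VALUES (1, 'databases')")
db.execute("INSERT INTO article_tags VALUES (1, 1)")

# The join you write for every "articles with this tag" query.
rows = db.execute("""
    SELECT a.title
    FROM articles a
    JOIN article_tags j ON j.article_id = a.id
    JOIN tags t ON t.id = j.tag_id
    WHERE t.name = 'databases'
""").fetchall()
print(rows)  # [('Choosing a database',)]
```

Whether that boilerplate is "bullshit" or a worthwhile price for referential integrity is exactly the trade-off this thread is arguing about.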
It is not true that relational databases must have a single point of failure (SPoF) or must use failover: MySQL Cluster is a sharded multi-master distributed database without a SPoF.
On the other hand Redis, for example, is a master-slave failover NoSQL datastore.