HSTORE can be fully indexed (GiST and GIN). You just have to roll your own object graphs for nesting, if that's what you need to do.
I swear I have typed this exact same comment previously. Déjà vu, maybe.
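As a minimal sketch of the hstore indexing mentioned above (table and column names are hypothetical):

```sql
-- Requires the hstore extension.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE profiles (id serial PRIMARY KEY, attrs hstore);

-- A GIN index supports the containment (@>) and key-existence (?) operators.
CREATE INDEX idx_profiles_attrs ON profiles USING gin (attrs);

-- This lookup can use the index:
SELECT id FROM profiles WHERE attrs @> 'plan=>premium';
```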
Those "Indexes on Expressions" are a really great feature that can also be combined with XML (not just JSON) and any other types. I recommend everyone have a look at them:
http://www.postgresql.org/docs/9.2/static/indexes-expression...
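As a sketch, an expression index lets queries over a computed value hit the index directly (names here are hypothetical, and note the JSON ->> operator requires 9.3+):

```sql
-- Classic example: case-insensitive email lookups.
CREATE INDEX idx_users_lower_email ON users ((lower(email)));

-- The same idea applied to a JSON field (->> requires PostgreSQL 9.3+):
CREATE INDEX idx_events_type ON events ((data ->> 'type'));

-- A query matching the indexed expression can use it:
SELECT * FROM users WHERE lower(email) = 'alice@example.com';
```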
One of the reasons MongoDB is so popular is that it is a fantastic database for developers. As a Java developer, I can work with sets, hashmaps, and embedded structures in my code and have them map effectively 1-to-1 to the database. It's akin to an object database, meaning you can focus higher up in the stack.
With the SQL ORMs you can't avoid having to deal with the ER model.
MongoDB et al. are basically built around the assumption that a schema is never worth the complexity. It's a bold claim that contradicts decades' worth of database research.
And for the record, where I work we use a SQL store, Redis, and MongoDB, each where the use case suits it.
I've always liked the paradigm of doing analysis on "slower" data stores, such as Hadoop+Hive or Vertica if you have the money. Decoupling analysis tools from application tools is both convenient and necessary as your organization and data scales.
PostgreSQL scales surprisingly well for this purpose, and is much nicer for interactive queries than Hadoop/Hive. We use Impala[1] for some larger datasets, but Impala is comparatively new, and it's nice to have something as battle-tested as postgres here.
As for the "why do we need realtime?": In my mind the benefit of a near-realtime replica is not that you actually often need it, but that it means you never have to ask the question of "Was this snapshot refreshed recently enough?", and never end up having to wait several hours for an enormous dump/load operation, when you realize you did need newer data.
[1] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...
I do agree that PostgreSQL would be nicer for interactive queries. Waiting for a MapReduce job to spin up is a bit of a buzzkill.
With regard to your use cases, what sort of questions have you found yourself answering the most? Do you have analytics applications running off of this?
How was your overall experience with Impala? Did you have a fairly new Hive cluster to try it out on, or did you spin up a new one, since Impala can only read certain file formats (i.e., no custom SerDes)?
Also, are the Hive/Hadoop datasets more for data exploration, while this PostgreSQL solution is for smaller datasets that return in a few seconds and would perform poorly in Hive due to the cost of setting up a MapReduce job?
(In full disclosure, I wrote mongo_fdw for PostgreSQL.)
I'm thinking of this as something like polyglot memoization. Pretty cool when you think about it. Frequently need something that is slow in NoSQL but fast in SQL? Memoize it to your SQL datastore. The alternative has always been to write it to two places; I kind of dig moving this work out to the datastore instead.
I'm thinking that plenty of people will find this useful.
MongoDB is great for failover and for rapid development or prototyping. SQL is great for reporting or analytics, since you can do all kinds of aggregates and JOINs right in the database.
The edge cases where you can't represent the data perfectly aren't a huge deal for this use case -- because it's a one-way export, you don't have to be able to round-trip the data, and as long as you can export the data you want to run analysis on, it doesn't matter if there's some you can't get.
SELECT c.email FROM customers c, subscriptions s WHERE c.subscription_id = s.id AND s.status = 'active' AND s.trial_start IS NOT NULL;
(where, of course, the customers and subscriptions tables would be virtual views over your customers and subscriptions)
This kind of comment shows how little knowledge you have about NoSQL and SQL. It's not about SQL vs. NoSQL; it's about using the right technology for the job.
The question is perfectly valid. In many scenarios (not necessarily Stripe's), PostgreSQL is fast enough to do the job. Stop putting people down for legitimate engineering questions.
Try not to be condescending and your point will be better received. "Right technology," as I'm sure you're aware, has as much to do with subjectivity as with appropriateness. Familiarity, workflow, and ease of use (and did I mention familiarity?) cannot be overstated, even when the perceived benefits are considered.
Read: religion.
Some of the people who rail against NoSQL may be deriding it out of a knee-jerk reaction, but others are simply frustrated with developers who, as Ted Dziuba would say, "value technological purity over gettin' shit done".
Relational databases were created in the first place to solve these very problems around transactionality and analytics for finance.
This library is a beautiful example of reinventing the wheel, and of creating a patchwork of unnecessary, and ultimately brittle, infrastructure.
https://github.com/10gen-labs/mongo-connector/tree/master/mo...
Seems to be high quality, and supports replica sets.
I'd also like to mention a project I've been contributing to, Mongolike
[My fork is at https://github.com/e1ven/mongolike , once it's merged upstream, that version will be the preferred one ;) ]
It implements mongo-like structures on TOP of Postgres. This has allowed me to support both Mongo and Postgres for a project I'm working on.
I thank them for releasing this.
It's much more effective and efficient to use a SQL query than it is to throw together a huge amount of imperative JavaScript code (that's usually very specific to a single NoSQL database, as well) merely to perform the equivalent query.
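For example, a rollup that would otherwise take a hand-written map/reduce job can be a single declarative query (table and column names here are hypothetical):

```sql
-- Total spend per customer, filtered to big spenders, in one query.
SELECT customer_id, sum(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING sum(amount) > 1000
ORDER BY total_spent DESC;
```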
It's much safer to use a database that offers true support for transactions and constraints, rather than trying to hack together that functionality in some Ruby or PHP data layer code, or relying on some vague promise of "eventual consistency", for instance.
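A sketch of what the database can enforce for you (hypothetical schema): a CHECK constraint plus a transaction make the classic transfer example safe with no application-layer bookkeeping.

```sql
-- The constraint rejects overdrafts no matter which code path writes.
CREATE TABLE accounts (
    id      serial PRIMARY KEY,
    balance numeric NOT NULL CHECK (balance >= 0)
);

-- Both updates commit together or not at all.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```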
It's much more maintainable, and leads to higher-quality data, to spend some time thinking about a schema, rather than just arbitrarily throwing data into a schema-less system, and then having to deal with the lack of a schema throughout any application code that's ever written.
Aside from an extremely small and limited handful of situations (Google and Facebook, for instance), relational databases are the best tool for the job.
Honestly, I don't think you could be more misinformed if you tried.
Hint: Google "Big Data".
Look at the old NoSQL systems: InterSystems Caché got a SQL interface, and GT.M (in the PIP framework) also got SQL.
My impression is that MongoDB looks a lot like MUMPS storage, with globals in JSON.
I actually played with mongo_fdw. At this point, it's a really cute hack, and useful for some things, but it doesn't give Postgres enough information and knobs to really let the query planner work effectively, so it ends up being really slow for complex things. I do love the concept, though.
Hey, one way to do that is to use the MongoDB foreign data wrapper, also mentioned in some of the earlier threads.
mongo_fdw (https://github.com/citusdata/mongo_fdw) allows you to run SQL on MongoDB on a single node. Citus Data allows you to parallelize your SQL queries across multiple nodes (in this case, multiple MongoDB instances) by just syncing shard metadata. So you would effectively run SQL on a sharded mongo cluster without moving the data anywhere else.
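A rough sketch of wiring up mongo_fdw on a single node (the exact option names follow the project's README and should be treated as assumptions):

```sql
CREATE EXTENSION mongo_fdw;

-- Point Postgres at a running mongod instance.
CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw
    OPTIONS (address '127.0.0.1', port '27017');

-- Map a collection to a foreign table, then query it with plain SQL.
CREATE FOREIGN TABLE customers (name text, email text)
    SERVER mongo_server
    OPTIONS (database 'appdb', collection 'customers');

SELECT email FROM customers WHERE name = 'Alice';
```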
Another idea could be to use MoSQL to neatly replicate each mongo instance to a separate PostgreSQL instance, and then use Citus Data to run distributed SQL queries across the resulting PostgreSQL cluster.
I have read that they announced some new features for version 2.x; so, is it really that great?