HSTORE can be fully indexed (GiST and GIN). You just have to roll your own object graphs for nesting, if that's what you need to do.
I swear I have typed this exact same comment previously. Déjà vu, maybe.
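As a minimal sketch of the hstore indexing mentioned above (table and column names are hypothetical):

```sql
-- Requires the hstore extension.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE profiles (id serial PRIMARY KEY, attrs hstore);

-- A GIN index supports the containment (@>) and key-existence (?) operators.
CREATE INDEX idx_profiles_attrs ON profiles USING gin (attrs);

-- This lookup can use the index:
SELECT id FROM profiles WHERE attrs @> 'plan=>premium';
```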
Those "Indexes on Expressions" are a really great feature that can also be combined with XML (not just JSON) and any other types. I recommend everyone have a look at them:
http://www.postgresql.org/docs/9.2/static/indexes-expression...
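As a sketch, an expression index lets queries over a computed value hit the index directly (names here are hypothetical, and note the JSON ->> operator requires 9.3+):

```sql
-- Classic example: case-insensitive email lookups.
CREATE INDEX idx_users_lower_email ON users ((lower(email)));

-- The same idea applied to a JSON field (->> requires PostgreSQL 9.3+):
CREATE INDEX idx_events_type ON events ((data ->> 'type'));

-- A query matching the indexed expression can use it:
SELECT * FROM users WHERE lower(email) = 'alice@example.com';
```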
One of the reasons MongoDB is so popular is that it is a fantastic database for developers. As a Java developer, I can work with sets, hashmaps, and embedded structures in my code and have them map effectively 1-to-1 to the database. It's akin to an object database, meaning you can focus higher up in the stack.
With the SQL ORMs you can't avoid having to deal with the ER model.
MongoDB et al. are basically built around the assumption that a schema is never worth the complexity. It's a bold claim that contradicts decades' worth of database research.
And for the record, where I work we use a SQL store, Redis, and MongoDB, each where the use case suits it.
I've always liked the paradigm of doing analysis on "slower" data stores, such as Hadoop+Hive or Vertica if you have the money. Decoupling analysis tools from application tools is both convenient and necessary as your organization and data scales.
PostgreSQL scales surprisingly well for this purpose, and is much nicer for interactive queries than Hadoop/Hive. We use Impala[1] for some larger datasets, but Impala is comparatively new, and it's nice to have something as battle-tested as postgres here.
As for the "why do we need realtime?": In my mind the benefit of a near-realtime replica is not that you actually often need it, but that it means you never have to ask the question of "Was this snapshot refreshed recently enough?", and never end up having to wait several hours for an enormous dump/load operation, when you realize you did need newer data.
[1] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...
I do agree that PostgreSQL would be nicer for interactive queries. Waiting for a MapReduce job to spin up is a bit of a buzzkill.
With regard to your use cases, what sort of questions have you found yourself answering the most? Do you have analytics applications running off of this?
How was your overall experience with Impala? Did you have a fairly new Hive cluster to try it out on, or did you spin up a new one, since Impala can only read certain file formats (i.e., no custom SerDes)?
Also, are the Hive/Hadoop datasets more for data exploration, while this PostgreSQL solution is for smaller datasets that return in a few seconds and would perform poorly in Hive due to the cost of setting up a MapReduce job?
(In full disclosure, I wrote mongo_fdw for PostgreSQL.)
I'm thinking of this as something like polyglot memoization. Pretty cool when you think about it. Frequently need something that is slow in NoSQL but fast in SQL? Memoize it to your SQL datastore. The alternative has always been to write it to two places; I kind of dig moving this work out to the datastore instead.
I'm thinking that plenty of people will find this useful.
MongoDB is great for failover and for rapid development or prototyping. SQL is great for reporting or analytics, since you can do all kinds of aggregates and JOINs right in the database.
The edge cases where you can't represent the data perfectly aren't a huge deal for this use case -- because it's a one-way export, you don't have to be able to round-trip the data, and as long as you can export the data you want to run analysis on, it doesn't matter if there's some you can't get.
SELECT c.email FROM customers c, subscriptions s WHERE c.subscription_id = s.id AND s.status = 'active' AND s.trial_start IS NOT NULL;
(where, of course, the customers and subscriptions tables would be virtual views over your customers and subscriptions)
This kind of comment shows how little knowledge you have about NoSQL and SQL. It's not about SQL vs. NoSQL; it's about using the right technology for the job.
The question is perfectly valid. In many scenarios (not necessarily Stripe's), PostgreSQL is fast enough to do the job. Stop putting people down for legitimate engineering questions.
Try not to be condescending and your point will be better received. "Right technology," as I'm sure you're aware, has as much to do with subjectivity as with appropriateness. Familiarity, workflow, and ease of use (and did I mention familiarity?) cannot be overstated, even when the perceived benefits are considered.
Read: religion.
Some of the people who rail against NoSQL may be deriding it out of a knee-jerk reaction, but others are simply frustrated with developers who, as Ted Dziuba would say, "value technological purity over gettin' shit done".
Relational databases were created in the first place to solve these very problems around transactionality and analytics for finance.
This library is a beautiful example of reinventing the wheel, and of creating a patchwork of unnecessary, and ultimately brittle, infrastructure.
https://github.com/10gen-labs/mongo-connector/tree/master/mo...
Seems to be high quality, and supports replica sets.
I'd also like to mention a project I've been contributing to, Mongolike
[My fork is at https://github.com/e1ven/mongolike , once it's merged upstream, that version will be the preferred one ;) ]
It implements mongo-like structures on TOP of Postgres. This has allowed me to support both Mongo and Postgres for a project I'm working on.
I thank them for releasing this.
It's much more effective and efficient to use a SQL query than it is to throw together a huge amount of imperative JavaScript code (that's usually very specific to a single NoSQL database, as well) merely to perform the equivalent query.
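For example, a rollup that would otherwise take a hand-written map/reduce job can be a single declarative query (table and column names here are hypothetical):

```sql
-- Total spend per customer, filtered to big spenders, in one query.
SELECT customer_id, sum(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING sum(amount) > 1000
ORDER BY total_spent DESC;
```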
It's much safer to use a database that offers true support for transactions and constraints, rather than trying to hack together that functionality in some Ruby or PHP data layer code, or relying on some vague promise of "eventual consistency", for instance.
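A sketch of what the database can enforce for you (hypothetical schema): a CHECK constraint plus a transaction make the classic transfer example safe with no application-layer bookkeeping.

```sql
-- The constraint rejects overdrafts no matter which code path writes.
CREATE TABLE accounts (
    id      serial PRIMARY KEY,
    balance numeric NOT NULL CHECK (balance >= 0)
);

-- Both updates commit together or not at all.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```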
It's much more maintainable, and leads to higher-quality data, to spend some time thinking about a schema, rather than just arbitrarily throwing data into a schema-less system, and then having to deal with the lack of a schema throughout any application code that's ever written.
Aside from an extremely small and limited handful of situations (Google and Facebook, for instance), relational databases are the best tool for the job.
Honestly, I don't think you could be more misinformed if you tried.
Hint: Google "Big Data".
Look at the old NoSQL systems: InterSystems Caché got a SQL interface, and GT.M (in the PIP framework) also got SQL.
My impression is that MongoDB looks a lot like MUMPS storage, with globals in JSON.
I actually played with mongo_fdw. At this point, it's a really cute hack, and useful for some things, but it doesn't give Postgres enough information and knobs to really let the query planner work effectively, so it ends up being really slow for complex things. I do love the concept, though.
Hey, one way to do that is to use the MongoDB foreign data wrapper, also mentioned in some of the earlier threads.
mongo_fdw (https://github.com/citusdata/mongo_fdw) allows you to run SQL on MongoDB on a single node. Citus Data allows you to parallelize your SQL queries across multiple nodes (in this case, multiple MongoDB instances) by just syncing shard metadata. So you would effectively run SQL on a sharded mongo cluster without moving the data anywhere else.
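A rough sketch of wiring up mongo_fdw on a single node (the exact option names follow the project's README and should be treated as assumptions):

```sql
CREATE EXTENSION mongo_fdw;

-- Point Postgres at a running mongod instance.
CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw
    OPTIONS (address '127.0.0.1', port '27017');

-- Map a collection to a foreign table, then query it with plain SQL.
CREATE FOREIGN TABLE customers (name text, email text)
    SERVER mongo_server
    OPTIONS (database 'appdb', collection 'customers');

SELECT email FROM customers WHERE name = 'Alice';
```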
Another idea could be to use MoSQL to neatly replicate each mongo instance to a separate PostgreSQL instance, and then use Citus Data to run distributed SQL queries across the resulting PostgreSQL cluster.
I have read that they announced some new features for version 2.x; so, is it really that great?