Databases Should be Dynamically Typed (opens in new tab)

(blog.mongodb.org)

14 pointssharksandwich16y ago30 comments

30 comments

(I also commented on the article, but nobody seems to be commenting over there, so reposting...)

The advantage of statically typed RDBMS that he's leaving out here are (a) storage optimization (b) query optimization (as jawngee has already noted).

You could get halfway to simulating a dynamically-typed RDBMS by declaring all your columns as, say, VARCHAR(5000). You could store strings, integers, floats, dates, etc. all in there pretty simply. However, they would use a lot more storage space as strings than they would as native data types (e.g. integer 1000 is one byte, string '1000' is 4). Over a large data set that would really add up.

Secondly, when doing queries, your comparison operators (e.g. WHERE date > 2009-07-06) would be way less efficient as string comparisons than native type comparisons.

I don't want to be dismissive and curmudgeonly about this, but over and over what I hear from people enthusiastic about NoSQL solutions is that they solve the current problems of RDBMS, while forgetting all the great features that we have spent the last 30 years building into database systems.

The current "limitations" of SQL-based systems are often in fact age-old trade-offs that we made, but people have forgotten the positive benefits of those trade-offs. The reason databases are statically typed is because that saves storage space and processing time. The reason there's a standardized, domain-specific language that you have to learn is because having to learn a completely new API and mental model every time you want to access a data store from a new vendor is inefficient. Oh, and the reason SQL is so complicated is because relational algebra is complicated.

Sure, SQL and RDBMS have their limitations. They're not the right tool for every job, and they are not even 100% perfect at the jobs where they are the right tool. But too often I hear people saying "fuck SQL!" simply because they don't want to learn it, and because they're too early on in their little pet project to realize the scalability problems a NoSQL system is going to run into that RDBMS solved 20+ years ago.

evgen16y ago

> The reason databases are statically typed is because that saves storage space and processing time

The problem with this argument is that when I can slap several TB into every node in my db cluster the storage cost issue is basically moot. When it is cheaper to buy ten small boxes than one beefy server the processing time and query optimization issue also starts to come into question. At some point soon, and we may have already passed that point, the cost of centralizing db queries through a small number of expensive, highly tuned servers necessary to maintain the pillars of the RDBMS model will be outpaced by the benefits of abandoning both optimizations in favor of being able to use a small, cheap fleet of boxes running a distributed database.

The tradeoffs that were made several decades ago when the RDBMS and SQL ascended to their current dominance may have made sense at the time, but a lot has changed since then. You may try to delude yourself into thinking that there are "scalability problems a NoSQL system is going to run into that RDBMS solved 20+ years ago" but given the fact that almost all of these systems were designed with an eye on under what conditions RDBMS failed and in light of decades of research into distributed systems and scalability that did not exist when SQL/RDBMS emerged I find it hard to believe that there are scalability problems that RDBMS solved that NoSQL systems will run into. Hell, most of them were designed to solve specific scalability failures in big data systems where RDBMS fell over and died.

wvenable16y ago

You're putting a lot of faith into this. The fact is that performance, scalability, features, and correctness are a trade-off. RDBMS is optimized for certain conditions and these NoSQL databases are optimized for other conditions. NoSQL databases are horrible at things that RDBMS's do extremely well and vice-versa. NoSQL isn't some magic discovery -- of course they're going to run into problems that RDBMS solved years ago!

Your argument for small boxes seems to be the same argument for one large box. My (extremely average) desktop has a terabyte of storage in it -- why do I need a bunch of boxes running a distributed database? One big server is a hell of a lot easier to manage than a fleet (a fleet!) of small boxes.

seldo16y ago

You make a very valid point, and it's one I do think about. I try not to bash NoSQL, because it does make sense sometimes, but there's a recent tide of "SQL is dead, long live NoSQL" that ignores the fact that RDBMS are a superior solution in some cases. Really, I think most people agree that RDBMS are sometimes better, and the argument revolves around how often "sometimes" is.

I don't doubt that NoSQL systems solve scalability problems people experience when they use (or rather, mis-use) RDBMS. My point is that they are making trade-offs in terms of storage, data integrity, etc. that they may not be considering, because those things used to be "free" with RDBMS so they weren't a concern.

It's certainly true that RDBMS were born in a very different age in terms of expense of both storage and processing. As with MapReduce, with is an astonishingly inefficient but extremely fast and scalable way of crunching big volumes of data, NoSQL systems may be making use of huge leaps in the availability of resources to prioritize speed and scalability over other concerns.

However, as I mentioned, there seem to be lot of newbie developers who aren't thinking about NoSQL vs. RDBMS in these terms ("speed" vs. "data integrity", say) but instead in terms of "easy" vs. "hard", or even just "new hotness" vs. "old and busted". As a result they may be avoiding one set of problems solved by NoSQL only to run into a totally different set solved by RDBMS, without considering which is more important for their app.

bayareaguy16y ago

integer 1000 is one byte

What sort of byte? Mine only represent integers from 0 to 255.

More seriously, I don't buy your implied argument that plain text representations are less compact. Whether or not this is true depends more on the specifics of your domain and the system you use. For much of the data I work with the strongly typed database representation is often larger then the plain untyped (ASCII) source.

jseifer16y ago

I'll bite. Use the right tool for the job. I think that's what things have always boiled down to. It just so happened that sql was the right tool for the job a lot of the time in the web app space. It still is.

But now we're at a place where there are a lot of options. You can use CouchDB, Tokyo cabinet/tyrant, redis, mongodb, and a host of other alternative storage engines. You don't have to drop your sql database, you can supplement it and take some of the load off of your current db.

I think the whole "right tool for the job" thing takes on new meaning these days. The right tool for the job could now be a combination of sql, mongo, and tokyo cabinet. It depends on the job, the app, and the requirements. But I wouldn't dismiss any NoSQL option so quick.

timwiseman16y ago

I don't think he is dismissing NoSQL options at all. Rather, what I think his point is (and know mine is) is that many people recently tend to dismiss SQL options out of hand for various reasons, often just the newness of NoSQL.

Sometimes NoSQL is truly better, but sometimes SQL is better. Neither should be neglected out of hand.

omouse16y ago

I'm curious...what about having optionally statically typed databases? Like so you can turn on the type checking and optimizations on a per-table basis?

jawngee16y ago

<sarcasm> Yeah that would be totally awesome! Now our databases can't optimize our query plans and any aggregate functions will be doing super cheap type conversions between strings and number types!

And, really, who cares about data consistency? Sure that stupid little script is inserting strings into what should be a numeric column, but I'm sure that won't cause any problems at all when the monthly reports run, or the batch processes that charge our customers run, because the database will know how to handle such cases because the DB developers have written all that extra logic in it for such cases. </sarcasm>

<reality> Stop being lazy. </reality>

reg4c16y ago

Data consistency could be maintained through stronger runtime input verification which in turn nulls everything you said.

I don't think that it would be a bad idea to have dynamically typed databases although it would put a little more strain on developing the actual application instead of having to spend time on making the database.

jawngee16y ago

Yeah, but guess what? More than your single application typically hits your database, so now you have to make sure any and all applications do that runtime input verification and do it correctly.

At Massify, we have 5 different applications with entirely different code bases hitting the same datasources, so now you want me to go and make sure input verification is the same across all of them when I could more easily do it in one source?

It doesn't nullify anything I've said, in fact it proves it!

It would be a HORRIBLE IDEA. Nothing about this is a good idea, if you think it is, please turn in your developer license!

timwiseman16y ago

Yes, but this is normally a bad trade off. For one thing, if you are in a large organization your DBA (or at least your lead DBA) is often a senior person with a lot of experience and was normally a developer before becoming a DBA. Part of the DBA's job is to protect the database, and data integrity, from developer mistakes. In said large organization most of your developers are younger and less experienced and may not always remember to put in the checks in the application. Code review will certainly help, but even there you often have to validate input in a lot of places where you define database types once.

This relationship changes somewhat in smaller organization with a more hand-picked team of course, but even there it is normally less work to set up the schema properly then to make sure every application that ever touches it properly validates every single piece of data for proper type.

There are times when a dynamically typed database makes sense, for instance when it is primarily an object store for an object oriented program. But if you want a relational database with the data integrity and data analysis capabilities it brings to bear, then it probably makes more sense to use a statically typed database.

seldo16y ago

And if we all switched to this tomorrow, six months from now somebody would invent amazing new "static type storage", the advantages of which would include "saves you hours of coding those tiresome data-validation routines we all hate so much!"

elblanco16y ago

This works really well for embedded applications since the code logic should be doing some type checking and verification before insert.

SQLite is a very well known relational database with dynamic typing, it's quite possible the most deployed relational dB in the world since it runs in everything from browsers to mp3 players and if the data inserts are well controlled, don't really result in any particular problems.

1 more reply

fadzlan16y ago

It is quite common to have reports, BI tools and other applications going through your database, so now you have to do the checks on different fronts.

And another very important point, applications errors can be hard to fix, but Terabytes of inconsistent data over the years that your business depends on, now that's nightmare.

noblethrasher16y ago

In fact, I've just come to realize to the benefits of user defined data types, especially for primary keys. You can treat tables as functions and dynamically discover the ways in which you can compose them (i.e. create queries).

neilc16y ago

even with a statically typed database, type matching errors storing data are only reported at runtime! (That is, our java compiler doesn’t check our MySQL schema.)

That isn't necessary the case: for example, PG'Ocaml allows Ocaml programs that access a PostgreSQL database to have their type consistency checked at compile-time.

http://developer.berlios.de/projects/pgocaml/

ct4ul4u16y ago

How does he get from the the text of the article, which approximately says "It would be useful in some cases for a database to be dynamically typed" to the sweeping statement "Databases Should be Dynamically Typed".

As others have noted, relational databases have a rigorous theoretical foundation that is not consistent with what he describes as dynamic typing. He should consider looking at a graph database. Many of these are billed as "RDF Stores" or "Triple Stores". Franz's AllegroGraph is an excellent example.

parenthesis16y ago

SQLite is: http://www.sqlite.org/datatype3.html

wvenable16y ago

Hey, is there anyone out there that uses SQLite and actually uses it's "duck typing" feature to some effect?

I'm using it, but I still declare my "column affinity" and store values with the proper types. I don't put strings in my number columns, for example. I can't even imagine a good reason for that.

elblanco16y ago

And it's a fantastic free (public domain licensing anyone?) embedded dB with readily available support and custom dev services at the end of the phone.

bayareaguy16y ago

One situation that calls for a typeless database is when the system must accept data from a source that's unwilling or unable to adhere to a well-defined static schema.

timwiseman16y ago

Perhaps, but then perhaps a database is not the proper store for that at all, and a relational database almost certainly isn't.

If you really must do it via a relational database, you could use something equivalent to varbinary(max) to accept whatever stream of bits is sent in and then parse them out into something meaningful later (or just store it for the applications retrieval).

bayareaguy16y ago

X ∈ { database } does not imply X ∈ { sql database } or X ∈ { relational database }.

Even if your data is not a good fit for the relational model, it's often a good idea to rely on a database system offering features such as transactions, an ad-hoc query facility, enforcement of access permissions and update policies, replication/recovery, efficient I/O, etc.

1 more reply

wvenable16y ago

When in doubt, serialize to a string. The end result compared with a type-less database is nearly the same.

The age old computing phrase "garbage in, garbage out" applies here.

ecq16y ago

you can use SYS.ANYDATA and SYS.ANYDATASET if you use Oracle. I'm sure other databases support similar feature.

j / k navigate · click thread line to collapse

30 comments

seldo16y ago

(I also commented on the article, but nobody seems to be commenting over there, so reposting...)

The advantage of statically typed RDBMS that he's leaving out here are (a) storage optimization (b) query optimization (as jawngee has already noted).

Secondly, when doing queries, your comparison operators (e.g. WHERE date > 2009-07-06) would be way less efficient as string comparisons than native type comparisons.

evgen16y ago

> The reason databases are statically typed is because that saves storage space and processing time

wvenable16y ago

seldo16y ago

bayareaguy16y ago

integer 1000 is one byte

What sort of byte? Mine only represent integers from 0 to 255.

jseifer16y ago

timwiseman16y ago

Sometimes NoSQL is truly better, but sometimes SQL is better. Neither should be neglected out of hand.

omouse16y ago

I'm curious...what about having optionally statically typed databases? Like so you can turn on the type checking and optimizations on a per-table basis?

jawngee16y ago

<sarcasm> Yeah that would be totally awesome! Now our databases can't optimize our query plans and any aggregate functions will be doing super cheap type conversions between strings and number types!

<reality> Stop being lazy. </reality>

reg4c16y ago

Data consistency could be maintained through stronger runtime input verification which in turn nulls everything you said.

jawngee16y ago

Yeah, but guess what? More than your single application typically hits your database, so now you have to make sure any and all applications do that runtime input verification and do it correctly.

It doesn't nullify anything I've said, in fact it proves it!

It would be a HORRIBLE IDEA. Nothing about this is a good idea, if you think it is, please turn in your developer license!

timwiseman16y ago

seldo16y ago

elblanco16y ago

This works really well for embedded applications since the code logic should be doing some type checking and verification before insert.

1 more reply

fadzlan16y ago

It is quite common to have reports, BI tools and other applications going through your database, so now you have to do the checks on different fronts.

And another very important point, applications errors can be hard to fix, but Terabytes of inconsistent data over the years that your business depends on, now that's nightmare.

noblethrasher16y ago

neilc16y ago

even with a statically typed database, type matching errors storing data are only reported at runtime! (That is, our java compiler doesn’t check our MySQL schema.)

That isn't necessary the case: for example, PG'Ocaml allows Ocaml programs that access a PostgreSQL database to have their type consistency checked at compile-time.

http://developer.berlios.de/projects/pgocaml/

ct4ul4u16y ago

parenthesis16y ago

SQLite is: http://www.sqlite.org/datatype3.html

wvenable16y ago

Hey, is there anyone out there that uses SQLite and actually uses it's "duck typing" feature to some effect?

I'm using it, but I still declare my "column affinity" and store values with the proper types. I don't put strings in my number columns, for example. I can't even imagine a good reason for that.

elblanco16y ago

And it's a fantastic free (public domain licensing anyone?) embedded dB with readily available support and custom dev services at the end of the phone.

bayareaguy16y ago

One situation that calls for a typeless database is when the system must accept data from a source that's unwilling or unable to adhere to a well-defined static schema.

timwiseman16y ago

Perhaps, but then perhaps a database is not the proper store for that at all, and a relational database almost certainly isn't.

bayareaguy16y ago

X ∈ { database } does not imply X ∈ { sql database } or X ∈ { relational database }.

1 more reply

wvenable16y ago

When in doubt, serialize to a string. The end result compared with a type-less database is nearly the same.

The age old computing phrase "garbage in, garbage out" applies here.

ecq16y ago

you can use SYS.ANYDATA and SYS.ANYDATASET if you use Oracle. I'm sure other databases support similar feature.

j / k navigate · click thread line to collapse