Normalisation isn't primarily about saving storage; it's about avoiding update anomalies, i.e. correctness.
"Why do we normalize?"
150 students, in unison: "To make better relations"
"And why do we DE-normalize?"
150 students, in unison: "Performance"
"And what is a database?"
150 students, in unison: "A self-describing set of integrated records"
That was 16 years ago, and I'm still able to say those answers verbatim.
Maybe to someone who could make sense of the DDL and read the language the column names are written in. And who understands all the implicit units, the rules around nulls/empties, the magic strings (SSN, SKU), the special numbers (-1), and on and on. For that you need something like RDF and a proper data model.
"Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed."
From Section 1.4, "Normal Form":
"A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization."
As I read this, normalization was originally "about" making storage simpler than it was with contemporaneous models. Section 2.3 ("Consistency") discusses how "data bank state" can become inconsistent, and how such inconsistencies might be addressed, up to and including "the user or someone responsible for the security and integrity of the data is notified".
I think it's reasonable to infer that guaranteed transactional consistency (what I think kpmah means above by "correctness") and the space-saving properties of eliminating redundant data both happened later, and both fell out of the initial motivation of simplification.
[1] https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf -- E. F. Codd, "A Relational Model of Data for Large Shared Data Banks"
[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.
A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations [...]
So I would say it is mainly about flexibility, correctness, and the possibility of creating a simple yet powerful query language.
The data integrity issue is still a concern, and I talk about that in the book. You need to manage data integrity in your application and think about how to handle updates properly. But it's completely doable and many people have.
This is just another way of saying you need to implement your own system for managing consistency. Dynamo offers transactions now, but they don’t offer actual serialization. Your transaction will simply fail if it runs into any contention. You might think that’s ok, just retry. But because you’ve chosen to sacrifice modelling your data, this will happen a lot. If you want to use aggregates in your Dynamo, you have to update an aggregate field every time you update your data, which means you create single points of contention for your failure prone transactions to run into all over your app. There are ways around this, but they all come at the cost of further sacrificing consistency guarantees. The main issue with that being that once you’ve achieved an inconsistent state, it’s incredibly difficult to even know it’s happened, let alone to fix it.
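The aggregate contention point described above can be sketched in code. This is a minimal illustration assuming a hypothetical single-table design where a customer's orders share a partition key with a metadata item holding the aggregate count; none of the names come from any real schema:

```python
def build_order_transaction(customer_id, order_id, amount):
    """Build a TransactWriteItems payload that inserts an order and
    bumps the customer's aggregate order count atomically.
    Table and attribute names here are hypothetical."""
    return [
        {
            "Put": {
                "TableName": "Orders",
                "Item": {
                    "PK": {"S": f"CUSTOMER#{customer_id}"},
                    "SK": {"S": f"ORDER#{order_id}"},
                    "Amount": {"N": str(amount)},
                },
                # Reject duplicate order ids.
                "ConditionExpression": "attribute_not_exists(PK)",
            }
        },
        {
            "Update": {
                "TableName": "Orders",
                "Key": {
                    "PK": {"S": f"CUSTOMER#{customer_id}"},
                    "SK": {"S": "META"},
                },
                # Single aggregate item: every concurrent order for
                # this customer contends on this one write.
                "UpdateExpression": "ADD OrderCount :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
```

You would hand this to boto3's `client.transact_write_items(TransactItems=...)`. When two writers touch the same customer at once, one transaction is cancelled with `TransactionCanceledException` and has to be retried, which is exactly the contention being described here.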
Then you run into the issue of actually implementing your data access. With a denormalized data set, implementing all your data access comes at the expense of increasing interface complexity. Your schema ends up having objects, which contain arrays of objects, which contain arrays... and you have to design all of your interfaces around keys that belong to the top level parent object.
The relational model wasn’t designed to optimize one type of performance over another. It was designed to optimize operating on a relational dataset, regardless of the implementation details of the underlying management system. Trying to cram a relational set into a NoSQL DB unavoidably comes at the expense of some very serious compromises, with the primary benefit being that the DB itself is easier to administer. It’s not as simple as cost of storage vs cost of compute. NoSQL DBs like dynamo (actually especially dynamo) are great technology, with many perfectly valid use cases. But RDBMS is not one of them, and everybody I’ve seen attempt to use it as an RDBMS eventually regrets it.
I'm liking DynamoDB for tasks that fit nicely within a single domain, have relatively pain-free access patterns, etc. And I've found good fits, but there are some places where the eventual consistency model makes me nervous.
I'm specifically thinking about updating multiple different DynamoDB keys that might need to be aggregated for a data object. The valid answer may be "don't do that!" – if so, what should I do?
(I'll probably just buy the book ;-))
The book is highly recommended by folks at AWS, including Rick Houlihan, the leader of the NoSQL Blackbelt Team at AWS[0].
Happy to answer any questions you have! Also available on Twitter and via email (I'm easily findable).
[0] - https://twitter.com/houlihan_rick/status/1247522640278859777
He could have ignored my email entirely. He's a busy guy, right? I wouldn't have held it against him at all. Instead, he was super nice, provided me with more documentation, and honestly was just really helpful.
If he recommends your book, I'm buying it.
If someone has already been using DynamoDB successfully in production for a year or two, what would be the main value of your course?
For a third-party breakdown of the book structure and content, check out Shawn Wang's review here: https://www.swyx.io/writing/dynamodb-book/
Shawn lists a four-part breakdown that's pretty on-point:
- Background and basics (Chapters 1-6)
- General advice for modeling & implementation (Chapters 7-9)
- DynamoDB strategies, such as how to handle one-to-many relationships, many-to-many relationships, complex filtering, migrations, etc. (Chapters 10-16).
- Five full walkthrough examples, including some pretty complex ones. One of them implements most of the GitHub metadata backend. Nothing related to the git contents specifically but everything around Repos, Issues, PRs, Stars, Forks, Users, Orgs, etc.
You can probably find the material in the first nine chapters if you google around enough, though it's helpful to have it all in one place. But the last 13 chapters are unlike anything else available, if I do say so myself. Super in-depth stuff.
I think it will be helpful even if you've been using it for a while, but it really depends on how deep you went on DynamoDB.
It's a bit expensive, especially with exchange rates, but it is what it is; these things take time and effort to produce.
From the website I cannot see a table of contents for the book, unless it's behind the "provide an email for free chapters" step. Can you please provide a publicly visible table of contents, no email required, on the site or here?
This will be the deciding factor for me making a purchase as I'll be able to see what the book covers and if it'll be of use to me having several years Dynamo experience.
And yep, Cassandra is pretty similar to DynamoDB. Both are wide-column data stores. Some of the original folks that worked on Dynamo (not DynamoDB) at Amazon.com went to Facebook and worked on Cassandra. The concepts underlying Dynamo became the basis for DynamoDB at AWS.
For more background on Dynamo, check out The Dynamo Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sos...
I've used DynamoDB in production (small scales only sadly) on a number of projects. Definitely done it the wrong way a few times!
Thanks for your support. I really appreciate it. Don't hesitate to hit me up with any questions or feedback.
That said, if anyone has any great recommendations here, I'm all ears. Actual experience would be best if possible, rather than the first thing you see in Google :).
There's the big catch. As another poster pointed out, normalisation is not about efficiency. It's about correctness. People have been quick to make the comparison between storage and compute cost. The high cost of development and bug-fixing time trumps both of them by an order of magnitude. The guarantee of referential-integrity alone that SQL offers helps eradicate an entire class of bugs for your application with no added effort. This article glosses so blithely over this critical caveat. Whenever this discussion comes up I'm quick to refer back to the yardstick of "Does your application have users? If so, then its data is relational". I can't wait for the day when we look back at NoSQL as the 'dancing sickness' of the IT world.
It's also worth questioning: 'At what scale does this tradeoff become worthwhile?' Another poster here correctly pointed out that modern versions of Postgres scale remarkably well. The tipping point where this kind of NoSQL implementation becomes the most efficient option is likely to be far beyond the scale of most products. It's true that completely denormalising your data will make reads much faster; this is undeniable. That does not mean you need to throw the baby out with the bathwater and store your master data in NoSQL.
It does have some compelling use cases. It’s just relational data isn’t one of them. If you have a use case with a low potential for write contention, a tolerance for eventual consistency, a very simple data structure, and a high demand for read throughput, then it’s great. One area that I’ve seen it used with great success is content publishing. You have one author, perhaps an additional editor/proofreader, the content is one document (with perhaps one other related document, like an author bio), and hopefully you want thousands or perhaps millions of people to be able to get decent read performance. Another example could be pretty much anything you’d typically use a Materialized View for in a DB. You can compute the view in your RDBMS, periodically publish it to a document database, and then offload all read throughput to a better suited system.
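That "publish a materialized view" step can be sketched as follows. This is an illustrative example with made-up schemas, not any particular system: denormalize the relational rows into self-contained documents, then bulk-write those to the document store.

```python
def build_article_documents(articles, authors):
    """Denormalize relational rows into self-contained documents,
    as when publishing a materialized view to a document store.
    Row shapes here are illustrative only."""
    authors_by_id = {a["id"]: a for a in authors}
    docs = []
    for art in articles:
        author = authors_by_id[art["author_id"]]
        docs.append({
            "article_id": art["id"],
            "title": art["title"],
            "body": art["body"],
            # Embed the author bio so reads need no join.
            "author": {"name": author["name"], "bio": author["bio"]},
        })
    return docs
```

A periodic job would recompute this in the RDBMS (where consistency lives) and overwrite the documents in the read-optimized store, so the document database only ever serves read throughput.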
NoSQL is usually used wrong imo, but that doesn’t mean there aren’t ways to use it right. There’s valid use cases for graph databases and stream processing systems too. But they’re not hip enough to produce the same volume of highly questionable web apps.
When you start with a relational model you don't need a priori knowledge of your data access patterns, and you get solid performance and guarantees. If you do need that access knowledge beforehand, is DynamoDB really best for scaling mature products?
- High-scale situations where you're worried about performance of a relational database, particularly joins, as it scales.
- If you're using serverless compute (e.g. AWS Lambda or AppSync) where traditional databases don't fit well with the connection model.
That said, you can use DynamoDB for almost every OLTP application. It's just more a matter of personal preference as to whether you want to use a relational database or something like DynamoDB. I pick DynamoDB every time b/c I understand how to use it and like the other benefits (billing model, permissions model, performance characteristics), but I won't say you're wrong if you don't choose it in these other situations.
- Less maintenance around schema/migrations
- Data types and validation
- You still get queries (though not to the level of SQL complexity)
- You still get indexes
- You get row-level TTLs, like in Redis
- Hosted / infinite scale
- Billing based on storage/throughput, not fixed instance sizes
I would say there is much more maintenance around schema and migrations. Since there is no enforcement of schema at the database level, you need to be very careful in understanding every single way your application(s) work with the data, and ensure that they are backward and forward compatible. This generally involves writing a lot of custom tooling and batch migration logic, and ensuring strict control over code that modifies data.
It's very easy to discover schema migration problems in production as the data is accessed.
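As an illustration of the kind of compatibility shim this forces into application code (field names and version numbers here are made up), every reader has to tolerate every schema version still sitting in the table:

```python
def read_user(item):
    """Normalize a raw item across schema versions. Because the
    database enforces no schema, old-shape items persist until a
    batch migration rewrites them, and readers must handle both.
    Versions and fields are hypothetical."""
    version = item.get("schema_version", 1)
    if version == 1:
        # v1 stored a single "name" field; split on first space.
        first, _, last = item["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    if version == 2:
        # v2 split the name into two fields.
        return {"first_name": item["first_name"],
                "last_name": item["last_name"]}
    raise ValueError(f"unknown schema_version: {version}")
```

Every variant of this logic, in every service touching the table, is the "custom tooling" the parent comment is talking about; an RDBMS migration would have rewritten the column once, up front.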
If you know all your access patterns and your writes >>> reads, a NoSQL solution will be cheaper to operate than Postgres. Meaning, I believe, that for most deployments you can get the same amount of performance from Postgres, but simply at a higher cost (which may be 3-6x at most). Another reason to go with NoSQL is if you are latency sensitive, although I don't think Dynamo falls in this bucket.
NoSQL was also really good for OLAP, but I think now there are several really good OLAP solutions (like Clickhouse for OSS and Redshift/BigQuery in the cloud) that are easier to manage.
Those concerns are mostly gone when you rely on a service like DynamoDB. It's not “free”: it comes with increased complexity at the app level, but it does offer peace of mind if you can afford the $$$.
(I'm a fan of DynamoDB and think there are many good use cases for it. Just saying that the above comparison doesn't seem relevant here.)
It's just that you know Postgres and I know DynamoDB. So go with what you know :)
- LSIs can't be created after the table is created
- Creating a GSI after the fact generates back pressure on the main table
- If you have an LSI, your table will not scale beyond 10GB
- How often will a table scale up and down per day/hour?
- Cost of auto-scaling in cloudwatch (alarms aren't free)
...and so much more. I've been working with Dynamodb for over 2 years now and I love it.
A DynamoDB table with an LSI can scale far beyond 10GB; the 10GB limit applies per item collection (i.e. per partition key value), not to the whole table. That said, I would avoid LSIs in almost all circumstances. Just go with a GSI.
Did business with a startup, signed up, started getting service, they play an intermediary biller / payor role.
Because of an issue with the company name used in signup, their billing system fell over and didn't set up billing.
But what was crazy is that I quickly realized this shop was a NoSQL shop. NOTHING connected to anything, so since they hadn't built any reports to cross-check any of this, they literally did not notice (I noticed other consistency issues elsewhere).
In a SQL database this stuff, especially around accounting / money, is 101 stuff, but the NoSQL setup seemed to really struggle here based on how they'd built it.
I finally bugged them to charge us, but even that was a bit goofy (basically it looked like they exported some transactions to a credit card system, but I doubt they had any logic to handle failed payment issues etc.).
We have one other vendor where the amount actually charged is a few pennies off the receipt totals - issues with doubles, rounding and application logic or something which doesn't simply use the same number (from database) for order / item detail and total and billing.
So at least in finance / accounting, a ledger that is basically a source of truth, and is linked / summarized / etc. in various ways to other parts of the system (sales tax, receipts, credit card charges by order, etc.), really results in some free consistency wins that don't seem to be free in NoSQL land.
In SQL, normalized, with referential integrity on, you simply never get out of sync.
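For contrast, here is a minimal demonstration of that guarantee using Python's built-in sqlite3 (note that SQLite requires opting in to foreign key enforcement). The database itself rejects an orders row that points at a nonexistent customer, with zero application logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id))""")
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")  # ok

try:
    # Orphan row: customer 99 does not exist.
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```

In a NoSQL store, that orphan write would have succeeded silently, and only a cross-checking report (which the shop above never built) would ever surface it.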
I generally don't recommend DynamoDB as a primary data store, irrespective of your use case. It takes too much time to model the data. With every new requirement, you have to redo a lot of the modelling exercise. Choices you made in the beginning start looking bad, and you will not remember why you created that particular combination of composite key or local secondary index which offers no benefit after incremental changes. Transaction support is painful; the existing SDKs just don't cut it.
I often wish some of the GCP Firebase features were available in DynamoDB, like namespaces, control over daily throughput to avoid billing spikes, and transaction support.
- Billing model. Pay for reads and writes directly rather than trying to guess how your queries turn into CPU & RAM. Also able to scale reads and writes up and down independently, or use pay-per-use pricing to avoid capacity planning.
- Permissions model: Integrates tightly with AWS IAM so works well with AWS compute (EC2, ECS/EKS, Lambda) with IAM roles. Don't need to think about credential management and rotation.
- Queries will perform the same as you scale. It's going to work exactly the same in testing and staging as it does in prod. You don't need to rewrite when you get four times as many users.
A lot of folks are worried about migrations, but they're not as bad as you think. I've got a whole chapter on how to handle migrations. Plus, one of the examples imagines that we're re-visiting a previous example a year later and want to add new objects and change some access patterns. I show how it all works, and they're really not that scary.
Once you learn the principles, it really is like clockwork. It changes your process, but you implement the same process every time.
Honestly, I think part of the problem is that there's a lot of bad NoSQL content out there. A little standardization of process in this space will go a long way, IMO :)
But forget to do normalisation and you will be paying 5 figures a month on your AWS RDS server.
"Storage is cheap as can be, while compute is at a premium."
This person fundamentally does not understand databases. Compute has almost nothing to do with the data layer; or at least, if your DB is maxing out on CPU, then something is wrong, like a missing index. And for storage, it's not like you are just keeping old movies on your old hard disk: you are actively accessing that data.
It would be more correct to say: disk storage is cheap, but SDRAM cache is 1000x more expensive.
The main issue with databases is IO and the more data you have to read, process and keep in cache, the slower your database becomes. Relational or non-relation still follows these rules of physics.
Oh boy I do love hackernews :).
It sounds like you’ve spent a lot of your career in a SQL world. Have you worked a lot with DDB/MongoDB/Cassandra? If not then give it a whirl with more than a toy application and share your thoughts. Already done that? Try the brand new “constructive criticism” framework.
Instead of “this person fundamentally does not understand databases” based on 13 words in a 1200+ word article, consider: “I disagree with this statement and here’s why”.
You get all of the karma with none of the ad hominem! Win win!
There are other approaches--in cases where I've needed compound keys I've had success using version-5 UUIDs as the primary key constructed from a concatenation of the compound key fields. The advantage is that Dynamo's default optimistic locking works as expected with no transaction needed. A potential disadvantage is if you frequently need to look records up by just one component you'd need a secondary index instead of the primary doing double duty.
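A sketch of that UUIDv5 approach; the namespace UUID and the field separator are arbitrary choices for illustration, not anything Dynamo-specific:

```python
import uuid

# App-specific namespace for deterministic key generation; any fixed
# UUID works as long as it never changes after keys are in use.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com/orders")

def compound_key(*parts):
    """Derive a stable primary key from compound key fields.
    The same inputs always yield the same UUID, so writes to the
    same logical record always land on the same item."""
    # Use a separator unlikely to appear in the fields, so that
    # ("ab", "c") and ("a", "bc") don't collide.
    return str(uuid.uuid5(KEY_NAMESPACE, "\x1f".join(parts)))
```

Because the key is deterministic, a conditional put (or the default optimistic locking the parent comment mentions) naturally detects concurrent writes to the same logical record, with no transaction needed.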
Do you talk about the best way to do aggregations in your book? That's one of the more annoying downsides of DynamoDB that I've kind of had to hack my way around. (I combine DynamoDB streams with a Lambda function to increment/decrement a count)
Yep, I do talk about aggregations in the book. One strategy that I've discussed is available in a blog post here[0] and involves using DynamoDB Transactions to handle aggregates.
If you're looking for large-scale aggregates for analytics (e.g. "What are my top-selling items last month?"), I have an Analytics supplement in the Plus package that includes notes on different patterns for analytics w/ DynamoDB.
Let me know if that helps!
[0] - https://www.alexdebrie.com/posts/dynamodb-transactions/#hand...
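The streams-plus-Lambda counter mentioned above can be sketched like this. The record shape follows the DynamoDB Streams event format, but only `eventName` is inspected, and the final `update_item` call that applies the delta is omitted:

```python
def counter_delta(record):
    """Map a DynamoDB Streams record to a counter adjustment for a
    Lambda that maintains an item count via increment/decrement."""
    event = record["eventName"]
    if event == "INSERT":
        return 1
    if event == "REMOVE":
        return -1
    return 0  # MODIFY doesn't change the count

def lambda_handler(event, context=None):
    """Sum deltas across the batch; a real handler would then apply
    the net change with a single update_item ADD expression."""
    return sum(counter_delta(r) for r in event["Records"])
```

Batching the deltas like this means one aggregate write per stream batch rather than one per item event, which reduces contention on the counter item.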
Have geo data? Need a count? Need to look something up for a team member when you don't have time to set up BI tools? Want to query something differently because of design changes? The answer is pretty much no, or you can, but each of those takes some extra piece of machinery.
We ended up picking up an Aurora Serverless and it is working just fine. We can quickly make changes as business cases change, our entire team can query it for now (until we have fancier tools on the biz end), and we are able to move much more quickly.
I suggest those of you still unfamiliar with NoSQL operational data storage patterns trust companies like Trek10 and Accenture (where I saw great success).