What solutions does the HN community have in dealing with this issue?
One master is picked as the final authority, then changes flow to other databases, replicas, etc. either using database tools or applications.
Awareness of deadlines or how much latency (ie. when the updates are needed for each downstream pipeline) is helpful.
A practical example that solves a simple case is the recent release of a tool by Netflix.
Pro tip: when you interview people for a project like this, they should already know the above. :)
Source: DBA.
Mortgage rates change frequently.
The bank likely deals with this exact scenario on a regular basis and should have an established procedure in place.
The distinction and purpose is clearer when you think about aggregating data, such as the various Corona virus trackers are doing. This is a modern example but the scenario applies as far back as humans have been recording data. (And will apply as far forward as well, because physics)
Right now every hospital or testing center is recording the tests they give and their results. Plus they record how many inpatients, how many ICU beds, how many ventilators, etc.
Each hospital has a (at least one) database where that data is recorded.
Separately various news orgs and political offices are keeping track of all of the counts for their region, etc. So some number of times a day they call each of the hospitals to get their counts. (Ok, they don't actually call, but you get the idea). And they aggregate those counts.
Depending on the time that the organizations call the hospitals, a given hospital will give them a different count. The count As Of a particular time.
So different aggregation organizations may all have slightly different counts throughout the day.
You in following the progression of the cases will have picked some particular source that you choose to give you a trustworthy count.
So- the hospital, in keeping Records- is a System of Record.
The news orgs as Sources of Trustworthy information for consumers are Sources of Truth.
The two seem synonomous at first blush- and colloquially when people treat them interchangeably they really at referring to Source of Truth- but on reflection it should make sense that they serve quite different purposes and have quite different requirements. And once you the distinction you see it everywhere. Hope that helps.
Was it a piece of atomic data at the system of generation? Likely SoR
Was it an aggregated/mastered/other data? Likely Source of Truth
SoR = Where it came from
SoT = Where you get it from
Edit: the linked wikipedia articles discriminate the terms by defining SSOT as a single place to store and edit every data element, while SOR allows for replicas, but in the case of data discrepancies, the SOR's state wins.
Which tool is this?
Step 2) Send a message to all the `downstream` systems (i.e. not the Source of Truth systems) with the change description and GUID
Step 3a) Update source of truth with the new info, write the GUID to a log table in source of truth in the same transaction
Step 3b,c,d) Downstream systems poll the source of truth for existence of the GUID in the log table indicating the change has been committed to source of truth. Load latest value from source of truth and commit the change to the downstream system.
Step 4) If GUID never shows up, transaction was rolled back in source of truth (or an error occurred trying to send to one of the other downstreams, aborting before transaction)- do something relevant for your use case. Alert somebody, carry on, whatever.
Create a source of truth. I'd bet a 'source of truth' internal to the bank doesn't exist right now, and they're probably pulling something manually from an external 3rd party.
And on with the regress..
Step -1) What form will this internal source of truth take, how much will it cost, and who builds that system?
Step -2) Who gets to make that decision?
My uncle used to do this kind of work and man, he has some horror stories. The common thread among his clients seemed to be 'We know we need to do something, we know it will cost a lot of time and money, and it absolutely cannot fail.' The hard part always seemed to be finding the person within the org who had both the authority to spend the money, the political clout to make the decision. Nobody wants to own something like that going wrong. The actual technical part of the work, they had that down pretty well as it was all they did.
Here's the high-level approach I would take:
1: Understand which databases serve what, and what the needs for accuracy are. (For example, a database serving their public-facing website can probably be a few seconds behind compared to a database handling real money.)
2: Figure out which database is the source of truth. Update that database first.
3: Update the remaining three databases in a loop: Update the database, then check to make sure the update happened
You will probably need to add some state management where you put yourself into an "updating" state, so that if the update breaks in the middle, you can go back and recover.
----
Another approach is to refactor the applications that are critical (Those that handle money, not the public facing web site) to always read from the database you decide is the source of truth.
If it happens infrequently enough, you can have simply have a proceed "change mortgage rate" that touches the 4 legacy databases. Done.
If it does need a technical solution, you have to establish the leading system / source of truth / whatever you call it. This can be either an existing system or a new one. And then all the others need to follow the leading system.
A more distributed approach (every of the 4 legacy systems can generate events that the other 3 work with) likely won't work with 4 legacy systems. It's often hard enough to get high-fidelity events/deltas out of a single system.
However, one technique I've used which works with all your approaches and guarantees consistency if the data model allows it is to have "future value" entries.
If you just have a single value:
mortgage_rate_percent: 3.0
And this is in 4 databases, there will be a period where they don't match. But if you put a future "valid_from" in there: mortgage_rate_percent: 3.0
valid_from: 2020-04-01 00:00:00-0000
Well, you now have the ability to have the rates all input at different times in different databases, and by querying for only the most recent one (or use from/to dates and find the intersection), they all become active at the exact same moment in time (clock drift aside) as time passes. No magic needed.This seems really obvious once you've seen it, but it's amazing how many people try and worry about source of truth, synchronisation and distributed transaction problems when in many cases (like this), the change can be planned so you can just say "this value will apply in the future" and you have ample time to make sure it gets synch'ed before that time and everything "just works"
I think that's true for 90% of patterns/problems in dev. Did you see that Prince of Persia post the other day? How the original creator came up with Shadow Man? Kind of resonates with what you are getting at in being obvious.
I embarrassingly have a 8 year old top StackOverflow post asking what a fluent interface was. Keeping in mind I'd been devving for 5+ years already at that point. It was just a "ahh F* yeah, that's obvious" moment.
Rates are the important data-item but some of the legacy DBs will need to be notified of changes like fees, as well.
There’s a trade-off between having a manual vs. automated process in the best of circumstances. You might think intuitively that an automated process would be easier and more reliable, but depending on local factors for each of the databases, there might be issues like availability or locking that could make an automated solution more “fragile” than simply having an employee with domain expertise doing an update on a schedule. Reliably automating a process like this would require a lot of careful QA.
So throwing legacy systems into the mix would just aggravate this further. It may be that the cost/benefit of automating this just isn’t there. If it’s cheaper in terms of salary cost to do it by hand for N years than the one-time expense to automate, it may just not make sense to automate. (Obviously over an indefinite time scale it would almost always make sense to automate)
And I say 'system' because as long as the company knows that Sally does the updating each day, but if Sally is on holidays or sick or quits, then there is a protocol where the work is done and the checks/balances are done to ensure that that mortgage rate change is done.
The standard thing to do is to have a source of truth, as described in various comments here like https://news.ycombinator.com/reply?id=22690991&goto=item%3Fi...
Sometimes you'll have multiple sources of truth for (hopefully non-overlapping) subsets of the data in question.
These things are messy. You have to do the hard work of writing sync tools. You want sync tools that work off of incremental / delta feeds, but you also need (slower) tools that can make-it-right based on the totality of your data -- at least the first time you sync two systems, and often also for auditing purposes to make sure that the incremental (fast) sync is working correctly.
We had a prime app, a fifth DB driving it. It had scripts to pull rates and a UI to adjust. Logs all changes.
These rates are pulled by the other systems (eg the AS400 pulled over FTP). Additional internal apps could query over SMB and HTTP.
The prime updated slowly and each of the services would poll it very frequently (4m) + on-demand when needed
2) Then once this is standardized, write a for loop.
A little planned downtime goes a long way.
Name the bank so we know not patronize patronize it.
My money is on your suggestion being what most organizations actually do, though perhaps with a workflow process to manage it.
Legacy systems are hard.