We’ve been doing customer interviews for the past couple of weeks, and the one feature that every data engineer we interviewed called “table stakes”, a “must-have”, or a “basic need” was version control. I made this video https://youtu.be/gVx4JhugCUc showing how we implemented version control. I built a simple version control menu that connects to the GitHub REST API (v3). At first, I thought this would be enough, but as I have talked to more people, it has become clear that this is not a simple problem. If any of you guys or gals have similar problems, please reach out. We’d be interested in learning about the problem so we can offer better solutions in the future.
In data engineering, version control is useful when data sources change, ETL automation services change, schemas change, or business goals change. The big problem is that you don’t want to start over or refresh all of your tables from scratch when something changes upstream of the models you are currently working on. I think semantic versioning is an excellent solution to this problem. The idea behind semantic versioning is that each model has its own history: its own changes as well as all of the changes to the models that it uses.
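To make the idea of a per-model history concrete, here is a minimal Python sketch. It is not how Structure implements versioning. It swaps in a simple content hash in place of real version numbers, and every name in it (Model, fingerprint, stale_models, the example tables) is hypothetical. It assumes the models form a DAG with no cycles. Each model's fingerprint covers its own definition plus the fingerprints of the models it uses, so an upstream change shows up only in the models that depend on it.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical data model: its own SQL plus the names of the models it reads from."""
    name: str
    sql: str
    upstream: tuple = ()


def fingerprint(name, models, cache=None):
    """Hash a model's own definition together with its upstream fingerprints."""
    cache = {} if cache is None else cache
    if name not in cache:
        model = models[name]
        digest = hashlib.sha256(model.sql.encode("utf-8"))
        # Folding in the parents' fingerprints makes upstream changes propagate.
        for parent in sorted(model.upstream):
            digest.update(fingerprint(parent, models, cache).encode("utf-8"))
        cache[name] = digest.hexdigest()[:12]
    return cache[name]


def stale_models(before, after):
    """Names of models that are new or whose fingerprint changed between snapshots."""
    cache_before, cache_after = {}, {}
    return sorted(
        name for name in after
        if name not in before
        or fingerprint(name, before, cache_before) != fingerprint(name, after, cache_after)
    )


# Made-up example: four models, then one upstream definition changes.
v1 = {
    "raw_orders": Model("raw_orders", "SELECT * FROM source.orders"),
    "raw_users": Model("raw_users", "SELECT * FROM source.users"),
    "daily_revenue": Model(
        "daily_revenue",
        "SELECT day, SUM(amount) FROM raw_orders GROUP BY day",
        ("raw_orders",),
    ),
    "user_counts": Model("user_counts", "SELECT COUNT(*) FROM raw_users", ("raw_users",)),
}
v2 = dict(v1)
v2["raw_orders"] = Model("raw_orders", "SELECT id, day, amount FROM source.orders")

print(stale_models(v1, v2))  # ['daily_revenue', 'raw_orders']
```

Only raw_orders and the model built on it need a refresh. raw_users and user_counts keep their fingerprints, so those tables can be left alone.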
Here's a blog article that goes deeper into the problem: https://www.structure.rest/blog/semantic-versioning-of-data-models
If this kind of stuff excites you, please feel free to check us out at https://structure.rest or visit our Slack: https://join.slack.com/t/structuresupport/shared_invite/zt-ddx04ho4-_q43i5o3zQ9jv00qx~dx8A