- why developers/maintainers choose the package granularity they currently do. E.g., you can have tiny granular packages today (npm famously has single-function packages, which is widely derided, BTW). Developers break down packages in a way that makes sense to them to best develop, test, maintain, and release the package. If you reduce the overhead of small "grains" of packages, developers might choose to go a little more granular, but not a lot.
- why people want or need to update. People want or need security updates. People want or need new features and functionality.
So even with this magically fully in place (there's some tooling implied here), I don't think there would be much impact on updating.
(And people who tried to implement it or use packages that implemented it would be getting burned by version update mistakes -- this seems almost pathologically error-prone -- and when something does go wrong, it will take some new class of tool to even diagnose what went wrong where. People will end up with issues triggered by their personal upgrade path.)
BTW, patch updates don't have to be done at a source or function level at all. (E.g. the upgrade from version x to x+1 could be expressed as a delta. Or x to x+2, for that matter.) This idea has been popping up for decades, but the practical value must not be worth the trouble, because it never seems to catch on in a big way.
- either version the data structures/classes/shapes of dictionaries/whatever that a function accepts/returns;
- or have converters between different data versions and use them inside your functions.
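A minimal TypeScript sketch of the second option, with invented shapes (`UserV1`, `UserV2`) and an assumed rule for splitting the old `name` field:

```typescript
// Hypothetical: two versions of a "user" shape and an explicit
// converter, so code updated for V2 can still accept V1 data
// produced before the upgrade.
interface UserV1 { name: string; }
interface UserV2 { firstName: string; lastName: string; }

// Converter between data versions; the field-splitting rule is assumed.
function upgradeUser(u: UserV1): UserV2 {
  const [firstName, ...rest] = u.name.split(" ");
  return { firstName, lastName: rest.join(" ") };
}

// Inside the function, normalize to the latest data version first.
function greet(u: UserV1 | UserV2): string {
  const v2 = "name" in u ? upgradeUser(u) : u;
  return `Hello, ${v2.firstName} ${v2.lastName}`;
}

console.log(greet({ name: "Ada Lovelace" }));                  // old-shaped data
console.log(greet({ firstName: "Ada", lastName: "Lovelace" })); // new-shaped data
```

The converter approach keeps old data usable, but note it only goes forward; rolling back (the hard case mentioned below) would need an inverse converter, which may be lossy.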
As I said in another topic on HN about a project that hoped to bring hot-code reloading to a C REPL: changing the code inside the running program is the least of the problems. Flawlessly updating the data inside the running program so that the new code can proceed to work on it, that's the hard problem (think e.g. about rolling back an update that threw away a bunch of fields).
Unless these sorts of things are dealt with, any framework like this will just be solving the part of the problem that isn't really a problem.
The solution to dependency chaos is grouping dependencies together and versioning the larger group, not splitting into even more dependencies.
Let's take a fictional example: I import D3.js to use the parseDSV() function, after 2 years the method has not received any updates, but the package has gone from version 1.0.2 to 5.0.2. With a granular system, my function would still be on version 1.0.2 (because no changes were made), but with the current system I would have received an unnecessary update.
So, in this case, granular versioning would actually help to put an end to the chaos of dependencies.
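To make that concrete, here's a hypothetical sketch (in TypeScript, with invented fields and dates) of what per-function lock data could look like; nothing like this exists in npm today:

```typescript
// Hypothetical per-function lock data for the D3 example above:
// the package as a whole is at 5.0.2, but parseDSV has not changed
// since 1.0.2, so a consumer pinned to it sees no update.
const functionLock = {
  package: "d3-dsv",
  packageVersion: "5.0.2",
  functions: {
    parseDSV: { version: "1.0.2", lastChanged: "2016-06-01" },
  },
} as const;

// A granular updater would only flag functions whose own version moved.
function needsUpdate(pinned: string, current: string): boolean {
  return pinned !== current;
}

console.log(needsUpdate("1.0.2", functionLock.functions.parseDSV.version)); // false
```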
The problem is in NPM culture, and how much churn there is in packages and especially unnecessary breaking changes.
Avoid that and then the problem is reduced from constantly fighting to play API keepup to simply letting security updates flow through.
Let your patch version number go to the moon (which is no real problem in practice; computers do big numbers, and it is easily automatable).
This is a human culture problem if anything. Things cannot be left alone and be called "done" anymore, everything has to constantly "improve", breakage be damned. New connectors HAVE to be invented, even though the improvements are marginal, and now everything "old" doesn't work anymore.
How many times have you opened up a tool you use daily or weekly only to find the UI has shifted so much that you cannot figure out how to do the task you came for?
With SaaS, this has become much more prevalent than before. And it's not just the "npm culture" or even JavaScript; this exists everywhere in society, from cars to doors to chairs to airplanes and everything in between. Obviously, some sectors are better with standards than others, but it seems to be happening more and more, everywhere.
The problem of tracking changes across dependencies exists whether or not this is true. Perhaps the problem is more evident because feature development and software change processes have become more efficient (e.g. detecting the need for changes, shipping the new features expected to stay competitive, etc.). These efficient processes highlight a mismatch that was easily managed (or ignored) in the past.
It's "semver" (with an "e"), short for Semantic Versioning.
You could have different variable semantics for different namespaces or partitions!
:)
Joe Armstrong made a proposal for this (I’m pretty sure half tongue in cheek).
https://joearms.github.io/published/2015-03-12-The_web_of_na...
Once a node in this graph (a function, a type) changes, it may require a version change of anything that depends on it (a function, a type), because the behavior / contract may materially change even if the code itself did not change!
I suppose this is handled by changing the module version, because that module likely also contains the stateful object whose behavior is now different.
But equally the module version should change once its dependencies change, because the summary behavior of the functions inside the module is now different, as it incorporates the changed behavior of its dependencies.
Because of that I suspect we'll end up with situation similar to today's, with constant updates of our dependencies because their (distant transitive) dependencies changed.
Theoretically we could track dependencies at an individual function level. Then the version of a function may stay the same even if its module's dependencies have changed, because we can prove that those changes did not affect the function in any way. I don't think it's realistic for TypeScript specifically though, and I don't think it would bring much practical benefit.
E.g. suppose there is a new version of pthread_mutex_lock(&mutex) which relies on a larger structure with new members in it. The problem is that compiled programs call pthread_mutex_lock(&mutex) with a pointer to the older, smaller structure. If the library worked with that structure using the new definition, it would access out of bounds. Versioning takes care of this; the old clients call a backwards-compatible function. It might work with the new definition, but it avoids touching the new members that didn't exist in the old library.
But this is a very low-level motivation; this same problem of low-level layout information being baked into the contract shouldn't exist in a higher level language.
Regarding "nothing stopping us from making this versioning system completely automated" it seems like that depends on whether your language's type system supports that, and whether programmers follow the rules. For example, if you're relying on varargs/kwargs too much, it's going to be difficult to tell before runtime whether you've broken something.
Find a set of versions that is self-compatible and works, and pin all your versions to those specific versions, with a hash if possible. Upgrade on your schedule, not someone else's. Thoughts?
In practice, it will stay pinned for years until a CVE forces a patch upgrade that ends up triggering a dependency avalanche and weeks or months of headaches.
Package spec puts down what it should work with, you pin a specific version in that range for your app that you've tested.
Otherwise updating things will never happen. Unless you have full separation between upstream dependencies (so you can have multiple versions at the same time), which brings huge questions of its own, a single dep 3 steps away can stop you upgrading.
Ranges also communicate "this doesn't work with later than X" as well.
I don’t know what everyone else’s experience is, but I was updating dependencies either because of bugs identified in old versions, because I wanted a new feature, or because the old version was no longer supported. Pinning a dependency to a fixed version was not an option. Pinning the version of an individual function you use in your code seems just as problematic.
During updates the problem was to update all other dependencies as a result of the update. I can’t see how the proposed approach would solve it.
Another problem which I sometimes faced (less annoying) was an API change, i.e. having to call function B instead of function A, with slightly different parameters. That kind of refactor could be automated and supplied with library upgrades (some libraries already come with automatic migration “scripts”).
(It will of course also take some garbage collection mechanism to eventually remove old, disused versions when nobody depends on them any more.)
Version information is essentially a lossy compression: all the changes that go into a given release are summarized into a handful of numbers. Whether this happens at the component level or the function level only changes how lossy the versioning step is. I am not convinced it improves the workflow described above.
(This problem is not a technical problem.)
To actually get dependencies for our software, we need two mechanisms:
- (a) Some way to precisely specify what we depend on
- (b) Some mechanism to fetch those dependencies
Many package managers (NPM, Maven, etc.) use a third-party server for both, e.g.
- (a) We depend on whatever npm.org returns when we ask for FOO
- (b) Fetch dependency FOO by attempting to HTTP GET https://npm.org/FOO; fail if it's not 200 OK
Delegating so much trust to a HTTP call isn't great; so there's an alternative approach based on "lock files":
- (a) We depend on the name FOO with this hash (usually 'trust on first use', where we find the hash by doing an initial HTTP GET, etc. and store the resulting hash)
- (b) Fetch dependency FOO by looking in these local folders, or checking out these git repos, or doing a HTTP GET against these caches, or against these mirrors, or leeching this torrent, etc. Fail if we can't find anything which matches our hash.
The interesting thing about using lock files and hashes, is that our hash of dependency FOO depends on the contents of its lock file; and that content depends on the contents of FOO's dependencies, including their lock files; and so on.
Hence a lock file is a Merkle tree, which pins all of the transitive dependencies of a package: changing any of those dependencies (e.g. to update) requires altering all of the lock files in-between that dependency and our package. That, in turn, alters our lock file, and hence our package's hash.
The author is complaining that such dependency-cascades require a whole bunch of version numbers to get updated. I think it's better to keep track of these things separately: use your version number as documentation, of major/minor/patch changes; and keep track of dependency trees using a separate, cryptographically-secure hash. The thing is, we already have such hashes: they're called git commit IDs!
Other advantages of identifying transitive dependencies with hashes:
- They're not sequential. Our package isn't "out of date" just because we're using hash 1234 instead of 1235. All that matters are the version numbers. In other words, we're distinguishing between "real" updates (a version number changed) and "propagation" (version numbers stayed the same, but a dependency hash changed).
- They're unstructured; e.g. they give us no information about "major" versus "minor" changes, etc. (and hence no need to decide whether an update is one or the other!)
- They can be auto-generated; e.g. we might forget to update our version number, but there's no way we can forget to update our git commit ID!
- They're eventually-consistent: it doesn't matter how updates 'propagate' through each package; each sub-tree will converge to the same hash (NOTE: for this to work we must only take the content hash, not the full history like a git commit ID!).
For example, take the following ("diamond") dependency tree:
+--> B --+
| |
Our package --> A --+ +--> D
| |
+--> C --+
When D publishes a new version, B and C should update their lock-files; then A should update its lock-file; then we should update our lock-file. However, this may happen in multiple ways:
- B and C update; A updates (getting new hashes from B and C)
- B updates; A updates; C updates; A updates
- C updates; A updates; B updates; A updates
Using version-numbers (or git commit IDs!) would result in different A packages (one increment versus two increments; or commit IDs with different histories). Using content hashes will give A the same hash/lock-file in all three cases. This also means we're free to propagate updates whenever we like, rather than waiting for things to 'stabilise'; and it's safe to use private forks/patches for propagating updates if we like, without fear of colliding version numbers.
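The convergence claim can be sketched in a few lines of TypeScript (package contents invented; using Node's crypto module): each package's hash is a pure function of its own content plus its direct dependencies' hashes, so propagation order cannot matter:

```typescript
import { createHash } from "crypto";

// Merkle-style package hash: own content plus sorted direct-dep hashes.
function pkgHash(content: string, depHashes: string[]): string {
  return createHash("sha256")
    .update(content)
    .update(depHashes.slice().sort().join(","))
    .digest("hex");
}

// Diamond: our package -> A -> {B, C} -> D. D publishes new content:
const d = pkgHash("D v2 contents", []);
const b = pkgHash("B v1 contents", [d]);
const c = pkgHash("C v1 contents", [d]);

// Whether B and C updated together, or A rebuilt twice in between,
// the final lock-file hash of A is the same:
const a1 = pkgHash("A v1 contents", [b, c]);
const a2 = pkgHash("A v1 contents", [c, b]); // different propagation order
console.log(a1 === a2); // true
```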
Note that some of this propagation can be avoided if our build picks a single version of each dependency (e.g. Python requires this for entries in its site-packages directory; and Nixpkgs uses laziness and a fixed-point to defer choosing dependencies until the whole set of packages has been defined)
For example:
GET /user/9893
Accept: application/json; charset=utf8; version=1
No semantic versioning, just bumped the version number for each significant change. And yup, "significant" is in the eye of the caller, but it worked out well.

Now this is a bit different from TFA, because the server supported all the versions at the same time, so the caller could choose whatever mix of versions it wanted. This proposal is about assigning version numbers to individual functions rather than the library as a whole - essentially just a documentation/metadata change, with support from package managers.
Here's why this is relevant: the fact that the API was versioned this way had a big impact on how it evolved over time. At first it was pretty much the same as the usual `v1/user/9893` design. But as new versions of specific resources were added, it forced a decoupling of the underlying data model from the schema that were exposed in the interface. Each endpoint-version became an adaptor layer between the contract it offered to the caller and the more generalized, more abstract functionality offered by the data layer. That had costs as well as benefits. New endpoint versions often required an update to the data layer, which in turn required refactoring of older versions to work with the new data layer while continuing to adhere to their contracts. It worked out well, but it did require a change in implementation strategy.
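A hypothetical TypeScript sketch of that adaptor-layer shape (all names and the data-layer stub are invented): each endpoint version converts between its frozen contract and the shared, more general data layer:

```typescript
// The generalized data layer's record shape (stubbed for illustration).
interface UserRecord { id: number; givenName: string; familyName: string; }

function fetchUser(id: number): UserRecord {
  return { id, givenName: "Ada", familyName: "Lovelace" };
}

// Each endpoint version is a thin adaptor over the data layer.
// version=1 promised a single "name" field; keep honoring that contract.
const handlers: Record<number, (id: number) => object> = {
  1: (id) => {
    const u = fetchUser(id);
    return { id: u.id, name: `${u.givenName} ${u.familyName}` };
  },
  2: (id) => fetchUser(id), // version=2 exposes the newer shape directly
};

// Dispatch on the version parsed from the Accept header.
function getUser(id: number, version: number): object {
  return (handlers[version] ?? handlers[2])(id);
}

console.log(getUser(9893, 1)); // v1 caller still gets a single "name" field
```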
I think the lesson for this proposal is that changing the way package metadata is handled is just the first step. Adopting it could then create pressure for mix and match packaging of the interface functions - "Hey can I get a version of this library with addFunction 1.2.16 and divFunction 2.0.1? I don't want to change all my addition code just to get ZeroDiv protection." That could be done with the right tooling and library design.
Or maybe it makes DLL hell worse because now you have to solve semantic versioning compatibility for every function in a library and that's slower and more sensitive to semantic versioning mistakes. You could get work-arounds like "only ever change one function when you release a new version of the library" or "just bump all the major versions even if they haven't changed."
Or maybe linkers would get built that can do the logic, like "when package A calls package B, use addFunction 1.2.16, but when package C calls package B, use 1.3.1"
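That linker logic could be sketched as a simple per-caller resolution table (TypeScript; package names and versions invented to match the example above):

```typescript
// Hypothetical per-caller pinning: the same function in package B
// resolves to different versions depending on who is calling.
const resolutions: Record<string, string> = {
  "A->B.addFunction": "1.2.16",
  "C->B.addFunction": "1.3.1",
};

function resolve(caller: string, fn: string): string {
  const key = `${caller}->${fn}`;
  if (!(key in resolutions)) throw new Error(`unresolved: ${key}`);
  return resolutions[key];
}

console.log(resolve("A", "B.addFunction")); // "1.2.16"
console.log(resolve("C", "B.addFunction")); // "1.3.1"
```

Of course, the table itself is the hard part: something has to decide these pins and prove the two versions can safely coexist in one process.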
Anyway, I don't think this proposal is sufficient on its own. It would either have ripple effects throughout the language ecosystem, or be ineffective because of developers working around it, or not be adopted at all.
Stopped reading there.
Who cares about pure velocity if you are really trying to communicate something? We shouldn't measure the written word by pure word count, or by how quickly you can ship it. Not everything needs to be just some kind of hyper-advertising.
It just gives the impression that they care only so much about what they wrote, and therefore only so much about their readers!
In this case it may be ok because we may assume the author looked over the result and agrees with it. They could remove the citation as far as I'm concerned, the same way they don't have to cite their spell checker.
But a summary is a distillation of an understanding.
ChatGPT does not understand anything; it is merely pattern-matching against and recomposing other texts.
The only reason the result is even half way sensible is because as of today, most other text that it is matching against and recomposing was written by people who did understand what they were writing and writing about.
So I would perhaps agree that a person using it as part of the process of their own writing is a good use case. But I would not agree that ChatGPT can summarize things, and would not say that letting it do the entire job of interpreting and restating is a good use case.