There are two divergent analogies to analyze. Compared to programming software, LLMs are experientially closer to the malleability of interacting with humans on the one hand, while functionally, in terms of entailment, they are closer to hardware on the other. Using symbols that do not overlap with the above: in a computer, the hardware W entails that the execution of software S can yield output o from input i. Programmers are free to edit S and to experiment with their i/o data. However, it is W, unentailed internally, that defines the boundaries of what is possible for a user: changing W requires hardware engineers and new designs. If W is a desktop computer, programming car engine simulations as S will not turn W into a car. This playful example is meant to indicate that the causation cycle behind a functional component such as W, which is unentailed internally at the user/computer level, is inherently layered. Similarly, the interactions f has had, recorded in H, might be considered fully, minimally, or not at all in the next release. In the case of computers, a user cannot modify W, but hardware engineers can, with full control. With LLMs, not even the researchers can fully control f, for example to guarantee that violent themes are never discussed, even under impersonations or a supposed developer mode.
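To make the layering concrete, here is a minimal Python sketch of the computer analogy; the names Hardware and run are illustrative assumptions, not anything from the systems discussed, and the point is only structural: S sits inside what W makes possible.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Hardware:
    """W: fixed at the user/programmer level; changing it needs engineers."""
    word_size: int = 64

def run(W: Hardware, S: Callable[[int], int], i: int) -> int:
    """W entails that executing software S on input i yields output o."""
    return S(i)

def S(i: int) -> int:
    # The programmer is free to edit S and experiment with i/o data.
    return i + 1

W = Hardware()          # a user cannot modify W
o = run(W, S, i=41)     # o == 42, entailed by W executing S
```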
The reason it is important to remain aware that f does not necessarily coevolve with all of the provided H is the social ease of overlooking how each component is entailed in the current mainstream paradigm. In LLMs, f literally remains unchanged with each interaction, yet a common impression is that we can affect LLMs by chatting with them, since the ChatGPT UX is strikingly similar to the experience of chatting with a human. It feels plausible that merely talking to LLMs will have an effect as strong as, or even stronger than, editing code, because when talking to an intelligent human, H can indeed affect their biological f. However, the analogy that holds most strongly for LLMs is that the entailment of f is close to that of hardware W: changing f is much more akin to asking hardware engineers for edits to W, with the caveat of giving up editing precision and accuracy, especially when attempting to mask or remove harmful information without retraining from scratch. It is true that feedback from H will affect future releases f', but 1) within each release, f remains as immutable as W throughout the interactions, 2) feedback is integrated slowly and with delays, 3) editing is not generally available to users directly, except for more superficial fine-tuning, 4) feedback is orders of magnitude less impactful than the training corpora and design decisions behind foundation models, and 5) unlike with designing actual hardware W, even when calling the whole of ChatGPT f, across its many releases, changing f is neither precise nor accurate in the way one would imagine an editing process to be, and in the way modifying software S directly is.
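The claim that f literally does not change can also be made concrete with a small sketch. The class and method names below (FrozenLLM, answer) are illustrative assumptions; the only point is structural: the chat history H enters the computation as part of the input context, never as an update to the weights that play the role of f.

```python
class FrozenLLM:
    """A stand-in for f: the weights are fixed at release time."""

    def __init__(self, weights):
        self._weights = weights                 # f itself

    def answer(self, history: list[str], prompt: str) -> str:
        # H only ever appears inside the input context...
        context = "\n".join(history + [prompt])
        return self._generate(context)          # ...the weights are read, never written

    def _generate(self, context: str) -> str:
        # Decoding with the frozen weights would happen here; placeholder output.
        return "<generated answer>"

f = FrozenLLM(weights=...)   # every chat, by every user, runs against the same f
# Talking to f appends to H; it does not produce a new f. Only a new training
# run, performed by the provider, yields a different release f'.
```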
Returning to the jailbreak article and using the parallel with the hardware analogy, I assume that the legislators' intention is ultimately to change f, the source of the causation that entails all answers A in the chat history H, both by further fine-tuning the models and by adding a list of chat guidelines that are fed as part of the prompt. However, due to the entailment hierarchy of the general architecture, the latter attempt has zero impact on f itself, and therefore does not address the issue at a fundamental level, while the former strategy will have only a limited effect, due to a fundamental lack of direct control over f: researchers have no way to steer f precisely and accurately, and users are even further removed from affecting it.
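Continuing the sketch above, the two strategies land at different levels of the entailment hierarchy. GUIDELINES, fine_tune, and train below are hypothetical names used purely for illustration of that difference.

```python
GUIDELINES = "Follow the chat guidelines: ..."   # hypothetical policy text

def answer_with_guidelines(f: FrozenLLM, history: list[str], prompt: str) -> str:
    # Strategy 1: guidelines are prepended to the input. They live entirely at
    # the level of i and H, so f itself is untouched; a jailbreak prompt that
    # rewrites the context operates at exactly the same level.
    return f.answer([GUIDELINES] + history, prompt)

def train(weights, feedback):
    """Stand-in for a training run performed by the provider (placeholder)."""
    return weights

def fine_tune(f: FrozenLLM, feedback) -> FrozenLLM:
    # Strategy 2: further training does change the weights, but only through
    # coarse gradient updates; there is no surgical edit of the form
    # "remove topic X". The result is a new release f', not a precise edit of f.
    return FrozenLLM(train(f._weights, feedback))
```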
This is not meant to be dismissive of the current achievements; the results are impressive and the techniques are steadily improving. Rather, it is a critical look at the entailment structures at hand, both perceived and observed, and at the available strategies regarding safeguards. I find it interesting to ask: how could f be made functionally closer to programming S than to W?