Yes you can, albeit it's pretty silly to do so. LLMs are not (inherently) nondeterministic, you're just not pinning the seed, or are using a remote, managed service for them with no reliability and consistency guarantees. [0]
Here, experiment with me [1]:
- download Ollama 0.9.3 (current latest)
- download the model gemma3n e4b (digest: 15cb39fd9394) using the command "ollama run gemma3n:e4b"
- pin the seed to some constant; let's use 42 as an example, by issuing the following command in the ollama interactive session started in the previous step: "/set parameter seed 42"
- prompt the model: "were the original macs really exactly 9 inch?"
It will respond with:
> You're right to question that! The original Macintosh (released in 1984) *was not exactly 9 inches*. It was marketed as having a *9-inch CRT display*, but the actual usable screen size was a bit smaller. (...)
Full response here: https://pastebin.com/PvFc4yH7
The response should be the same over time, across all devices, regardless of whether GPU-acceleration is available.
Bit of an aside, but the overall sentiment echoed in the article reminds me to how visual programming was going to revolutionize everything and take programmers' jobs. With the exception that AI I find actually useful, and was able to integrate it into my workflow.
[0] All of this is to say, to the extent LLM nondeterminism is currently a model trait, it is substituted using a PRNG. Actual nondeterminism is at most an inference engine trait typically instead, see e.g. batched inference.
[1] Details are for experiment reproduction purposes. You can substitute the listed inference engine, model, seed, and prompt with whatever your prefer for your own set of experiments.
Strangely, I'm only getting 2 alternating results every time I restart the model. I was not able to get the same result as you and certainly not with links to external sources. Is there anything else I could do to try to replicate your result?
I've only used ChatGPT prior and it'd be nice to use locally run models with consistent results.
Of course, double checking the basics would be a good thing to cover: ollama --version should return ollama 0.9.3, and the prompt should be copied and pasted to ensure it's byte-exactly matching.
Maybe you could also try querying the model through its API (localhost:11434/api/generate)? I'll ask a colleague to try and repro on his Mac like last time just to double check. I also tried restarting the model a few times, worked as expected here.
*Update:* getting some serious reproducibility issues across different hardware. A month ago the same experiment with regular quantized gemma3 worked fine between GPU, CPU, and my colleague's Mac, this time the responses differ everywhere (although they are consistent between resets on the same hw). Seems like this model may be more sensitive to hardware differences? I can try generating you a response with regular gemma3 12b qat if you're interested in comparing that.
I got back 0.9.3 as well as copied and pasted the prompt (included quotes and no quotes as well just in case...)
I can try the API as well and I'm using a legion 15ach6 but I could also try on my MacBook Pro.
Pretty reasonable if you ask me. All of this was to say, these are still programs, all the regular sensibilities still apply. Heck, even for that, the sensibilities that apply are pretty old-school: in modern, threaded applications, you'd expect runtime behavioral variations. Not the case here. Even for the high-level language compilers referred to in the article, this doesn't apply so easily. The folks over at reproducible builds [0] put in a decent bit of effort to my knowledge to make it happen.
The overarching point being that it's not magic: it's technology. And if you hold them to even broadly similar standards you hold compilers to, they are absolutely deterministic.
In case you mean that if you pick anything else other than what I picked here, the process ceases to be deterministic, that is not true. You can trivially test and confirm that the same way I did.
Where can i download ChatGPT models?
> The folks over at reproducible builds
I like those folks (hello hboeck!), but their work are unrelated to determinstic LLM output, so why even bring them up here?
> The overarching point being that it's not magic: it's technology.
Yea are you even responding to me, or is this just a stream of thought?
Noone said LLM is magic.
Even if we assume there is value in it, why should it replace (even if in part) the previous activity of reliably making computers do exactly what we want?
(Attaching too much value to the person instead of the argument is more of an ‘argument from authority’)
Maybe that does add up to solving harder higher level real world problems (business problems) from a practical standpoint, perhaps that's what you mean rather than technical problems.
Or maybe you're referring to producing software which utilizes LLMs, rather than using LLMs to program software (which is what I think the blog post is about, but we should certainly discuss both.)
If you've never done web-dev, and want to create an web-app, where does that fall? In principle you could learn web-dev in 1 week/month, so technically you could do it.
> maybe you're referring to producing software which utilizes LLMs
but yes, this is what I meant, outsourcing "business logic" to an LLM instead of trying to express it in code.
When I watch juniors struggle they seem to think that it's because they dont think hard enough whereas it's usually because they didnt build enough infrastructure that would prevent them from needing to think too hard.
As it happens, when it comes to programming, LLM unreliabilities seem to align quite closely with ours so the same guardrails that protect against human programmers' tendencies to fuck up (mostly tests and types) work pretty well for LLMs too.
So you trade reliability to get to that extra 20% of hard cases.
A contrived example: there are only 100 MB of disk space left, but 1 GB of logs to write. LLM discards 900 MB of logs and keeps only the most important lines.
Sure, you can nitpick this example, but it's the kind of edge case handling that LLMs can "do something resonable" that before required hard coding and special casing.
Languages are created to support both computers as well as humans. And to most humans, abstractions such as those presented by, say, Hibernate annotations, are as non-deterministic as can be. To the computer it is all the same, but that is increasingly becoming less relevant, given that software is growing and has to be maintained by humans.
So, yes, LLMs are interesting, but not necessarily that much of a game-changer when compared to the mess we are already in.
I suppose he is aiming for a new book and speaker fees from the LLM industrial complex.
The whole point of computers is that they were deterministic, such that any effective method can be automated - leaving humans to do the non-deterministic (and hopefully more fun) stuff.
Why do we want to break this up-to-now hugely successful symbiosis?
Currently I sometimes get predictions where a variable that doesn't exist gets used or a method call doesn't match the signature. The text of the code might look pretty plausible but it's only relatively late that a tool invocation flags that something is wrong.
If instead of just code text, we trained a model on (code text,IR, bytecode) tuples, (byte code, fuzzer inputs, execution trace) examples, and (trace, natural language description) annotations. The model needs to understand not just what token sequences seem likely but (a) what will the code compile to? (b) what does the code _do_ and (c) how would a human describe this behavior? Bonus points for some path to tie in pre/post conditions, invariants, etc
"People need to adapt to weaker abstractions in the LLM era" is a short term coping strategy. Making models that can reason about abstractions in a much tighter loop and higher fidelity loop may get us code generation we can trust.
It's also a huge barrier to adoption by mainstream businesses, which are used to working to unambiguous business rules. If it's tricky for us developers it's even more frustrating to end users. Very often they end up just saying, f* it, this is too hard.
I also use LLM's to write code and for that they are a huge productivity boon. Just remember to test! But I'm noticing that use of LLM's in mainstream business applications lags the hype quite a bit. They are touted as panaceas, but like any IT technology they are tricky to implement. People always underestimate the effort necessary to get a real return, even with deterministic apps. With indeterministic apps it's an even bigger problem.
Counting tokens is the only reliable defence i found to this.
Not every endpoint works the same way, I'm pretty sure LM Studio's OpenAI-compatible endpoints will silently (from the clients perspective) truncate the context, rather than throw an error. It's up to the client to make sure the context fits in those cases.
OpenAI's own endpoints do show an error and refuses if you exceed the context length though. I think I've seen others use the "finish_reason" attribute too to signal the context length was exceeded, rather than setting an error status code on the response.
Overall, even "OpenAI-compatible" endpoints often aren't 100% faithful reproductions of the OpenAI endpoints, sadly.
It would make sense to me for the chat context to raise an exception. Maybe i should read the docs further…
Both humans and LLMs benefit from non-leaky abstractions—they offload low-level details and free up mental or computational bandwidth for higher-order concerns. When, say, implementing a permissioning system for a web app, I can't simultaneously track memory allocation and how my data model choices aligns with product goals. Abstractions let me ignore the former to "spend" my limited intelligence on the latter; same with LLMs and their context limits.
Yes, more intelligence (at least in part) means being able to handle larger contexts, and maybe superintelligent systems could keep everything "in mind." But even then, abstraction likely remains useful in trading depth for surface area. Chris Sawyer was brilliant enough to write Rollercoaster Tycoon in assembly, but probably wouldn't be able to do the same for Elden Ring.
(Also, at least until LLMs are so transcendentally intelligent they outstrip our ability to understand their actions, HLLs are much more verifiable by humans than assembly is. Admittedly, this might be a time-limited concern)
> Also you wouldn’t want that, humans can’t review bytecode
The one great thing about automation (and formalism) is that you don't have to continuously review it. You vet it once, then you add another mechanism that monitors for wrong output/behavior. And now, the human is free for something else.
With reverse translation as needed.
I imagine that at some point they must wonder what their role is, and why the LLM couldn't do all of that independently.
UP: It lets us state intent in plain language, specs, or examples. We can ask the model to invent code, tests, docs, diagrams—tasks that previously needed human translation from intention to syntax.
BUT SIDEWAYS: Generation is a probability distribution over tokens. Outputs vary with sampling temperature, seed, context length, and even with identical prompts.
I don't think that's how you should think about these things being non-deterministic though.
Let's call that technical determinism, and then introduce a separate concept, practical determinism.
What I'm calling practical determinism is your ability as the author to predict (determine) the results. Two different prompts that mean the same thing to me will give different results, and my ability to reason about the results from changes to my prompt is fuzzy. I can have a rough idea, I can gain skill in this area, but I can't gain anything like the same precision as I have reasoning about the results of code I author.
I think the tricky part is that we tend to think that prompts with similar semantic meaning will give the same outputs (like a human), while LLMs can give vastly different outputs if you have one spelling mistake for example, or used "!" instead of "?", the effect varies greatly per model.
To your second part I wouldn't make that assumption - I can see how a non-technical person might, but surely programmers wouldn't? I've certainly produced very different output from that which I intended in boring old C with a mis-placed semi-colon after all!
Trust me, this response would have been totally different if I were in a different mood.
A lot of the complaints that come up on Hacker News are around the idea that a piece of code needs to be elegantly crafted "Just so" for a particular purpose. An efficient algorithm, a perfectly correct program. (Which, sorry but – have you seen most of the software in the world?)
And that's all well and good – I like the craft too. I'm proud of some very elegant code I've written.
But, the writing is on the wall – this is another turning point in computing similar to the personal computer. People scoffed at that too. "Why would regular people want a computer? Their programs will be awful!"
We looked into how we could blend this by integrating natural language within a programming system here https://blog.ballerina.io/posts/2025-04-26-introducing-natur...
and @bwfan123, we did cite Dijkstra as well as Knuth. I believe sometimes you need the rigor of symbolic notation and other times you need the ambiguity of language - it depends on what you are doing and we can let the user decide how and when to blend the two.
It's not about one or the other, but a gentle and harmonious blending
This is the big game changer: we have a programming environment where the program can improve itself. That is something Fortran couldn’t do.
So: impressions of impressions is the foundation for a declaration of fundamental change?
What exactly is this abstraction? why nature? why new?
RESULT: unfounded, ill-formed expression
[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
Two observations:
1. Natural language appears to be to be the starting point of any endeavor.
2.
> It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable, and, in view of the history of mankind, it may not be overly pessimistic to guess that to do the job well enough would require again a few thousand years.
LLMs are trying to replicate all of the intellect in the world.
I’m curious if the author would consider that these lofty caveats may be more plausible today than they were when the text was written.
What is missed by many and highlighted in the article is the following: that there is no way to be "precise" with natural languages. The "operational definition" of precision involves formalism. For example, I could describe to you in english how an algorithm works, and maybe you understand it. But for you to precisely run that algorithm requires some formal definition of a machine model and steps involved to program it.
The machine model for english is undefined ! and this could be considered a feature and not a bug. ie, It allows a rich world of human meaning to be communicated. Whereas, formalism limits what can be done and communicated in that framework.
So when we want deterministic process, we invent a set of labels where each is a singleton. Alongside them is a set of rules that specify how to describe their transformation. Then we invented machines that can interpret those instructions. The main advantage was that we know the possible outputs (assuming a good reliability) before we even have to act.
LLMs don't work so well in that regard, as while they have a perfect embedding of textual grammar rules, they don't have a good representation for what those labels refers to. All they have are relations between labels and how likely are they used together. But not what are the sets that those labels refer to and how the items in those sets interact.
We want software to be operationally precise because it allows us to build up towers of abstractions without needing to worry about leaks (even the leakiest software abstraction is far more watertight than any physical "abstraction").
But, at the level of the team or organization that's _building_ the software, there's no such operational precision. Individuals communicating with each other drop down to such precision when useful, but at any endeavor larger than 2-3 people, the _vast_ majority of communication occurs in purely natural language. And yet, this still generates useful software.
The phase change of LLMs is that they're computers that finally are "smart" enough to engage at this level. This is fundamentally different from the world Dijkstra was living in.
Not actually true. Fuzzing and mutation testing have been here for a while.
Otherwise yeah, there are a bunch of non-deterministic technologies, processes and workflows missing, like what Machine Learning folks been doing for decades, which is also software and non-deterministic, but also off-topic from context of the article, as I read it.
This is not the first rodeo of our profession with non-determinism.
Javascript is source code that might be interpreted or might output html target code (by Dom manipulation)
Typescript compiles to javascript.
Now javascript is both source and target code. If you upload javascript code that was generated by ts to your repo and you leave out your ts, that's bad.
Similarly, an LLM has english (or any natural language) as it's source code and typescript (or whatever programming language) as its target code. You shouldn't upload your target code to your repo, and you shouldn't consider it source code.
It's interesting that the compiler in this case is non deterministic, but it doesn't change the fact that the prompts are source code, the vibecode is target code.
I have a repo that showcases this
I could say it's a lab, right ?
I do hope he takes the time to get good with them!
I dunno, sometimes it's helpful to learn about the perspectives of people who've watched something from afar as well, especially if they already have broad knowledge and context that is adjacent to the topic itself, and have lots of people around them deep in the trenches that they've discussed with.
A bit like historians still can provide valuable commentary on wars, even though they (probably) haven't participated in the wars themselves.
What are the new semantics and how are the lower levels precisely implemented?
Spoken language isn’t precise enough for programming.
I’m starting to suspect what people are excited about is the automation.
It's more search and act based on the first output. You don't know what's going to come out, you just hope it will be good. The issue is that that the query is fed to the output function. So what you get is a mixture of what is a mixture of what you told it and what's was stored. Great if you can separate the two afterwards, not so if the output is tainted by the query.
With automation, what you seek is predictability. Not an echo chamber.
ADDENDUM
If we continue with the echo chamber analogy:
Prompt Engineering: Altering your voice so that the result back is more pleasant
System Prompt: The echo chamber's builders altering the configuration to get the above effects
RAG: Sound effects
Agent: Replace yourself in front of the echo chamber with someone/something that act based on the echo.
No thanks. Let's not give up determinism for vague promises of benefits "few of us understand yet".
Weather forecasts are a good example of this.
I wouldn't be surprised to find out different stacks multiple fp16s slightly differently or something. Getting determinism across machines might take some work... but there's really nothing magic going on here.
There were definitely compilers that used things like data-structures with an unstable iteration order resulting in non-determinism, and people went stopping other people from doing that. This behavior would result in non-deterministic performance everywhere, and combined with race conditions or just undefined behavior other random non-deterministic behaviors too.
At least in part this was achieved with techniques that can be used to make LLMs to, like by seeding RNGs in hash tables deterministically. LLMs are in that sense no less deterministic than iterating over a hash table (they are just a bunch of matrix multiplications with a sampling procedure at the end, after all).
Because the human brain is also non-deterministic. If you ask a software engineer the same question on different days, you can easily get different answers.
So I think what we want from LLMs is not determinism, just as that's not really what you'd want from a human. It's more about convergence. Non-determinism is ok, but it shouldn't be all over the map. If you ask the engineer to talk through the best way to solve some problem on Tuesday, then you ask again on Wednesday, you might expect a marginally different answer considering they've had time to think on it, but you'd also expect quite a lot of consistency. If the second answer went in a completely different direction, and there was no clear explanation for why, you'd probably raise an eyebrow.
Similarly, if there really is a single "right" answer to a question, like something fact-based or where best practices are extremely well established, you want convergence around that single answer every time, to the point that you effectively do have determinism in that narrow scope.
LLMs struggle with this. If you ask an LLM to solve the same problem multiple times in code, you're likely to get wildly different approaches each time. Adding more detail and constraints to the prompt helps, but it's definitely an area where LLMs are still far behind humans.
If you run an LLM with optimization turned on on a NVIDIA GPU then you can get non-deterministic results.
But, this is a choice.