Typically, you're interested in getting more than just one token out of them, so you run them in a loop. You start with the token list containing just the user's message, run the LLM with that list, append the newly generated token to the end, run the LLM again to see what comes after that new token, and so on, until a special end-of-sequence token is generated and the loop terminates.
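The loop described above can be sketched like this. The "model" here is a hypothetical stand-in (a function that maps the token list so far to the next token); a real LLM would return a probability distribution to sample from instead:

```python
EOS = "<eos>"  # hypothetical end-of-sequence token; real models define their own

def toy_model(tokens):
    # Stand-in for the LLM: emits a canned reply one token per call,
    # then the end-of-sequence token.
    reply = ["Hello", "world", EOS]
    generated_so_far = len(tokens) - 1  # tokens after the user's message
    return reply[generated_so_far]

def generate(user_message):
    tokens = [user_message]              # start with just the user's message
    while True:
        next_token = toy_model(tokens)   # run the "LLM" on the list so far
        if next_token == EOS:            # special token: the loop terminates
            break
        tokens.append(next_token)        # append and go again
    return tokens[1:]                    # the completion, minus the prompt

print(generate("Hi"))  # → ['Hello', 'world']
```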
There's no reason why you have to start with nothing but the user's message, though. You can let the user specify the beginning of the LLM's supposed completion and ask the LLM to continue from that point, instead of generating its completion from scratch. This essentially guarantees that what the LLM says begins with a specific string.
Not all APIs expose this feature (there are good safety reasons not to), but all LLMs are capable of it in principle, and doing it with open models is trivial.
LLMs are typically trained to output Markdown, which uses ```language_name to open a code block in language_name, so that user interfaces like ChatGPT's web UI can do proper syntax highlighting.
Therefore, if you make your LLM believe that it has already started a completion, and that the completion began with ```json, it will predict what's most likely to come after that delimiter: a JSON block.
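A minimal sketch of how such a prefilled prompt might be assembled. The chat-template markers here are illustrative assumptions, not any particular model's format (each model family defines its own):

```python
def build_prompt(user_message, prefill=""):
    # Hypothetical chat-template markers; real models each define their own.
    return (
        "<|user|>\n" + user_message + "\n"
        + "<|assistant|>\n" + prefill
        # Note: no end-of-turn marker after the prefill, so the model
        # treats it as the start of its own answer and continues it.
    )

prompt = build_prompt("Give me three primes.", prefill="```json\n")
print(prompt)
```

Whatever the model generates from this prompt is appended directly after the prefill, so its answer necessarily starts with the ```json delimiter.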
You turn the crank and you get a probability distribution for the next token in the sequence. You then sample the distribution to get the next token, append it to the vector, and do it again and again.
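The sampling step can be sketched as below, with a made-up toy distribution standing in for the model's output (real decoders layer temperature, top-k, and similar tweaks on top of this):

```python
import random

def sample_next(dist, rng):
    # dist maps token -> probability; pick one in proportion to its weight.
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
toy_dist = {"cat": 0.7, "dog": 0.2, "<eos>": 0.1}  # made-up model output
token = sample_next(toy_dist, rng)  # one of the three tokens, usually "cat"
```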
Thus the typical LLM has no memory as such; it infers what it was "thinking" by looking at what it has already said, and uses that to figure out what to say next, so to speak.
The characters in the input prompt are converted to these tokens, but there are also special tokens such as start of input, end of input, start of output and end of output. The end of output token is how the LLM "tells you" it's done talking.
Normally in a chat scenario these special tokens are inserted by the LLM front-end, say Ollama/llama.cpp in this case.
However, if you interface more directly, you need to add these yourself, and hence can prefill the output before feeding the vector to the LLM for the first time. The LLM will then "think" it has already started writing code, say, and is likely to continue doing so.
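Putting the last few paragraphs together, building the sequence by hand might look like this. The special-token strings are hypothetical (real models use their own token IDs and spellings):

```python
# Hypothetical special-token strings; real models define their own.
SOI, EOI, SOO = "<start_of_input>", "<end_of_input>", "<start_of_output>"

def build_sequence(user_tokens, output_prefill=()):
    # A chat front-end would normally insert these markers itself; doing it
    # by hand lets us pre-seed the output before the first forward pass.
    return [SOI, *user_tokens, EOI, SOO, *output_prefill]

seq = build_sequence(["write", "a", "sort"], output_prefill=["def"])
# The model sees an output that already begins with "def", so it is
# likely to continue generating code from there.
```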
In either case you can "prime it" as was suggested.
A regular RNN has more feedback[3], like each layer feeding back to itself, as I understand it.
Happy to be corrected though.
[1]: https://jalammar.github.io/illustrated-gpt2/#one-difference-...
[2]: https://medium.com/@ikim1994914/understanding-the-modern-llm...
[3]: https://karpathy.github.io/2015/05/21/rnn-effectiveness/