Typically, you're interested in getting more than just one token out of them, so you run them in a loop. You start with the token list containing just the user's message, run the LLM with that list, append the newly generated token to the end, run the LLM again to see what comes after that new token, and so on, until a special end-of-sequence token is generated and the loop terminates.
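The loop described above can be sketched like this. The "model" here is a hypothetical stand-in (a function that maps the token list so far to the next token); a real LLM would return a probability distribution to sample from instead:

```python
EOS = "<eos>"  # hypothetical end-of-sequence token; real models define their own

def toy_model(tokens):
    # Stand-in for the LLM: emits a canned reply one token per call,
    # then the end-of-sequence token.
    reply = ["Hello", "world", EOS]
    generated_so_far = len(tokens) - 1  # tokens after the user's message
    return reply[generated_so_far]

def generate(user_message):
    tokens = [user_message]              # start with just the user's message
    while True:
        next_token = toy_model(tokens)   # run the "LLM" on the list so far
        if next_token == EOS:            # special token: the loop terminates
            break
        tokens.append(next_token)        # append and go again
    return tokens[1:]                    # the completion, minus the prompt

print(generate("Hi"))  # → ['Hello', 'world']
```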
There's no reason why you have to start with nothing but the user's message, though. You can let the user specify the beginning of the LLM's supposed completion and ask the LLM to continue from that point, instead of generating its completion from scratch. This essentially guarantees that what the LLM says begins with a specific string.
Not all APIs expose this feature (there are good safety reasons not to), but all LLMs are capable of it in principle, and doing it with open models is trivial.
LLMs are typically trained to output Markdown, which uses ```language_name to open a code block in language_name, so that user interfaces like ChatGPT's web UI can do proper syntax highlighting.
Therefore, if you make your LLM believe that it has already started a completion, and that the completion began with ```json, it will predict what's most likely to come after that delimiter: a JSON block.
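A minimal sketch of how such a prefilled prompt might be assembled. The chat-template markers here are illustrative assumptions, not any particular model's format (each model family defines its own):

```python
def build_prompt(user_message, prefill=""):
    # Hypothetical chat-template markers; real models each define their own.
    return (
        "<|user|>\n" + user_message + "\n"
        + "<|assistant|>\n" + prefill
        # Note: no end-of-turn marker after the prefill, so the model
        # treats it as the start of its own answer and continues it.
    )

prompt = build_prompt("Give me three primes.", prefill="```json\n")
print(prompt)
```

Whatever the model generates from this prompt is appended directly after the prefill, so its answer necessarily starts with the ```json delimiter.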
You turn the crank and you get a probability distribution for the next token in the sequence. You then sample the distribution to get the next token, append it to the vector, and do it again and again.
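The sampling step can be sketched as below, with a made-up toy distribution standing in for the model's output (real decoders layer temperature, top-k, and similar tweaks on top of this):

```python
import random

def sample_next(dist, rng):
    # dist maps token -> probability; pick one in proportion to its weight.
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
toy_dist = {"cat": 0.7, "dog": 0.2, "<eos>": 0.1}  # made-up model output
token = sample_next(toy_dist, rng)  # one of the three tokens, usually "cat"
```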
Thus the typical LLM has no memory as such; it infers what it was "thinking" by looking at what it has already said, and uses that to figure out what to say next, so to speak.
The characters in the input prompt are converted to these tokens, but there are also special tokens such as start of input, end of input, start of output and end of output. The end of output token is how the LLM "tells you" it's done talking.
Normally in a chat scenario these special tokens are inserted by the LLM front-end, say Ollama/llama.cpp in this case.
However, if you interface more directly, you need to add these yourself, and hence can prefill the output before feeding the vector to the LLM for the first time. The LLM will then "think" it has already started writing code, say, and is likely to continue doing so.
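Putting the last few paragraphs together, building the sequence by hand might look like this. The special-token strings are hypothetical (real models use their own token IDs and spellings):

```python
# Hypothetical special-token strings; real models define their own.
SOI, EOI, SOO = "<start_of_input>", "<end_of_input>", "<start_of_output>"

def build_sequence(user_tokens, output_prefill=()):
    # A chat front-end would normally insert these markers itself; doing it
    # by hand lets us pre-seed the output before the first forward pass.
    return [SOI, *user_tokens, EOI, SOO, *output_prefill]

seq = build_sequence(["write", "a", "sort"], output_prefill=["def"])
# The model sees an output that already begins with "def", so it is
# likely to continue generating code from there.
```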
In either case you can "prime it" as was suggested.
A regular RNN has more feedback[3], like each layer feeding back to itself, as I understand it.
Happy to be corrected though.
[1]: https://jalammar.github.io/illustrated-gpt2/#one-difference-...
[2]: https://medium.com/@ikim1994914/understanding-the-modern-llm...
[3]: https://karpathy.github.io/2015/05/21/rnn-effectiveness/