Also, occasionally a space appears as a capital G (in Chrome). Probably a minor issue.
Question: is there a special ruleset that LLaMA 3 follows, that other LLMs don't, as far as what qualifies as a token?
When you enter the word "what", the three tokens were: a start-of-string token, the token "what", and an end-of-string token. I've now made a change to hide the special start-of-string and end-of-string tokens so that the visualization is a bit simpler.
Adding a space to input changes the tokenization of the input. Sometimes the resulting token count is the same (if the space is merged into some other text), sometimes the resulting token count increases by one (if the space does not get merged).
That part of the tokenizer is working correctly.
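To illustrate the space-merging behavior described above, here is a toy sketch of a greedy BPE-style merge pass. The merge table is entirely hypothetical (it is not the real LLaMA 3 vocabulary); it just shows how a leading space can either fold into an existing token or remain as its own token.

```python
# Toy illustration only: a greedy BPE-style merge pass over characters,
# using a made-up merge table (NOT the real LLaMA 3 merges).
def apply_merges(text, merges):
    """Split text into characters, then repeatedly apply the
    highest-priority (lowest-rank) adjacent-pair merge available."""
    tokens = list(text)
    while True:
        best = None  # (index, merged_pair) of the best-ranked mergeable pair
        for i in range(len(tokens) - 1):
            pair = tokens[i] + tokens[i + 1]
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return tokens
        i, pair = best
        tokens = tokens[:i] + [pair] + tokens[i + 2:]

# Hypothetical merge ranks (lower rank = applied first).
merges = {"wh": 0, "at": 1, "what": 2, "wha": 3, " w": 4, " what": 5}
```

With this table, `apply_merges("what", merges)` and `apply_merges(" what", merges)` both produce a single token (the space merges into " what", so the count stays the same), while `apply_merges(" ?", merges)` produces two tokens, because no merge rule absorbs the space.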
> Also occasionally a space appears as a capital G (in Chrome)
Fixed, thanks for reporting! This is a fork of my earlier tokenizer for LLaMA 1, and the demo visualizer had special handling for tokens 0-256 in LLaMA 1. This LLaMA 3 tokenizer doesn't have the same special tokens, so some tokens were visualized in a weird way (like that G thing you reported). I removed that special handling and it fixed the visualization issue.
> Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?
Different models use different tokenization schemes. Most models use some variant of Byte Pair Encoding (BPE), trained on their own data (the tokenizer itself is trained, not only the language model).
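For a sense of what "training the tokenizer" means, here is a minimal sketch of classic BPE training: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. This is a generic textbook sketch, not the actual LLaMA 3 training procedure.

```python
# Minimal BPE training sketch (generic algorithm, not LLaMA 3's exact recipe):
# repeatedly merge the most frequent adjacent symbol pair in the corpus.
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from a whitespace-split corpus."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the chosen pair merged into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

Because the merges are learned from frequency counts, two models trained on different corpora end up with different vocabularies, which is why the same input text tokenizes differently across models.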
> Different models use different tokenization schemes
Curious, then, why this is called a "LLaMA 3 tokenizer". What does it have to do with LLaMA 3?