Also occasionally a space appears as a capital G (in Chrome)
Probably a minor issue. Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?
When you enter the word "what", the 3 tokens were: start-of-string token, the token "what", and end-of-string token. I made a change now to hide the special start-of-string and end-of-string tokens so that the visualization is a bit simplified.
Adding a space to input changes the tokenization of the input. Sometimes the resulting token count is the same (if the space is merged into some other text), sometimes the resulting token count increases by one (if the space does not get merged).
That part of the tokenizer is working correctly.
> Also occasionally a space appears as a capital G (in Chrome)
Fixed, thanks for reporting! This is a fork of my earlier tokenizer for LLaMA 1 and the demo visualizer had special handling for tokens 0-256 in LLaMA 1. This LLaMA 3 tokenizer doesn't have same special tokens, so some tokens would be visualized in a weird way (like that G thing you reported). I removed that special handling now and it fixed the visualization issue.
> Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?
Different models use different tokenization schemes. Most models use some kind of variant of Byte Pair Encoding, trained with their data (the tokenizer itself is also trained, not only the language model).
> Different models use different tokenization schemes
Curious then why this is called "LLaMA 3 tokenizer" what does it have to do with llama3?
The input string "What" (without trailing space) tokenizes into 1 token. The input string "What " tokenizes into 2 tokens. In theory, one might have a tokenizer that would simply tokenize "What " into a single token, but the actual tokenizers we have will tokenize that into at least 2 tokens.
> Curious then why this is called "LLaMA 3 tokenizer" what does it have to do with llama3?
When you input text into any of the LLaMA 3 models, the first step in the process is tokenizing your input. This library is called "LLaMA 3 tokenizer", because it produces the same tokenization as the official LLaMA 3 repo.
When I said that different models use different tokenization schemes, I am talking in comparison to other models, such as LLaMA 1, or GPT-4. Different models use different tokenizers, so the same text is tokenized into different tokens depending on if you're using GPT-4 or LLaMA 3 or what not.