> "What " might be a different token than "What" but the total token count shouldn't increment, would just be a different token, right?
The input string "What" (without a trailing space) tokenizes into 1 token, while "What " tokenizes into 2 tokens. In principle, a tokenizer could map "What " to a single token, but the tokenizers we actually have split it into at least 2 tokens.
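As a hedged, self-contained sketch of why this happens (using a tiny made-up vocabulary, not the real LLaMA 3 one): BPE-style tokenizers typically fold a *leading* space into the following word, so a *trailing* space has nothing to merge with and ends up as its own token. A greedy longest-match toy tokenizer shows the effect:

```python
# Toy vocabulary invented for illustration -- NOT the real LLaMA 3 vocab.
# Note " What" (leading space) is a merged token, but "What " (trailing
# space) is not, mirroring how real BPE vocabularies tend to be built.
TOY_VOCAB = {"What", " What", " "}

def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Single-character fallback (stand-in for byte-level fallback).
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("What"))   # ['What']       -> 1 token
print(toy_tokenize("What "))  # ['What', ' ']  -> 2 tokens
```

So the trailing space doesn't just swap one token for another; it genuinely adds a token to the count.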
> Curious then why this is called "LLaMA 3 tokenizer" what does it have to do with llama3?
When you input text into any of the LLaMA 3 models, the first step in the process is tokenizing your input. This library is called "LLaMA 3 tokenizer" because it produces the same tokenization as the official LLaMA 3 repo.
When I said that different models use different tokenization schemes, I meant in comparison to other models, such as LLaMA 1 or GPT-4. Different models use different tokenizers, so the same text is tokenized into different tokens depending on whether you're using GPT-4, LLaMA 3, or what not.
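To make that concrete with a hedged, self-contained sketch (the two vocabularies below are invented for illustration and don't match any real model's vocab), the same greedy longest-match idea splits identical text differently depending on which vocabulary is in use:

```python
# Two made-up vocabularies standing in for two different models'
# tokenizers. Invented for illustration only.
VOCAB_A = {"token", "ization"}           # "model A" learned bigger merges
VOCAB_B = {"tok", "en", "iz", "ation"}   # "model B" learned smaller merges

def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match against the given vocab, char fallback."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # unknown character falls back to itself
            i += 1
    return out

print(greedy_tokenize("tokenization", VOCAB_A))  # ['token', 'ization']
print(greedy_tokenize("tokenization", VOCAB_B))  # ['tok', 'en', 'iz', 'ation']
```

This is why a token counter built for one model gives wrong counts for another: the split points (and therefore the token count) depend entirely on that model's vocabulary.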