Show HN: LLaMA 3 tokenizer runs in the browser

(belladoreai.github.io)

10 points | belladoreai | 1y ago | 12 comments

_akhe1y ago

I'm not sure it's working correctly, I entered the word "what" and it says "4 characters, 3 tokens", I type a space and it says "4 tokens" - shouldn't it just be 1 token? and the space shouldn't count in this case?

Also occasionally a space appears as a capital G (in Chrome)

Probably a minor issue. Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?

belladoreaiOP1y ago

> I'm not sure it's working correctly, I entered the word "what" and it says "4 characters, 3 tokens", I type a space and it says "4 tokens" - shouldn't it just be 1 token? and the space shouldn't count in this case?

When you enter the word "what", the 3 tokens were: start-of-string token, the token "what", and end-of-string token. I made a change now to hide the special start-of-string and end-of-string tokens so that the visualization is a bit simplified.
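
As a rough illustration of the counting described above, here is a minimal sketch. The names `BOS`, `EOS`, `with_special_tokens` and `visible_tokens` are hypothetical placeholders and not part of llama3-tokenizer-js; the point is only that wrapping a single token in start and end markers gives a count of 3, and hiding the markers brings it back to 1.

```python
# Placeholder marker strings; these are assumptions, not the real LLaMA 3 special tokens.
BOS, EOS = "<bos>", "<eos>"


def with_special_tokens(tokens):
    """Wrap a token list in start-of-string and end-of-string markers."""
    return [BOS, *tokens, EOS]


def visible_tokens(tokens):
    """Drop the special markers, as a visualizer might do for display."""
    return [t for t in tokens if t not in (BOS, EOS)]


raw = with_special_tokens(["what"])
print(len(raw))                  # 3: start marker, "what", end marker
print(len(visible_tokens(raw)))  # 1: only "what"
```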

Adding a space to the input changes its tokenization. Sometimes the resulting token count stays the same (if the space is merged into some other text), and sometimes it increases by one (if the space does not get merged).

That part of the tokenizer is working correctly.
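
The effect of a space on the token count can be sketched with a toy merge-based encoder. This is a simplified illustration under stated assumptions: the merge list is invented for the example, the encoder works on characters rather than bytes, and it skips the pre-tokenization step that real tokenizers apply, so it is not the LLaMA 3 algorithm itself.

```python
def bpe_encode(text, merges):
    """Start from single characters and repeatedly apply the best-ranked merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)
    while len(tokens) > 1:
        # Rank every adjacent pair; pairs without a merge rule get infinity.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge rule is left
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical merge rules: a leading space can merge into "what", a trailing one cannot.
toy_merges = [("w", "h"), ("a", "t"), ("wh", "at"), (" ", "what")]

print(bpe_encode("what", toy_merges))   # ['what']
print(bpe_encode(" what", toy_merges))  # [' what']
print(bpe_encode("what ", toy_merges))  # ['what', ' ']
```

With these toy rules, a leading space is absorbed into the word (count stays at 1), while a trailing space has nothing to merge with and becomes its own token (count goes to 2).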

> Also occasionally a space appears as a capital G (in Chrome)

Fixed, thanks for reporting! This is a fork of my earlier tokenizer for LLaMA 1, and the demo visualizer had special handling for tokens 0-256 in LLaMA 1. This LLaMA 3 tokenizer doesn't have the same special tokens, so some tokens would be visualized in a weird way (like that G thing you reported). I removed that special handling now and it fixed the visualization issue.

> Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?

Different models use different tokenization schemes. Most models use some variant of Byte Pair Encoding, trained on their own data (the tokenizer itself is also trained, not only the language model).
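
To make "the tokenizer itself is also trained" concrete, here is a minimal sketch of how Byte Pair Encoding can learn merge rules from text. It is a toy, character-level version with made-up corpora, and `learn_merges` is a hypothetical helper; production tokenizers work on bytes, use far larger corpora, and differ in many details.

```python
from collections import Counter


def learn_merges(corpus, num_merges):
    """Learn up to num_merges merge rules by repeatedly merging the most frequent adjacent pair."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for word in words:
            counts.update(zip(word, word[1:]))
        if not counts:
            break  # every word is already a single token
        pair = max(counts, key=counts.get)
        merges.append(pair)
        merged = pair[0] + pair[1]
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges


# Two invented corpora: different training data leads to different merge rules.
corpus_a = ["what", "what", "when", "where"]
corpus_b = ["that", "this", "then"]

print(learn_merges(corpus_a, 1))  # [('w', 'h')]
print(learn_merges(corpus_b, 1))  # [('t', 'h')]
```

Because the learned merges depend on the training text, two models trained on different data end up splitting the same input into different tokens.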

_akhe1y ago

Hm, I had not heard of tokenizing like that. Typically it's just words, or occasionally a word plus some adjacent stuff like punctuation or a space. "What " might be a different token than "What", but the total token count shouldn't increment; it would just be a different token, right?

> Different models use different tokenization schemes

Curious, then, why this is called "LLaMA 3 tokenizer". What does it have to do with llama3?

belladoreaiOP1y ago

GitHub link: https://github.com/belladoreai/llama3-tokenizer-js

mrbishalsaha1y ago

Really good. I am actually using js-tiktoken and wish there was a package to handle all the other LLMs as well, but this is still something I can work with.

belladoreaiOP1y ago

If you need to work with multiple LLMs, you probably want to use transformers.js

mrbishalsaha1y ago

Isn't it too much for just calculating the number of tokens?
