> "What " might be a different token than "What" but the total token count shouldn't increment, would just be a different token, right?
The input string "What" (without a trailing space) tokenizes into 1 token, while "What " tokenizes into 2 tokens. In principle, a tokenizer could map "What " to a single token, but the tokenizers we actually have split it into at least 2 tokens.
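As a hedged, self-contained sketch of why this happens (using a tiny made-up vocabulary, not the real LLaMA 3 one): BPE-style tokenizers typically fold a *leading* space into the following word, so a *trailing* space has nothing to merge with and ends up as its own token. A greedy longest-match toy tokenizer shows the effect:

```python
# Toy vocabulary invented for illustration -- NOT the real LLaMA 3 vocab.
# Note " What" (leading space) is a merged token, but "What " (trailing
# space) is not, mirroring how real BPE vocabularies tend to be built.
TOY_VOCAB = {"What", " What", " "}

def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Single-character fallback (stand-in for byte-level fallback).
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("What"))   # ['What']       -> 1 token
print(toy_tokenize("What "))  # ['What', ' ']  -> 2 tokens
```

So the trailing space doesn't just swap one token for another; it genuinely adds a token to the count.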
> Curious then why this is called "LLaMA 3 tokenizer" what does it have to do with llama3?
When you input text into any of the LLaMA 3 models, the first step in the process is tokenizing your input. This library is called "LLaMA 3 tokenizer" because it produces the same tokenization as the official LLaMA 3 repo.
When I said that different models use different tokenization schemes, I meant in comparison to other models, such as LLaMA 1 or GPT-4. Different models use different tokenizers, so the same text is tokenized into different tokens depending on whether you're using GPT-4, LLaMA 3, or what not.
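To make that concrete with a hedged, self-contained sketch (the two vocabularies below are invented for illustration and don't match any real model's vocab), the same greedy longest-match idea splits identical text differently depending on which vocabulary is in use:

```python
# Two made-up vocabularies standing in for two different models'
# tokenizers. Invented for illustration only.
VOCAB_A = {"token", "ization"}           # "model A" learned bigger merges
VOCAB_B = {"tok", "en", "iz", "ation"}   # "model B" learned smaller merges

def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match against the given vocab, char fallback."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # unknown character falls back to itself
            i += 1
    return out

print(greedy_tokenize("tokenization", VOCAB_A))  # ['token', 'ization']
print(greedy_tokenize("tokenization", VOCAB_B))  # ['tok', 'en', 'iz', 'ation']
```

This is why a token counter built for one model gives wrong counts for another: the split points (and therefore the token count) depend entirely on that model's vocabulary.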