Besides increasing the vocabulary size, one way to use “fewer tokens” for a given task is to adjust how the tokenizer is trained with respect to that task.
If you increase the proportion of non-English text in the tokenizer's training dataset, more of the vocabulary will end up covering non-English words and subwords.
The previous tokenizer infamously required many more tokens to express a given concept in Japanese than in English. This is likely because the data the tokenizer was trained on (which is not necessarily the same data the GPT model is trained on) contained a much larger proportion of English text.
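The mechanism is easy to see with a toy model. A tokenizer whose merges were learned mostly from English text has long, efficient entries for English strings, but must fall back to something near single UTF-8 bytes for scripts it rarely saw, and Japanese characters are three bytes each in UTF-8. The sketch below is purely illustrative, not a real BPE implementation: the vocabulary and the greedy longest-match scheme are hypothetical stand-ins for a trained tokenizer.

```python
# Hypothetical vocabulary, standing in for merges learned from
# an English-heavy training corpus.
ENGLISH_HEAVY_VOCAB = {"Hello", " world", " ", ",", "!"}

def count_tokens(text: str, vocab: set) -> int:
    """Greedy longest-match tokenization with a single-byte fallback,
    mimicking how byte-level tokenizers handle unseen scripts."""
    tokens = 0
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match:
            tokens += 1
            i += len(match)
        else:
            # Unknown character: one token per UTF-8 byte.
            tokens += len(text[i].encode("utf-8"))
            i += 1
    return tokens

print(count_tokens("Hello world", ENGLISH_HEAVY_VOCAB))  # 2 tokens
print(count_tokens("こんにちは", ENGLISH_HEAVY_VOCAB))     # 15 tokens: 5 chars x 3 bytes
```

Five Japanese characters cost 15 tokens under byte fallback while a common eleven-character English phrase costs 2, which is the same shape of disparity observed in practice. Adding Japanese text to the tokenizer's training data creates merges for common Japanese strings and closes much of this gap.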
Presumably the new tokenizer was trained on data with a higher proportion of non-English text and a lower proportion of non-language content.