undefined | Better HN

0 points · gsuuon · 2y ago · 0 comments

I'd really like to see smaller models trained on only one specific language, each with its own language-specific tokenizer. I imagine the reduction in vocab size would translate to handling more context more easily?

thewataccount · 2y ago

I think simply making the vocab more code-friendly (e.g. Codex) would make the biggest difference. Whitespace is the biggest one (afaik every space is a token), but consider how many languages contain `for(int i=0;`, `) {\n`, `} else {`, `import `, etc.
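To make the whitespace point concrete, here is a rough toy sketch. It uses a greedy longest-match splitter, which is not how real BPE tokenizers work, and both vocabularies are made up for illustration. It only shows the direction of the effect: adding whitespace runs and common code fragments to the vocab shrinks the token count for the same snippet.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocab entry that matches;
    fall back to a single character when nothing matches."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy vocabularies, not taken from any real model.
BASE_VOCAB = {"for", "int", "else", "import", "i"}
CODE_VOCAB = BASE_VOCAB | {"    ", "        ", "for(int i=0;", ") {\n", "} else {"}

snippet = "for(int i=0;i<n;i++) {\n        x();\n    } else {\n"

base_tokens = greedy_tokenize(snippet, BASE_VOCAB)
code_tokens = greedy_tokenize(snippet, CODE_VOCAB)
print(len(base_tokens), len(code_tokens))
```

With the base vocab every space of indentation is its own token, while the code-friendly vocab covers each indent run with a single entry.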

My understanding is that a model properly trained on multiple languages will beat an expert-based system. I feel like programming languages overlap and interop with each other enough that I wouldn't want to specialize it in just one language.

gsuuon (OP) · 2y ago

There are also just far more tokens to train on if you do multi-language. I'd guess only the most popular languages would even have enough training data to get a specialized version - but it would still be an interesting trade-off for certain use cases. Being able to run a local code assistant on a TypeScript-only project, for example, with a 32k context window would really come in handy for a lot of people. I don't know enough to understand the impact of vocab size vs context size.
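On vocab size vs context size, a back-of-envelope sketch may help. The memory cost of a long context is dominated by the key/value cache, and vocab size does not appear in that formula at all. The layer and head counts below are assumed, Llama-2-7B-like values, not numbers from this thread.

```python
# Assumed Llama-2-7B-like shape; these numbers are not from the thread.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16

def kv_cache_bytes(context_len, vocab_size=32_000):
    """Memory for cached keys and values at a given context length.
    vocab_size is accepted only to show that it does not enter the formula."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return context_len * per_token

for ctx in (4_096, 32_768):
    print(ctx, kv_cache_bytes(ctx) / 2**30, "GiB")
```

Vocab size matters for context only indirectly, through how many tokens a given file turns into. A smaller vocab generally means more tokens per file, not fewer, unless the remaining entries fit the target language better.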

thewataccount · 2y ago

It's worth noting that, from what I can tell, a model well trained in most languages would be able to learn the niche ones much more easily.

The vocab size of Llama 2 is 32,000. I personally don't think there's enough difference between programming languages to save any meaningful number of tokens, considering the size of the current vocab.
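For a sense of scale on what the vocab costs inside the model itself, here is a rough calculation. The hidden size and total parameter count are assumed Llama 2 7B figures, and the sketch assumes separate (untied) input and output matrices.

```python
VOCAB_SIZE = 32_000      # from the comment above
HIDDEN_SIZE = 4_096      # assumed: Llama 2 7B hidden size
TOTAL_PARAMS = 6.74e9    # assumed: approximate Llama 2 7B parameter count

def vocab_params(vocab_size, hidden_size, tied=False):
    """Parameters in the input embedding plus the output projection."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size

full = vocab_params(VOCAB_SIZE, HIDDEN_SIZE)
half = vocab_params(VOCAB_SIZE // 2, HIDDEN_SIZE)
print(f"full vocab: {full / 1e6:.0f}M params ({full / TOTAL_PARAMS:.1%} of model)")
print(f"half vocab saves: {(full - half) / 1e6:.0f}M params")
```

Under these assumptions the vocab-dependent matrices are a few percent of the model, so halving the vocab frees up relatively little.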

gsuuon (OP) · 2y ago

I wonder if you could train a model generally across a lot of languages, then specialize it for a specific one with a different tokenizer / limited vocabulary? Here's the reference I've been using for Llama 2 tokens:

https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4...

It looks like if you just limit it to English it'd cut the count almost by half - further limiting the vocab to a specific programming language could cut it down even more. Pure armchair theory-crafting on my part, no idea if limiting vocab is even a reasonable way to improve context handling. But it's an interesting idea - build on a base, then specialize as needed and let the user swap out the LLM on an as-needed basis (or the front-end tool could simply detect the language of the project). 3B or smaller models with very long context which excel at one specific thing could be really useful (e.g. a local code completer for English TypeScript projects).
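A minimal sketch of the "limit the vocab" idea, using a made-up toy vocabulary and ASCII-only as a crude stand-in for "English". A real vocab would be loaded from the tokenizer files, and the filter would need to be much more careful than this.

```python
def is_ascii_token(token):
    """True if every character in the token is plain ASCII."""
    return token.isascii()

def prune_vocab(vocab):
    """Keep ASCII-only tokens and assign them new contiguous ids."""
    kept = [tok for tok in vocab if is_ascii_token(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}

# Hypothetical toy vocabulary; a real one would be loaded from the tokenizer files.
toy_vocab = ["import", "function", "const", " =>", "für", "данные", "数据", "é"]
pruned = prune_vocab(toy_vocab)
print(len(toy_vocab), "->", len(pruned))
```

Note that the surviving tokens get new ids, so the model's embedding rows would have to be remapped to match, and the model would presumably need some fine-tuning afterwards.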
