If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
This is the type of question that should never, ever be asked of an LLM running on some A100 on the other side of the world. Local LLMs are already more than capable of answering these.
Sovereign SOTA models might also be possible with nation-state involvement. But this is a good stopgap.
There is no public website to use it, free or paid. The dataset is not public. The code is not public (the GitHub URL in the article returns 404). And the claimed model intelligence is so low that it is pretty much useless at 32K context and massively inferior to GPT‑4o.
As per tradition in Portugal, some people managed to get 5.5 million to produce nothing and no one is asking questions.
You want a better idea? Just fine-tune the open source Kimi 2.6 with an open source Portuguese dataset; the cost would be under a million and we would be getting something useful.
It would be really nice to know what happened to the 5.5 million, given that they can't even provide a functional website to use the model.
Similar waste.
> Just fine tune the open source Kimi 2.6 with an open source Portuguese dataset
I think that tokenizers of all popular models are heavily biased towards English or English and Mandarin.
And I don't think that it is possible to replace the tokenizer without full retraining.
```
Llama3
english, 0.216
portuguese, 0.285
italian, 0.287
greek, 0.592
```
```
Gemma4
english, 0.219
portuguese, 0.246
italian, 0.249
greek, 0.537
```
```
Kimi2.6
english, 0.214
portuguese, 0.310
italian, 0.308
greek, 0.716
```
Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).
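For anyone who wants to reproduce this kind of comparison: the numbers above aren't labeled, so this is only a sketch assuming the metric is something like tokens per character (lower is better). The greedy longest-match tokenizer, the tiny English-leaning `TOY_VOCAB`, and the two sample sentences are all made up for illustration and are not any real model's tokenizer.

```python
# Toy vocab skewed toward English, standing in for a real tokenizer's vocab.
TOY_VOCAB = {"the ", "cat ", "is ", "sleep", "ing", "o ", "est"}


def greedy_tokenize(text, vocab):
    """Greedy longest-match split; characters not covered become 1-char tokens."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def tokens_per_char(text, vocab):
    """Tokens emitted per input character (assumed metric, see lead-in)."""
    if not text:
        return 0.0
    return len(greedy_tokenize(text, vocab)) / len(text)


english = "the cat is sleeping"
portuguese = "o gato está a dormir"

for name, sample in [("english", english), ("portuguese", portuguese)]:
    print(f"{name}, {tokens_per_char(sample, TOY_VOCAB):.3f}")
```

With a real tokenizer you would swap `greedy_tokenize` for the model's own encode call and run it over a parallel corpus rather than one sentence per language.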
On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
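To make the "tokenizer extension" step concrete, here is a minimal sketch on a toy vocab and embedding table. Initializing each new token's embedding as the mean of the embeddings of the pieces it used to be split into is one common heuristic, assumed here for illustration; the function names and the toy data are hypothetical, and the continual pretraining that would follow is not shown.

```python
def split_with_vocab(text, vocab):
    """Greedy longest-match split of text into tokens already in vocab."""
    pieces, i = [], 0
    while i < len(text):
        for n in range(len(text) - i, 0, -1):
            if text[i:i + n] in vocab:
                pieces.append(text[i:i + n])
                i += n
                break
        else:
            raise KeyError(f"no token covers {text[i]!r}")
    return pieces


def extend_vocab(vocab, embeddings, new_tokens):
    """Append new tokens; each new row is the mean of its old sub-token rows."""
    for tok in new_tokens:
        if tok in vocab:
            continue
        # Rows of the pieces the *old* vocab splits this token into.
        rows = [embeddings[vocab[p]] for p in split_with_vocab(tok, vocab)]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        vocab[tok] = len(embeddings)  # next free id
        embeddings.append(mean)
    return vocab, embeddings


# Toy character-level vocab with 2-dimensional embeddings (made-up values).
vocab = {"n": 0, "ã": 1, "o": 2, "ç": 3}
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(split_with_vocab("não", vocab))   # three pieces before extension
extend_vocab(vocab, embeddings, ["não", "ção"])
print(split_with_vocab("não", vocab))   # a single piece after extension
```

In Hugging Face Transformers the corresponding steps are roughly `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, after which the new rows still need training on target-language text.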
So I think that continual pretraining for a large base model would have probably been fine for this case with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.
-----------------------
[1]: https://huggingface.co/datasets/goldfish-models/fish-food
All in all, I don't think that's a major issue here.
I don't personally feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.
That's easy to say when you're not on the other end of US defaultism.
https://en.wikipedia.org/wiki/List_of_languages_by_number_of...
that's not impressive
The grammar and vocabularies don't match, but I think the worst are the expressions. Both sides have *a lot* of expressions that vary per context and location.
https://simianwords.bearblog.dev/why-domain-specific-llms-wo...
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
And who knows what will happen to grammar?