AMÁLIA and the future of European Portuguese LLMs

(duarteocarmo.com)

136 points · johnbarron · 17d ago · 68 comments

pu_pe 14d ago

I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (ie who was the president between X and Y). Similarly, they are a little too lightweight to be used for translations too.

If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.

iugtmkbdfil834 14d ago

I agree, the research is complex enough as is without having to worry about splitting it babel-like into multiple languages.

dudefeliciano 13d ago

> who was the president between X and Y

This is the type of question that should never ever be sent to an LLM running on some A100 on the other side of the world; local LLMs are already more than capable of answering these.

dyauspitr 14d ago

Yeah, I think India is going the better route with Sarvam, which is trained from scratch and still relatively cheap.

TheMagicHorsey 14d ago

This is the way.

Sovereign SOTA models might also be possible with nation-state involvement. But this is a good stopgap.

mariopt 14d ago

This model is a waste of Public Funds.

There is no public website to use it, free or paid; the dataset is not public; the code is not public (the GitHub URL in the article returns 404); and the claimed model intelligence is so low that it is pretty much useless at 32K context and massively inferior to GPT‑4o.

As per tradition in Portugal, some people managed to get 5.5 million to produce nothing, and no one is asking questions.

You want a better idea? Just fine tune the open source Kimi 2.6 with an open source Portuguese dataset; the cost would be under a million and we would be getting something useful.

It would be really nice to know what happened to the 5.5 million, given that they cannot even provide a functional website to use the model.

upupupandaway 14d ago

As a pt-BR speaker from across the pond: https://soberania.ai/

Similar waste.

avdelazeri 13d ago

Given that their publication says the dataset is freely available on Huggingface that's at least something ig

gverrilla 13d ago

Why?

dr_dshiv 14d ago

It’s a way to suck all the money out of the room in the name of nationalism — and it’s all over Europe. Only idea everyone has had.

vova_hn2 14d ago

I'm not arguing with the rest of your points, but...

> Just fine tune the open source Kimi 2.6 with an open source Portuguese dataset

I think that tokenizers of all popular models are heavily biased towards English or English and Mandarin.

And I don't think that it is possible to replace the tokenizer without full retraining.

mcyc 14d ago

You are right about most tokenizers being heavily biased towards English, but the situation is not so bad for Portuguese. Here are some results on the Goldfish corpus [1] with a few different tokenizers. This measures #subwords in tokenized corpus / #characters in corpus, so lower means the tokenizer compresses the text better.

```
Llama3
english, 0.216
portuguese, 0.285
italian, 0.287
greek, 0.592

Gemma4
english, 0.219
portuguese, 0.246
italian, 0.249
greek, 0.537

Kimi2.6
english, 0.214
portuguese, 0.310
italian, 0.308
greek, 0.716
```

Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).
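
For readers who want to reproduce this kind of number, here is a minimal sketch of the ratio, assuming it is total subwords divided by total characters (which is what values between 0.2 and 0.7 imply). `toy_tokenize` is a hypothetical stand-in, not any of the tokenizers listed above, and the two sample sentences are made up for illustration.

```python
from typing import Callable, Iterable, List


def subwords_per_character(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> float:
    """Return total subword count divided by total character count.

    Lower values mean fewer tokens are needed for the same text.
    """
    n_chars = 0
    n_subwords = 0
    for text in texts:
        n_chars += len(text)
        n_subwords += len(tokenize(text))
    if n_chars == 0:
        raise ValueError("corpus is empty")
    return n_subwords / n_chars


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: split on whitespace, then cut each word
    into chunks of at most max_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


# Made-up sample corpus, only to exercise the function.
sample = ["o gato dorme", "a casa"]
ratio = subwords_per_character(sample, toy_tokenize)
print(f"{ratio:.3f}")
```

With a real tokenizer, the `tokenize` argument could be something like the `tokenize` method of a Hugging Face tokenizer; whether the figures above were produced that way is not stated here.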

On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
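
A toy sketch of the tokenizer-extension step, under stated assumptions: the vocabulary, the greedy longest-match segmenter, and the 2-dimensional embeddings are all hypothetical, and initializing a new token's embedding as the mean of its old subword embeddings is one common heuristic, not something the cited work is claimed to use. The continual pretraining that would follow is not shown.

```python
from typing import Dict, List


def greedy_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Longest-match-first segmentation against vocab (toy stand-in for BPE)."""
    pieces: List[str] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise KeyError(f"no vocab entry covers {text[i]!r}")
    return pieces


def extend_vocab(
    vocab: Dict[str, int], embeddings: List[List[float]], new_tokens: List[str]
) -> None:
    """Add new_tokens in place. Each new embedding row is the mean of the
    rows of the pieces the token was split into before being added."""
    for token in new_tokens:
        if token in vocab:
            continue
        old_pieces = greedy_tokenize(token, vocab)
        rows = [embeddings[vocab[p]] for p in old_pieces]
        mean_row = [sum(col) / len(rows) for col in zip(*rows)]
        vocab[token] = len(embeddings)
        embeddings.append(mean_row)


# Hypothetical tiny vocabulary and embedding table.
vocab = {"s": 0, "a": 1, "u": 2, "d": 3, "e": 4, "sa": 5}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0], [4.0, 0.0], [1.0, 1.0]]

before = greedy_tokenize("saudade", vocab)   # six pieces with the old vocab
extend_vocab(vocab, embeddings, ["saudade"])
after = greedy_tokenize("saudade", vocab)    # one piece after extension
print(len(before), len(after))
```

The point of the mean initialization is that the new token starts close to what the model already "meant" by the old pieces, so the additional training only has to refine it rather than learn it from a random vector.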

So I think that continual pretraining for a large base model would have probably been fine for this case with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.

-----------------------

[1]: https://huggingface.co/datasets/goldfish-models/fish-food

[2]: https://arxiv.org/abs/2404.17790

alexaholic 14d ago

The Amália model is not yet publicly available. Until it's ready, one can fool around with Anália at https://analia.pt

ncruces 13d ago

That name change…

pelf 13d ago

I just died.

swiftcoder 14d ago

It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.

embedding-shape 14d ago

I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world; I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like the text from one cannot be used to train a model for the other, or vice versa.

All in all, I don't think that's a major issue here.

swiftcoder 14d ago

The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (anymore than it is productive for Brits to be prickly about the meteoric rise of US English)

madaxe_again 14d ago

Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.

Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their concents to retreat to if they so choose.

philipwhiuk 14d ago

> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (anymore than it is productive for Brits to be prickly about the meteoric rise of US English).

That's easy to say when you're not on the other end of US defaultism.

mghackerlady 14d ago

Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.

KK7NIL 14d ago

The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.

embedding-shape 14d ago

Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more available training data.

madaxe_again 14d ago

Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best.

embedding-shape 14d ago

I agree, they're not the same. But they're far closer than languages that don't come from the same family.

fy20 14d ago

European Portuguese is the 13th most populous language in Europe. Not that small, there are many other European languages in use that are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

depaulagu 14d ago

> European Portuguese is the 13th most populous language in Europe

that's not impressive

r2ob 14d ago

"This model is a waste of Public Funds". There is no "public funds", this is a waste of money from the tax payers.

drivebyhooting 14d ago

I’ve noticed that ChatGPT is noticeably dumber in languages other than English. It even will confidently repeat common but wrong superstitions from the target language as if they were fact.

mt_ 14d ago

5 million for a llama-2 finetune, how is that impressive?

algoth1 14d ago

Wouldn't it be easier to fine tune a model to convert the Brazilian Portuguese corpus into European Portuguese and then use that corpus?

kinow 13d ago

That idea is different from what most of the other comments here are talking about.

The grammar and vocabularies don't match, but I think the worst are the expressions. Both sides have *a lot* of expressions that vary per context and location.

simianwords 14d ago

Domain specific models will never be a thing. You don't get generalised intelligence with that.

https://simianwords.bearblog.dev/why-domain-specific-llms-wo...

hartator 14d ago

What a waste of time and money.

Trying to force a LLM into a specific language makes you missed out on most of the world knowledge.

embedding-shape 14d ago

What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the same one the creators speak.

Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.

Miraste 14d ago

To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.

numpad0 14d ago

yeah, they seem all confined to being an American-consultant-Chinese-authoritarian split personality with broad second language capabilities. I suppose they become too incoherent otherwise.

cess11 14d ago

E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish and Finnish.

It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.

KK7NIL 14d ago

This is how Europe thinks they can catch up on tech, by having the government fund vanity projects which will be made obsolete by more general techniques in 6 months.

mistrial9 14d ago

> makes you missed out on most of the world knowledge

and, who knows what will happen to grammar ?
