I am working on making localization effortless via dev tools and a dedicated editor for translators. Both pillars have one common denominator: translations as data in source code. Treating translations as code would break that denominator and prevent a coherent end-to-end solution.
Take a look at the repository https://github.com/inlang/inlang. The IDE extension already solves type safety, inline annotations, and (partially) extraction of hardcoded strings.
The argument against IDs is reduced readability, something the IDE extension can solve.
Is there something else that bothers you in localization pipelines?
It's a big lift to extract all hardcoded strings for a future state where localization will be 'required', especially for large companies. There's no question non-technical teams need the ability to edit strings/translations, but if it means changing your infra or the way eng prefers to build, it's a tough argument.
We've been building https://www.flycode.com as a platform to make strings/translations and static assets (hardcoded or in resource files) editable by connecting existing repos.
MessageFormat is code. It's just not a very powerful language. And it knows nothing about rich text, just plain strings, which means that you have to deal with manual HTML decoding in the application, ensure all translations are actually producing valid HTML and absolutely not forget to encode all string params that could be user input.
Using tagged template literals and JSX in the translations avoids all those problems.
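A minimal sketch of what that looks like in practice. The names (`escapeHtml`, `html`) are invented for illustration, not from any library: the static parts of the template are trusted translator-written markup, while interpolated params are encoded automatically, so user input can't inject HTML.

```typescript
// Hypothetical tagged template literal that HTML-escapes interpolated
// params by default; the literal markup in the template stays as-is.
function escapeHtml(value: unknown): string {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function html(strings: TemplateStringsArray, ...params: unknown[]): string {
  // Static parts come from the translator (trusted); params do not.
  return strings.reduce(
    (out, part, i) =>
      out + part + (i < params.length ? escapeHtml(params[i]) : ""),
    ""
  );
}

// A translation as code: the <b> tag is literal, the param gets encoded.
const greeting = (name: string) => html`Welcome back, <b>${name}</b>!`;
```

With MessageFormat-style plain strings, the equivalent encoding step is manual and easy to forget; here it can't be skipped.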
What do you think about Mozilla's Fluent format/syntax https://projectfluent.org/?
BTW feel free to reach out via email to me. Look at my profile to find it.
With data / config, the translations are recorded in one place and all consumers can get the update without code changes.
The big thing I've been wondering / looking for is a shared, open source translation database. Anyone have links?
This is built into some of the top software translation platforms to "seed" the initial translation - a bulk kickstart that can optionally be refined later by human translators.
That the UI is not in English does not mean that a non-English person will be able to understand it and use it successfully.
You can only do it if you do not have any kind of support for those international users and if those users are not your real customers but merely statistics in the usage dashboard of a free product.
That's a neat idea. It'll be super useful for the 80% of cases where context isn't that important. But for the remaining 20%, the context where the translation will be used is as important as the word itself. So you can't always reuse the same translation in different contexts; it would sound unnatural.
Still, if there were an easy solution for switching between different options for a translation, a shared open-source translation database for projects to use would be very valuable.
Jokes aside, I don't hate the idea and am actually quite positive toward writing translations in code. I do question why you would need a new language for it, though - why not use an existing programming language?
As others pointed out here the biggest downside I can see is that it would be harder to outsource.
- Community provided translations are now a remote code execution vector, and can steal your passwords instead of merely displaying rude words. You should now audit all translations up front before manually merging, instead of merely, say, locking down a writeable-by-default wiki after your first abuse occurs.
- Translation code is unlikely to be given a nice stable semvered sandboxed API boundary. Less of an issue for in-house translation where translators are working against the same branch as everyone else, more of an issue when outsourcing translation - when you get a dump of translations weeks/months later referencing refactored APIs, some poor fellow will need to handle the integration manually.
- Hot reloading and error recovery is likely an afterthought at best, for similar reasons. Translation typos are now likely to break your entire build, not just individual translation strings.
- Translators must now reproduce your code's build environment to preview translations.
(Code-based translations may still make sense for some projects/organizations despite these drawbacks, but these are some of the reasons dedicated translation DSLs encoded as "data" can make sense for other projects/organizations)
2. Why? Keys with params will break no matter if it is in MessageFormat or TypeScript. At least with TypeScript you will know something is wrong and can comment out the problematic key in question.
3. And that is great! Bugs should break builds.
4. Well, that could happen. But you could also structure your localizations into a stand-alone subproject and then it would no longer be the case.
Plenty of non-OSS projects out there, using services like https://www.localizor.com/ or in-house equivalents, and no public VCS. OSS projects accepting translations via PR can also spend less time reviewing code, if they can just rubber stamp changes to translation data, instead of auditing changes to translation code from new contributors.
> Why? Keys with params will break no matter if it is in MessageFormat or TypeScript.
Generally in C++ projects I end up with roughly the following:
[rest of codebase] <----> [translation bindings] <----> [translation data]
Refactoring types (say, changing a field to a function) in "rest of codebase" will inadvertently cause changes to the translation bindings, but since that code is already remapping from C++ types/params to translation-specific types/params, the latter - and thus the translation data - is frequently unchanged. When you bypass this with code:
[rest of codebase] <----> [translation functions]
The lack of a binding layer means refactoring types in "rest of codebase" by definition refactors translation types as well - and thus translation data must change. JavaScript's dynamically typed, bag-of-string-keyed objects can subvert the need for and existence of a translation binding layer when blindly forwarded, so I suppose MessageFormat isn't a 100% win here either. And, in theory, you can have a translation binding layer without being data-driven; I'm just skeptical that people will bother to strictly enforce its usage.
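The binding-layer idea can be sketched in a few lines of TypeScript (all names here are invented for illustration): the translation side only ever sees flat, primitive params, so refactoring the app side touches the binding, not the translation data.

```typescript
// App side: suppose the codebase has a User type that gets refactored
// over time (e.g. fullName changed from a field to a method).
interface User {
  fullName(): string;
  unreadCount: number;
}

// Translation side: a stable, flat param shape the translations expect.
type InboxParams = { name: string; count: number };
const inboxSummary = ({ name, count }: InboxParams) =>
  `${name}, you have ${count} unread messages`;

// The binding layer: refactoring User changes only this mapping,
// leaving inboxSummary - and thus the translation data - untouched.
function bindInboxSummary(user: User): string {
  return inboxSummary({ name: user.fullName(), count: user.unreadCount });
}

const rendered = bindInboxSummary({
  fullName: () => "Ada Lovelace",
  unreadCount: 3,
});
```

Bypass the binding and call translations with app types directly, and every refactor ripples into the translation layer.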
> 3. And that is great! Bugs should break builds.
Not all bugs should break all builds on large-scale projects / in large-scale orgs. A typo or missing string in the French translation of Gmail should not break Google Search, even though Google is a monorepo. When you have thousands of employees, something will be broken by somebody at all times, and progress will crawl to a halt as everyone gets blocked by everyone else - even with CI, someone will have a bypass option, or will pass preflight but not full CI, or ...
Constantly alerting programmers about unactionable CI failures merely trains programmers to ignore CI. Broken translations should perhaps be surfaced to the localization team, and perhaps QA or a project manager who can escalate things if localization drops the ball - but proper fault isolation should avoid breaking everything for everyone, and instead limit the fault to those for whom the fault is actionable. A graphics or physics programmer in gamedev probably shouldn't be tasked with fixing French localization typos, at least by default.
This is especially true for localization - localization always lags behind the tip of development, and is arguably always broken/buggy except for the occasions when you pick a version to stabilize, wait for translations, and release. Why should some localization errors (missing strings) be handled through placeholders, yet others (bad syntax) break the build for programmers, when the same non-programmers (localizers, project managers) should generally be in charge of fixing both?
> 4. Well, that could happen. But you could also structure your localizations into a stand-alone subproject and then it would no longer be the case.
At the very least, translators must now reproduce the build environment of the localizations subproject (so, in the context of the original article's github repository as it currently stands, they'd need to install make + pnpm (+ tsc? or will pnpm auto-install tsc? will it auto-update, or will using new syntax require updating tsc for translators?)).
Worked well for my use case, but it still needs more progress to be fully featured across all supported programming languages; for example, I found some more advanced features missing in the Rust implementation. Really worth checking out.
Looks very similar to https://unicode-org.github.io/icu/
One extension I had to make, though, was extending Java's Properties files to preserve order and allow duplicate keys. Then they can be used to populate drop-down options, too.
For one thing, we get our translations by handing a yaml file to external contractors. They don't need to squint at a file full of code to distinguish the bits of English that need translating from the bits that don't – they just have to translate the right-hand side of every key, and there's specialized tooling to help them with this.
And for another, even in your toy example in the readme you've now lost a Single Source of Truth for certain presentation decisions. So now when some stakeholder comes to you and says they hate the italicization in the intro paragraph and to lose it ASAP, instead of taking the markup out of a common template that different data gets inserted into, you have to edit each language's version of the code to remove the markup (with all of the attendant ease of making errors that comes along when you lack a SPOT – easy to miss one language, etc). I'd expect these kinds of multiplication-of-edit problems to grow increasingly complex when you scale this approach beyond toy examples.
Basically this seems really hard to scale to large products, and doesn't play well with division of labour.
But I'm interested to hear how you would solve the presentation issues you mention. I absolutely think the right way is to have translations be HTML fragments. How else would you know what part of the sentence should be italic or contain a hyperlink?
You give them some context, and let them ask you questions if they feel things are too ambiguous for them to produce an accurate translation for the context it will be used in. In some cases we will include a screenshot of the rendered English page/component/etc so that the translator can map the key values they're seeing to the presentation context.
I can only tell you that this process has scaled to 10s of millions in sales in foreign languages, and that the translation services we use absolutely do not have any time or interest in signing additional NDAs around source code, in getting their employees set up with bespoke code and dev environments, etc. It would be a gigantic drag on their business model.
> I absolutely think the right way is to have translations be HTML fragments.
These translators do not know HTML and are not going to be able to work with it in any way – again, this would require the services to totally overhaul their business model, and spend a bunch of money/time on training or hiring more specialized translators with HTML/CSS skills, which they have no interest in doing.
It would also open up a threat model that's currently non-existent for us. Total non-starter.
> How else would you know what part of the sentence should be italic or contain a hyperlink?
Translation keys contain a simple substitutional form that can be replaced on key lookup, so
some.introductory.paragraph: Call to action: %{click_here}
a.fancy.link.name: Click to purchase!
in code: t('.some.introductory.paragraph', click_here: link(target_url, t('.a.fancy.link.name')))
The developer can inject formatting that way if necessary, although generally speaking this is a really rare use case in my experience: randomly italicizing or bolding or otherwise styling words in a paragraph looks fairly unprofessional and isn't typically done.

You can't have a single source of truth for presentation decisions in a multilingual product. Different languages have different typographic traditions, will demand different minimum container sizes based on word lengths, and - maybe this is shocking - they sometimes run in different directions. If you are not integrating the dev, design, and localized copy-editing roles on your team, your product is going to look like trash except where the primary language of the team is concerned.
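The `%{param}` substitution described above is simple enough to sketch in a few lines (the `catalog`, `t`, and `link` names are invented here; real i18n libraries differ): translators only ever see plain strings, and any markup is injected by the developer at lookup time.

```typescript
// Hypothetical flat translation catalog, as a translator would see it.
const catalog: Record<string, string> = {
  "some.introductory.paragraph": "Call to action: %{click_here}",
  "a.fancy.link.name": "Click to purchase!",
};

// Look up a key and substitute %{name} placeholders from params.
function t(key: string, params: Record<string, string> = {}): string {
  const template = catalog[key] ?? key;
  return template.replace(/%\{(\w+)\}/g, (_, name) => params[name] ?? "");
}

// Markup lives in developer-owned code, not in the translation data.
const link = (href: string, text: string) => `<a href="${href}">${text}</a>`;

const paragraph = t("some.introductory.paragraph", {
  click_here: link("/buy", t("a.fancy.link.name")),
});
```

The translation data stays markup-free, so a presentation change (dropping the link, say) touches one place in code rather than every language's strings.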
Translation can scale for large products, but localization cannot: until further notice, you can only do it the hard way, or the wrong way.
Maybe this is shocking but I'm fluent in a language that is sometimes written vertically.
"You can't have one single common presentation for every translation" is true in an absolute sense but often not true in practice – e.g. we hit most of Europe and North, Central, and South America with ~10 static translations rendered into one common presentational template, none of which run into any of the truly complex layout differences that right-to-left or vertical presentations would bring. We extensively QA all of the languages we do support, and presentation issues are truly pretty damn rare. It's your classic "80% of the result for 20% of the effort" tradeoff.
Now, if you truly do need to localize in every language under the sun then yeah, something like this can make sense, as it gives you maximum flexibility with respect to varying your layout alongside the translation.
But if you have any simpler use-case (eg. supporting just English, Spanish, French and Portuguese will give you an enormous chunk of the planet with minimal overhead, as they have very similar word lengths and presentation requirements) then the approach here is just taking on all of the effort and maintenance overhead of the maximally-complex case when you have absolutely no need to.
But I think code is, in general, something to be avoided when declarative approaches are available.
Declarative is easier for a computer to understand, it restricts the inputs to one domain the computer can deal with.
You don't get the same classes of bugs with declarative. You could even do things like double checking with machine translation and flagging anything that doesn't match for human review.
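One purely mechanical check that declarative data enables (a sketch; the `%{param}` placeholder syntax and function names are assumed for illustration): verify that every translation uses the same placeholders as the source string, and flag mismatches for human review.

```typescript
// Extract the set of %{name} placeholders from a translation string.
function placeholders(s: string): Set<string> {
  return new Set([...s.matchAll(/%\{(\w+)\}/g)].map((m) => m[1]));
}

// Flag every key whose translation's placeholders don't match the
// source's - a class of bug you can catch without running any code.
function flagMismatches(
  source: Record<string, string>,
  translation: Record<string, string>
): string[] {
  const flagged: string[] = [];
  for (const [key, src] of Object.entries(source)) {
    const a = placeholders(src);
    const b = placeholders(translation[key] ?? "");
    const same = a.size === b.size && [...a].every((p) => b.has(p));
    if (!same) flagged.push(key);
  }
  return flagged;
}
```

With translations as arbitrary code, this kind of static cross-check gets much harder - you'd be analyzing programs, not comparing strings.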
Plus, you don't need a programmer to do it. Security issues go away. You often achieve very good reuse with code only existing in one place without language variants.
I'm sure there are great uses for this, but I have trouble thinking of even a single case where I'd prefer code to data in general.
I'm arguing that each implemented language in a program should be able to define its own set of utility functions that makes sense for that particular language.
Also, since translations as code can return both plain strings and (for example) HTML fragments, security is instead increased, because encoding/decoding would no longer be an issue.
Of course, the problem is that implementations like this are actually stepping away from the very good NLG system we already have: human translators, who typically aren't coders. And the need for NLG hasn't gone away -- someone still has to hardcode these (parameterized) strings.
Currently, Mozilla Fluent seems like a good compromise. The type checking is maybe not as advanced, but it's designed to be compatible with the tools most often used in localization, so translators can handle all the data and organize the task. It's very straightforward to get generated localized strings to agree in number, tense, gender, and so on.
I like Fluent. It's just that ... when all the power of modern JavaScript/TypeScript (template literals/JSX/validation/custom functions) and code editors (syntax highlighting/JSDoc/references/usages) are already in place, why not use it instead of introducing a whole new layer of tooling?
For large projects it would still be possible to have a utility program export a translation to CSV/whatever and re-create a ts file from it when it comes back.
Any mistakes made by the translator would immediately show up when building the program or just visiting the file in a code editor.
What if you could get static type checking, key documentation and code completion right in VS Code?
And what if the translations could be generated using an actual programming language, and even represent HTML markup and not just plain strings?
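For the type-checking part, no new tooling is even needed - plain TypeScript already does it. A sketch (the `Messages` shape and locale names are invented for illustration):

```typescript
// One shared type describes every message key and its parameters.
type Messages = {
  greeting: (name: string) => string;
  itemCount: (n: number) => string;
};

const en: Messages = {
  greeting: (name) => `Hello, ${name}!`,
  itemCount: (n) => (n === 1 ? "1 item" : `${n} items`),
};

const sv: Messages = {
  greeting: (name) => `Hej, ${name}!`,
  // Deleting itemCount here, or changing its parameter type, would be
  // a compile-time error in every locale - not a runtime surprise.
  itemCount: (n) => (n === 1 ? "1 objekt" : `${n} objekt`),
};
```

VS Code then gives code completion on keys, go-to-references for usages, and JSDoc hover docs for free, since translations are just values.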
What would be useful is the ability to interactively see a systematic set of examples of what the templates one is editing evaluate to.
But since my translations are code, it basically means I would have to invoke a debugger on the full program, so that's a drawback.
On the other hand, since my translations are code I suppose I could just add something like a unit test or something.
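One way to get that "systematic set of examples" without a debugger - a tiny table-style preview that evaluates a template over representative inputs, which can also double as a unit test. All names here are illustrative:

```typescript
// A translation-as-code template to preview.
const itemCount = (n: number) => (n === 1 ? "1 item" : `${n} items`);

// Evaluate it over representative inputs and render a review table.
const samples = [0, 1, 2, 5];
const rendered = samples.map((n) => `${n} -> ${itemCount(n)}`);
// Reviewing (or snapshotting) this output previews the template's
// behavior without invoking a debugger on the full program.
```

Hooked into a watch-mode test runner, this gets close to the interactive preview described above.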
That fails pretty badly in two cases:
1) If significant changes to the English (or whatever) version need to be made, keeping the original text may be more confusing than useful.
2) When the native-language version is ambiguous in a way that doesn't apply to other languages, e.g. when translating to languages with grammatical gender, or when a single English word can be used in multiple unrelated ways.