I am working on making localization effortless via dev tools and a dedicated editor for translators. Both pillars have one common denominator: translations as data in source code. Treating translations as code would break that denominator and prevent a coherent end-to-end solution.
Take a look at the repository https://github.com/inlang/inlang. The IDE extension already solves type safety, inline annotations, and (partially) extraction of hardcoded strings.
The argument against IDs is reduced readability, something the IDE extension can solve.
Is there something else that bothers you in localization pipelines?
It's a big lift to extract all hardcoded strings for a future state where localization will be 'required', especially for large companies. There's no question non-technical teams need the ability to edit strings/translations, but if it means changing your infra or the way eng prefers to build, it's a tough argument.
We've been building https://www.flycode.com as a platform to make strings/translations and static assets (hardcoded or in resource files) editable by connecting existing repos.
MessageFormat is code. It's just not a very powerful language. And it knows nothing about rich text, just plain strings, which means that you have to deal with manual HTML decoding in the application, ensure all translations are actually producing valid HTML and absolutely not forget to encode all string params that could be user input.
Using tagged template literals and JSX in the translations avoids all those problems.
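A minimal sketch of what that looks like in practice. The names (`escapeHtml`, `html`) are invented for illustration, not from any library: the static parts of the template are trusted translator-written markup, while interpolated params are encoded automatically, so user input can't inject HTML.

```typescript
// Hypothetical tagged template literal that HTML-escapes interpolated
// params by default; the literal markup in the template stays as-is.
function escapeHtml(value: unknown): string {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function html(strings: TemplateStringsArray, ...params: unknown[]): string {
  // Static parts come from the translator (trusted); params do not.
  return strings.reduce(
    (out, part, i) =>
      out + part + (i < params.length ? escapeHtml(params[i]) : ""),
    ""
  );
}

// A translation as code: the <b> tag is literal, the param gets encoded.
const greeting = (name: string) => html`Welcome back, <b>${name}</b>!`;
```

With MessageFormat-style plain strings, the equivalent encoding step is manual and easy to forget; here it can't be skipped.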
What do you think about Mozilla's Fluent format/syntax https://projectfluent.org/?
BTW feel free to reach out via email to me. Look at my profile to find it.
With data / config, the translations are recorded in one place and all consumers can get the update without code changes.
The big thing I've been wondering / looking for is a shared, open source translation database. Anyone have links?
This is built into some of the top software translation platforms to "seed" the initial translation - a bulk kickstart that can optionally be refined later by human translators.
That the UI is not in English does not mean that a non-English person will be able to understand it and use it successfully.
You can only do it if you do not have any kind of support for those international users and if those users are not your real customers but merely statistics in the usage dashboard of a free product.
That's a neat idea. It'll be super useful for the 80% of cases where context isn't that important. But for the remaining 20%, the context where the translation will be used is as important as the word itself. So you can't always reuse the same translation in different contexts; it would sound unnatural.
Still, if there were an easy solution for switching between different options for a translation, a shared open-source translation database for projects to use would be very valuable.
Jokes aside, I don't hate the idea and am actually quite positive toward writing translations in code. I do question why you would need a new language for it, though - why not use an existing programming language?
As others pointed out here the biggest downside I can see is that it would be harder to outsource.
- Community provided translations are now a remote code execution vector, and can steal your passwords instead of merely displaying rude words. You should now audit all translations up front before manually merging, instead of merely, say, locking down a writeable-by-default wiki after your first abuse occurs.
- Translation code is unlikely to be given a nice stable semvered sandboxed API boundary. Less of an issue for in-house translation where translators are working against the same branch as everyone else, more of an issue when outsourcing translation - when you get a dump of translations weeks/months later referencing refactored APIs, some poor fellow will need to handle the integration manually.
- Hot reloading and error recovery is likely an afterthought at best, for similar reasons. Translation typos are now likely to break your entire build, not just individual translation strings.
- Translators must now reproduce your code's build environment to preview translations.
(Code-based translations may still make sense for some projects/organizations despite these drawbacks, but these are some of the reasons dedicated translation DSLs encoded as "data" can make sense for other projects/organizations)
2. Why? Keys with params will break no matter if it is in MessageFormat or TypeScript. At least with TypeScript you will know something is wrong and can comment out the problematic key in question.
3. And that is great! Bugs should break builds.
4. Well, that could happen. But you could also structure your localizations into a stand-alone subproject and then it would no longer be the case.
Plenty of non-OSS projects out there, using services like https://www.localizor.com/ or in-house equivalents, and no public VCS. OSS projects accepting translations via PR can also spend less time reviewing code, if they can just rubber stamp changes to translation data, instead of auditing changes to translation code from new contributors.
> Why? Keys with params will break no matter if it is in MessageFormat or TypeScript.
Generally in C++ projects I end up with roughly the following:
[rest of codebase] <----> [translation bindings] <----> [translation data]
Refactoring types (say, changing a field to a function) in "rest of codebase" will inadvertently cause changes to the translation bindings, but since that code is already remapping from C++ types/params to translation-specific types/params, the latter - and thus the translation data - is frequently unchanged. When you bypass this with code:
[rest of codebase] <----> [translation functions]
The lack of a binding layer means refactoring types in "rest of codebase" by definition refactors translation types as well - and thus translation data must change. JavaScript's dynamically typed, bag-of-string-keyed objects can subvert the need for and existence of a translation binding layer when blindly forwarded, so I suppose MessageFormat isn't a 100% win here either. And, in theory, you can have a translation binding layer without being data-driven; I'm just skeptical that people will bother to strictly enforce its usage.
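The binding-layer idea can be sketched in a few lines of TypeScript (all names here are invented for illustration): the translation side only ever sees flat, primitive params, so refactoring the app side touches the binding, not the translation data.

```typescript
// App side: suppose the codebase has a User type that gets refactored
// over time (e.g. fullName changed from a field to a method).
interface User {
  fullName(): string;
  unreadCount: number;
}

// Translation side: a stable, flat param shape the translations expect.
type InboxParams = { name: string; count: number };
const inboxSummary = ({ name, count }: InboxParams) =>
  `${name}, you have ${count} unread messages`;

// The binding layer: refactoring User changes only this mapping,
// leaving inboxSummary - and thus the translation data - untouched.
function bindInboxSummary(user: User): string {
  return inboxSummary({ name: user.fullName(), count: user.unreadCount });
}

const rendered = bindInboxSummary({
  fullName: () => "Ada Lovelace",
  unreadCount: 3,
});
```

Bypass the binding and call translations with app types directly, and every refactor ripples into the translation layer.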
> 3. And that is great! Bugs should break builds.
Not all bugs should break all builds on large-scale projects / in large-scale orgs. A typo or missing string in the French translation of Gmail should not break Google Search, even though Google is a monorepo. When you have thousands of employees, something will be broken by somebody at all times, and progress will crawl to a halt as everyone gets blocked by everyone else - even with CI, someone will have a bypass option, or will pass preflight but not full CI, or ...
Constantly alerting programmers about unactionable CI failures merely trains programmers to ignore CI. Broken translations should perhaps be surfaced to the localization team, and perhaps QA or a project manager who can escalate things if localization drops the ball - but proper fault isolation should avoid breaking everything for everyone, and instead limit the fault to those for whom the fault is actionable. A graphics or physics programmer in gamedev probably shouldn't be tasked with fixing French localization typos, at least by default.
This is especially true for localization - localization always lags behind the tip of development, and is arguably always broken/buggy except for the occasions when you pick a version to stabilize, wait for translations, and release. Why should some localization errors (missing strings) be handled through placeholders, yet others (bad syntax) break the build for programmers, when the same non-programmers (localizers, project managers) should generally be in charge of fixing both?
> 4. Well, that could happen. But you could also structure your localizations into a stand-alone subproject and then it would no longer be the case.
At the very least, translators must now reproduce the build environment of the localizations subproject (so, in the context of the original article's github repository as it currently stands, they'd need to install make + pnpm (+ tsc? or will pnpm auto-install tsc? will it auto-update, or will using new syntax require updating tsc for translators?)).
Worked well for my use case, but it still needs more progress to be fully featured across all supported programming languages; for example, I found some more advanced features missing in the Rust implementation. Really worth checking out.
Looks very similar to https://unicode-org.github.io/icu/
One extension I had to make, though, was extending Java's Properties files to preserve order and allow duplicate keys. Then they can be used to populate drop-down options, too.
For one thing, we get our translations by handing a yaml file to external contractors. They don't need to squint at a file full of code to distinguish the bits of English that need translating from the bits that don't – they just have to translate the right-hand side of every key, and there's specialized tooling to help them with this.
And for another, even in your toy example in the readme you've now lost a Single Source of Truth for certain presentation decisions. So now when some stakeholder comes to you and says they hate the italicization in the intro paragraph and to lose it ASAP, instead of taking the markup out of a common template that different data gets inserted into, you have to edit each language's version of the code to remove the markup (with all of the attendant ease of making errors that comes along when you lack a SPOT – easy to miss one language, etc). I'd expect these kinds of multiplication-of-edit problems to grow increasingly complex when you scale this approach beyond toy examples.
Basically this seems really hard to scale to large products, and doesn't play well with division of labour.
But I'm interested to hear how you would solve the presentation issues you mention. I absolutely think the right way is to have translations be HTML fragments. How else would you know what part of the sentence should be italic or contain a hyperlink?
You give them some context, and let them ask you questions if they feel things are too ambiguous for them to produce an accurate translation for the context it will be used in. In some cases we will include a screenshot of the rendered English page/component/etc so that the translator can map the key values they're seeing to the presentation context.
I can only tell you that this process has scaled to 10s of millions in sales in foreign languages, and that the translation services we use absolutely do not have any time or interest in signing additional NDAs around source code, in getting their employees set up with bespoke code and dev environments, etc. It would be a gigantic drag on their business model.
> I absolutely think the right way is to have translations be HTML fragments.
These translators do not know HTML and are not going to be able to work with it in any way – again, this would require the services to totally overhaul their business model, and spend a bunch of money/time on training or hiring more specialized translators with HTML/CSS skills, which they have no interest in doing.
It would also open up a threat model that's currently non-existent for us. Total non-starter.
> How else would you know what part of the sentence should be italic or contain a hyperlink?
Translation keys contain a simple substitutional form that can be replaced on key lookup, so
some.introductory.paragraph: Call to action: %{click_here}
a.fancy.link.name: Click to purchase!
in code: t('.some.introductory.paragraph', click_here: link(target_url, t('.a.fancy.link.name')))
The developer can inject formatting that way if necessary, although generally speaking this is a really rare use case in my experience: randomly italicizing or bolding or otherwise styling words in a paragraph looks fairly unprofessional and isn't typically done.

You can't have a single source of truth for presentation decisions in a multilingual product. Different languages have different typographic traditions, will demand different minimum container sizes based on word lengths, and - maybe this is shocking - they sometimes run in different directions. If you are not integrating the dev, design, and localized copy-editing roles on your team, your product is going to look like trash except where the primary language of the team is concerned.
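The `%{param}` substitution described above is simple enough to sketch in a few lines (the `catalog`, `t`, and `link` names are invented here; real i18n libraries differ): translators only ever see plain strings, and any markup is injected by the developer at lookup time.

```typescript
// Hypothetical flat translation catalog, as a translator would see it.
const catalog: Record<string, string> = {
  "some.introductory.paragraph": "Call to action: %{click_here}",
  "a.fancy.link.name": "Click to purchase!",
};

// Look up a key and substitute %{name} placeholders from params.
function t(key: string, params: Record<string, string> = {}): string {
  const template = catalog[key] ?? key;
  return template.replace(/%\{(\w+)\}/g, (_, name) => params[name] ?? "");
}

// Markup lives in developer-owned code, not in the translation data.
const link = (href: string, text: string) => `<a href="${href}">${text}</a>`;

const paragraph = t("some.introductory.paragraph", {
  click_here: link("/buy", t("a.fancy.link.name")),
});
```

The translation data stays markup-free, so a presentation change (dropping the link, say) touches one place in code rather than every language's strings.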
Translation can scale for large products, but localization cannot: until further notice, you can only do it the hard way, or the wrong way.
Maybe this is shocking but I'm fluent in a language that is sometimes written vertically.
"You can't have one single common presentation for every translation" is true in an absolute sense but often not true in practice – e.g. we hit most of Europe and North, Central, and South America with ~10 static translations rendered into one common presentational template, none of which run into any of the truly complex layout differences that right-to-left or vertical presentations would bring. We extensively QA all of the languages we do support, and presentation issues are truly pretty damn rare. It's your classic "80% of the result for 20% of the effort" tradeoff.
Now, if you truly do need to localize in every language under the sun then yeah, something like this can make sense, as it gives you maximum flexibility with respect to varying your layout alongside the translation.
But if you have any simpler use-case (eg. supporting just English, Spanish, French and Portuguese will give you an enormous chunk of the planet with minimal overhead, as they have very similar word lengths and presentation requirements) then the approach here is just taking on all of the effort and maintenance overhead of the maximally-complex case when you have absolutely no need to.
But I think code is, in general, something to be avoided when declarative approaches are available.
Declarative is easier for a computer to understand, it restricts the inputs to one domain the computer can deal with.
You don't get the same classes of bugs with declarative. You could even do things like double checking with machine translation and flagging anything that doesn't match for human review.
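One purely mechanical check that declarative data enables (a sketch; the `%{param}` placeholder syntax and function names are assumed for illustration): verify that every translation uses the same placeholders as the source string, and flag mismatches for human review.

```typescript
// Extract the set of %{name} placeholders from a translation string.
function placeholders(s: string): Set<string> {
  return new Set([...s.matchAll(/%\{(\w+)\}/g)].map((m) => m[1]));
}

// Flag every key whose translation's placeholders don't match the
// source's - a class of bug you can catch without running any code.
function flagMismatches(
  source: Record<string, string>,
  translation: Record<string, string>
): string[] {
  const flagged: string[] = [];
  for (const [key, src] of Object.entries(source)) {
    const a = placeholders(src);
    const b = placeholders(translation[key] ?? "");
    const same = a.size === b.size && [...a].every((p) => b.has(p));
    if (!same) flagged.push(key);
  }
  return flagged;
}
```

With translations as arbitrary code, this kind of static cross-check gets much harder - you'd be analyzing programs, not comparing strings.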
Plus, you don't need a programmer to do it. Security issues go away. You often achieve very good reuse with code only existing in one place without language variants.
I'm sure there are great uses for this, but I have trouble thinking of even a single case where I'd prefer code to data in general.
I'm arguing that each implemented language in a program should be able to define its own set of utility functions that makes sense for that particular language.
Also, since translations as code can return both plain strings and (for example) HTML fragments, security is instead increased, because encoding/decoding would no longer be an issue.
Of course, the problem is that implementations like this are actually stepping away from the very good NLG system we already have: human translators, who typically aren't coders. And the need for NLG hasn't gone away -- someone still has to hardcode these (parameterized) strings.
Currently, Mozilla Fluent seems like a good compromise. The type checking is maybe not as advanced, but it's designed to be compatible with the tools most often used in localization, so translators can handle all the data and organize the task. It's very straightforward to get generated localized strings to agree in number, tense, gender, and so on.
I like Fluent. It's just that ... when all the power of modern JavaScript/TypeScript (template literals/JSX/validation/custom functions) and code editors (syntax highlighting/JSDoc/references/usages) are already in place, why not use it instead of introducing a whole new layer of tooling?
For large projects it would still be possible to have a utility program export a translation to CSV/whatever and re-create a ts file from it when it comes back.
Any mistakes made by the translator would immediately show up when building the program or just visiting the file in a code editor.
What if you could get static type checking, key documentation and code completion right in VS Code?
And what if the translations could be generated using an actual programming language, and even represent HTML markup and not just plain strings?
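For the type-checking part, no new tooling is even needed - plain TypeScript already does it. A sketch (the `Messages` shape and locale names are invented for illustration):

```typescript
// One shared type describes every message key and its parameters.
type Messages = {
  greeting: (name: string) => string;
  itemCount: (n: number) => string;
};

const en: Messages = {
  greeting: (name) => `Hello, ${name}!`,
  itemCount: (n) => (n === 1 ? "1 item" : `${n} items`),
};

const sv: Messages = {
  greeting: (name) => `Hej, ${name}!`,
  // Deleting itemCount here, or changing its parameter type, would be
  // a compile-time error in every locale - not a runtime surprise.
  itemCount: (n) => (n === 1 ? "1 objekt" : `${n} objekt`),
};
```

VS Code then gives code completion on keys, go-to-references for usages, and JSDoc hover docs for free, since translations are just values.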
What would be useful is the ability to interactively see a systematic set of examples of what the templates one is editing evaluate to.
But since my translations are code, it basically means I would have to invoke a debugger on the full program, so that's a drawback.
On the other hand, since my translations are code I suppose I could just add something like a unit test or something.
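One way to get that "systematic set of examples" without a debugger - a tiny table-style preview that evaluates a template over representative inputs, which can also double as a unit test. All names here are illustrative:

```typescript
// A translation-as-code template to preview.
const itemCount = (n: number) => (n === 1 ? "1 item" : `${n} items`);

// Evaluate it over representative inputs and render a review table.
const samples = [0, 1, 2, 5];
const rendered = samples.map((n) => `${n} -> ${itemCount(n)}`);
// Reviewing (or snapshotting) this output previews the template's
// behavior without invoking a debugger on the full program.
```

Hooked into a watch-mode test runner, this gets close to the interactive preview described above.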
That fails pretty badly in two cases:
1) If significant changes to the English (or whatever) version need to be made, keeping the original text may be more confusing than useful.
2) When the native-language version is ambiguous in a way that doesn't apply to other languages, e.g. when translating to languages with grammatical gender, or when a single English word can be used in multiple unrelated ways.