How we made Typerighter, the Guardian’s style guide checker (opens in new tab)

(theguardian.com)

182 pointscamillovisini5y ago33 comments

33 comments

For those unaware "Grauniad" is a decades-old nickname for The Guardian, used particularly by satirical mag Private Eye in reference to its reputation at one time for typos and the like.

https://wordhistories.net/2017/06/05/origin-of-grauniad/

triceratops5y ago

> reputation at one time for typos and the like

Well-earned into the present day. I regularly see typos even today.

foldr5y ago

I was recently moved to write to them on encountering the following completely unparsable sentence:

"People of ethnic group membership can change over time and with age."

They fixed it eventually (though I doubt it had anything to do with my comment).

There was also the following opinion piece, which still makes the utterly absurd claim that Jesse Jackson campaigned for the capitalization of 'African American':

https://www.theguardian.com/commentisfree/2020/oct/21/black-...

(I hope the fact that both of my examples happen to relate somewhat to race doesn't make me sound like an alt right troll. I'm sympathetic to the article. It just seems that there was a major fact checking or editing fail.)

ggm5y ago

name me a tier-1 paper without typos. the point about the graun, is how entertaining they could be.

triceratops5y ago

Honestly I don't see that many typos in other papers; grammatical issues, sure. But it would appear that not knowing what the squiggly red underlined words are is a requirement of being a Graun writer or editor.

2 more replies

mvidal015y ago

I bet some of those those typos came about because the type was hand set.

cormullion5y ago

not sure "hand" would be true, though:

https://www.theguardian.com/gnm-archive/gallery/2016/nov/18/...

modernerd5y ago

When I saw “13,000 regexes” I thought of the adage, “the plural of regex is regrets”.

But here it seems like a good choice to build on a battle-tested library of regrets, and it's clearly working well for them.

The demo looks slicker than the typical Grammarly/MS Word/native macOS grammar and spelling corrections, for those who missed it: https://www.youtube.com/watch?v=Yl0nb94N98k&feature=emb_imp_...

And the ability to flag false positives, send suggestions back, and see metrics of how the system's being used is just awesome.

jawns5y ago

As a college student (nearly 20 years ago) I built a tool for our student newspaper that caught frequent violations of the Associated Press style guide. Later, working as a newsroom copy editor, I was shocked at how few tools there were available to enforce style. Really awesome to see this Typerighter tool do it right.

teach5y ago

This is wonderful. I love to see technology enhancing experts' ability to do what they already do, but faster/more accurately.

Also, I'm a big fan of regex. I think -- probably thanks to jwz's famous quote -- a lot of younger programmers avoid them but they're fantastic for MATCHING. Using them in a Google sheet is a killer MVP to prove out something like this.

smt885y ago

I'm good at reading/writing regex and use them a lot, but I always worry about their maintainability. They're a common source of hard-to-pinpoint bugs.

I suppose I still use them because I don't know of a better way to do things.

js_herbert5y ago

We were amazed at how far we were able to get with them – if solving a problem with a regular expression produces two problems, we should now have 13,000 problems. The fact that they worked so well is due to the work of the subeditor who compiled (and still maintains!) the rule corpus – as well as the sheer volume, there are quite a few carefully ordered rules. Because style guide matches are reasonably sparsely found in content, and usually reasonably specific as to what matches (even if it's difficult to produce a correction) it turned out to be a surprisingly tractable problem to produce something useful with regular expressions alone – but we'd never have discovered that was the case unless someone had spent literally years doing it!

General maintainability is a priority, and we'd like to improve our rule management tooling to make the process of rule maintance generally accessible to editorial staff. We're also working on making noisy rules match more specifically, which usually involves migrating the initial regex into Languagetool for e.g. pattern-matching on part-of-speech.

chimprich5y ago

Whoever came up with the name "Typerighter" for this project should feel very pleased with themselves.

kimburgess5y ago

Relying on purely on regex misses so much context available from a document. I've been working on some tooling [1] in this space recently and a core epiphany was noting you can model written language as an AST and then reason about it in this form rather than opaque blocks of text (or flat, sequential text fragments as with Typerighter). An even better realisation was that others had already noted this too and built a mature ecosystem based on this concept [2].

[1]: https://github.com/place-labs/orthograph-err

[2]: https://textlint.github.io/

js_herbert5y ago

This is definitely true – in this sense, our initial corpus of regexes are the booster stage for this project, in that they enabled us to produce something useful for journalists in a reasonable timeframe. Typerighter's built as a platform for matching text, so we're not tied to regex – at the moment, we're migrating many rules to LanguageTool, which is a part of our pool of matchers and has a more sophisticated set of NLP tools. (And a great project – thanks LT maintainers!)

Thanks sharing these projects, other suggestions are very welcome – we'd be interested in adding new matchers based on different tech if they were a good fit for the use case.

RobAley5y ago

Will you (are you) contributing any of the rules back to TL? Or are they to specific to your org?

js_herbert5y ago

Taking a look at the corpus, the rules we have currently migrated are very specific to our style guide, and we'd likely be unable to contribute large chunks of the corpus for IP reasons. But this certainly seems possible for more general grammar or style corrections if there was a need – although LT's lists of rules are already quite comprehensive!

danpalmer5y ago

I had this same thought, but I wonder if it really matters for this use-case. The rules are actually quite simple much of the time – they're spelling and stylistic corrections.

I suspect the biggest problem with using regexes is over-suggestion, trying to correct American English spellings in a quote for example, but I suspect this is a pretty good balance of features, usability, and correctness.

One issue that comes with more complex systems like you mention is that the bugs become more complex. I'd imagine it's fairly easy for a journalist using this tool to know why an incorrect suggestion has been made, and that makes it easy for them to disregard it. While the error rate may improve with more complex analysis, those errors that do still happen are likely to be less understandable.

dtrizzle5y ago

This project reminded me a proselint, which appears to be a similar style checker. Sadly, that project appears to have been inactive for at least three years.

https://github.com/amperser/proselint

mrkwse5y ago

I saw a journalist share Typerighter on Twitter and was intrigued, so I'm looking forward to reading this.

It's a bit surprising that the engineering blog appears to be embedded in the main site, though. I've worked at a news org in the past (admittedly much larger) and the engineering/meta blogs were entirely separated from the main news section. Obviously it doesn't make sense to reinvent your stack, but I'm surprised the surrounding site scaffolding isn't at least distinct to show this isn't primary news output.

seanwilson5y ago

Is there a link to a list of the style rules their checker tests for?

I've always felt automated checks + fixes for grammar and style are miles behind where they should be by now. Checking over and over e.g. long emails for problems before you send them is super time consuming, and that's not even considering help with tone and the overall message.

motohagiography5y ago

Funny, though I am unshocked that they have figured out a way to automatically generate cant.

What does make it interesting is if it were applied as a GPT-2/3 module, and let loose as a reddit comment bot to train a model for engagement and provocation. Editors are essentially model supervisors, and if the object is to provoke and flatter people to sell advertising, it seems more like a compute problem to distill this process into a business.

Human writers creating organic content aren't really necessary for that, and very soon we should be able to generate content and then attribute it to loyal personalities that we stand up as minor celebrities, not unlike the old Hollywood studio system from the early 20th century, where talent was well kept, but still very much kept.

lbill5y ago

Damn! That's the kind of thing my company could sell to its clients. And that's exactly what I'm going to tell my colleagues! Thanks a lot

melling5y ago

“ The rule application service is written in Scala, a common choice for Guardian backends”

They even have a snippet of Scala code. I feel like HN must be the target audience

lindig5y ago

Software in the 21st century: to check a bunch of regex on a text you need: Grafana, APIs, services. Really? I'm surprised there is no k8s in here. /s

duckmysick5y ago

I'll bite. How would you solve the following with the 20th century software without any API or services:

- regex rules are updated frequently (let's say weekly)

- the updates are available to hundreds if not thousands of users in different locations

- all of them have the latest ruleset

- all of them capable of sending feedback regarding how useful and correct the suggestions are

- said feedback is analyzed regularly and used to refine the ruleset

jamessb5y ago

I think the "20th century" solution to letting many users in different locations run regexes would probably have been to write a cgi-bin script in Perl, and have users paste text into a box on an HTML form.

The results page generated by the script could have checkboxes to mark each suggestion as useful/not-useful/incorrect and a submit button, with this feedback saved in MySQL.

(I'm not sure whether this qualifies as "without ... services")

jopsen5y ago

> Software in the 21st century

Early 21s century -- hopefully there is more to come :D

ggm5y ago

Grauniad and the fflong riots...

j / k navigate · click thread line to collapse

33 comments

mellosouls5y ago

For those unaware "Grauniad" is a decades-old nickname for The Guardian, used particularly by satirical mag Private Eye in reference to its reputation at one time for typos and the like.

https://wordhistories.net/2017/06/05/origin-of-grauniad/

triceratops5y ago

> reputation at one time for typos and the like

Well-earned into the present day. I regularly see typos even today.

foldr5y ago

I was recently moved to write to them on encountering the following completely unparsable sentence:

"People of ethnic group membership can change over time and with age."

They fixed it eventually (though I doubt it had anything to do with my comment).

There was also the following opinion piece, which still makes the utterly absurd claim that Jesse Jackson campaigned for the capitalization of 'African American':

https://www.theguardian.com/commentisfree/2020/oct/21/black-...

ggm5y ago

name me a tier-1 paper without typos. the point about the graun, is how entertaining they could be.

triceratops5y ago

2 more replies

mvidal015y ago

I bet some of those those typos came about because the type was hand set.

cormullion5y ago

not sure "hand" would be true, though:

https://www.theguardian.com/gnm-archive/gallery/2016/nov/18/...

modernerd5y ago

When I saw “13,000 regexes” I thought of the adage, “the plural of regex is regrets”.

But here it seems like a good choice to build on a battle-tested library of regrets, and it's clearly working well for them.

The demo looks slicker than the typical Grammarly/MS Word/native macOS grammar and spelling corrections, for those who missed it: https://www.youtube.com/watch?v=Yl0nb94N98k&feature=emb_imp_...

And the ability to flag false positives, send suggestions back, and see metrics of how the system's being used is just awesome.

jawns5y ago

teach5y ago

This is wonderful. I love to see technology enhancing experts' ability to do what they already do, but faster/more accurately.

smt885y ago

I'm good at reading/writing regex and use them a lot, but I always worry about their maintainability. They're a common source of hard-to-pinpoint bugs.

I suppose I still use them because I don't know of a better way to do things.

js_herbert5y ago

chimprich5y ago

Whoever came up with the name "Typerighter" for this project should feel very pleased with themselves.

kimburgess5y ago

[1]: https://github.com/place-labs/orthograph-err

[2]: https://textlint.github.io/

js_herbert5y ago

Thanks sharing these projects, other suggestions are very welcome – we'd be interested in adding new matchers based on different tech if they were a good fit for the use case.

RobAley5y ago

Will you (are you) contributing any of the rules back to TL? Or are they to specific to your org?

js_herbert5y ago

danpalmer5y ago

I had this same thought, but I wonder if it really matters for this use-case. The rules are actually quite simple much of the time – they're spelling and stylistic corrections.

dtrizzle5y ago

This project reminded me a proselint, which appears to be a similar style checker. Sadly, that project appears to have been inactive for at least three years.

https://github.com/amperser/proselint

mrkwse5y ago

I saw a journalist share Typerighter on Twitter and was intrigued, so I'm looking forward to reading this.

seanwilson5y ago

Is there a link to a list of the style rules their checker tests for?

motohagiography5y ago

Funny, though I am unshocked that they have figured out a way to automatically generate cant.

lbill5y ago

Damn! That's the kind of thing my company could sell to its clients. And that's exactly what I'm going to tell my colleagues! Thanks a lot

melling5y ago

“ The rule application service is written in Scala, a common choice for Guardian backends”

They even have a snippet of Scala code. I feel like HN must be the target audience

lindig5y ago

Software in the 21st century: to check a bunch of regex on a text you need: Grafana, APIs, services. Really? I'm surprised there is no k8s in here. /s

duckmysick5y ago

I'll bite. How would you solve the following with the 20th century software without any API or services:

- regex rules are updated frequently (let's say weekly)

- the updates are available to hundreds if not thousands of users in different locations

- all of them have the latest ruleset

- all of them capable of sending feedback regarding how useful and correct the suggestions are

- said feedback is analyzed regularly and used to refine the ruleset

jamessb5y ago

The results page generated by the script could have checkboxes to mark each suggestion as useful/not-useful/incorrect and a submit button, with this feedback saved in MySQL.

(I'm not sure whether this qualifies as "without ... services")

jopsen5y ago

> Software in the 21st century

Early 21s century -- hopefully there is more to come :D

ggm5y ago

Grauniad and the fflong riots...

j / k navigate · click thread line to collapse