Lessons learned from doing the one billion row challenge

(foojay.io)

29 points · anthony88 · 2y ago · 11 comments

sampo 2y ago

> My implementation ... All the results are incorrect. The station names should be sorted alphabetically but the last station showing is İzmir and it should be Zürich.

It is easy to forget that place names can contain non-ASCII characters. Since this is a speed contest, I wonder how slow Java's default library implementation for ordering Unicode strings alphabetically is.
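
A minimal sketch of how that result can come about, assuming the implementation sorts station names with Java's natural `String` order (the class and method names here are made up for illustration). `String.compareTo` compares UTF-16 code units, so İ (U+0130) lands after every ASCII letter, Z included:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NaturalOrderDemo {
    // Sorts with String.compareTo, which compares UTF-16 code units,
    // not positions in any human alphabet.
    static List<String> sortNaturally(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the sample independent of the source file encoding.
        List<String> stations = List.of(
                "\u0130zmir",   // Izmir with dotted capital I (U+0130)
                "Z\u00fcrich",  // Zurich with u-umlaut (U+00FC)
                "Abha");
        System.out.println(sortNaturally(stations));
    }
}
```

Sorted this way, the sample comes out as Abha, Zürich, İzmir, which matches the symptom in the quote.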

Edit: Apparently there is no universal way to order words alphabetically; it depends on the (human) language in question.

For example, İzmir is in Turkey, and in Turkish alphabetical ordering the dotted capital İ comes after the dotless capital I. In Turkish, Ö also comes right after O, whereas in Swedish the special letters Å, Ä and Ö sit at the very end of the alphabet.

How are you supposed to deal with this in this contest? Are you somehow supposed to know that Özalp is a town in Turkey, and thus comes after O in alphabetical ordering, but Örebro is a town in Sweden and should be sorted at the very end of the alphabet, after Z, Å and Ä?
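
Here is a rough sketch of what "depends on the language" looks like in Java, using `java.text.Collator` from the standard library. The class name, the `sortFor` helper and the extra sample name Odense are my own additions for illustration:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleCollationDemo {
    // Returns a copy of names sorted by whatever collation rules
    // the running JDK provides for the given locale.
    static List<String> sortFor(Locale locale, List<String> names) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        // Zurich, Orebro and Ozalp with their accented letters, plus plain Odense.
        List<String> names = List.of("Z\u00fcrich", "\u00d6rebro", "\u00d6zalp", "Odense");
        for (String tag : List.of("en", "sv", "tr")) {
            System.out.println(tag + ": " + sortFor(Locale.forLanguageTag(tag), names));
        }
    }
}
```

With English rules the Ö names sort next to O, so Zürich ends up last. I would expect the Swedish line to push the Ö names past Zürich, but that depends on the locale data shipped with the runtime, so treat the `sv` and `tr` output as something to check rather than a given.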

jerf 2y ago

If you want the search term for this, it's "collation", since this is one of the odder such terms if you don't already know it. Here's Too Much Information about Unicode collation: https://www.unicode.org/reports/tr10/

sampo 2y ago

I found an online version

https://icu4c-demos.unicode.org/icu-bin/collation.html

but there you first have to select the language from the drop-down menu. So in general, you would first need to know the country where your weather station was located, before you can correctly collate its name.

I don't believe the fastest entries are doing all this(?)

Edit: In the examples [1], the guy writes a Polish city using its English name "Cracow". So you can't choose the alphabetical ordering based on the geographical location of the weather station; you would need to somehow detect which language each name in the data is written in.

[1] https://www.morling.dev/blog/one-billion-row-challenge/

Edit2: I guess you could declare that either the "Default Unicode Collation Element Table (DUCET)", or perhaps the American English "en-US-u-va-posix" locale is the correct way to alphabetize.
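
A small sketch of that idea: pick one collator up front and use it for the output order regardless of where a station is. I use the JDK's `Locale.ROOT` collator as a stand-in here. That is an assumption on my part, and the JDK's built-in rules are not guaranteed to match DUCET (a library such as ICU4J would be the place to look for that). The class name, method name and temperature values are made up:

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class FixedCollationDemo {
    // Builds a result map whose keys are ordered by one fixed collator,
    // independent of the machine's default locale.
    static TreeMap<String, Double> orderedResults(Map<String, Double> means) {
        // Assumption: the root-locale rules are the agreed-upon ordering.
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.TERTIARY);
        // Caveat: keys the collator considers equal collapse into one entry.
        TreeMap<String, Double> ordered = new TreeMap<>(collator);
        ordered.putAll(means);
        return ordered;
    }

    public static void main(String[] args) {
        // Placeholder values, not real measurements.
        Map<String, Double> means = Map.of(
                "Z\u00fcrich", 9.3,
                "\u00d6rebro", 6.0,
                "Abha", 18.0);
        System.out.println(orderedResults(means).keySet());
    }
}
```

The design point is only that the comparator is chosen explicitly, so every run agrees on which station comes last.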

hobs 2y ago

Basically the question comes down to collation: if you are comparing characters, which one comes first? That's all down to the choice of collation.

patmorgan23 2y ago

Obligatory link to the "Plain Text" conference talk.

https://youtu.be/gd5uJ7Nlvvo?si=5PZHV5n4jUEVcxlZ

kubb 2y ago

JVM arguments as an optimization technique give me that winter melancholy.

NicoJuicy 2y ago

This is weird. I've read about the 1brc, and the first thing I remember is that you need to establish the baseline result on your PC vs. the metric and then normalize the results to that benchmark.

This post doesn't seem to take that into account.

netcraft 2y ago

Is anyone doing the challenge on other platforms besides Java?

flopsamjetsam 2y ago

Rust:
* https://github.com/gunnarmorling/1brc/discussions/57
* https://github.com/mtb0x1/1brc

My friend also claims he is going to try it in Prolog. We'll see :)

Yasuraka 2y ago

I know of this one in Go, where the author goes through his loop of trial-and-profile with measurements and flamegraphs along the way.

https://www.bytesizego.com/blog/one-billion-row-challenge-go
