Leveraging SIMD: Splitting CSV Files at 3Gb/S

(blog.tinybird.co)

89 points | __exit__ | 4y ago | 41 comments

zwegner4y ago

Pretty similar article from very recently: https://nullprogram.com/blog/2021/12/04/

Discussion: https://news.ycombinator.com/item?id=29439403

The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.

But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)

davidm17294y ago

Hi, I'm one of the authors of the post

Thanks for pointing us to CLMUL. I'm not familiar with this kind of multiplication, but converting the quote bitmask to a quoted bitmask would certainly make it faster. With this new bitmask, we could negate it and AND it with the newline mask, generating a mask of newlines that are not inside quotes. Getting the last newline would then be a simple CLZ of that mask, and there wouldn't be a need to resort to byte-by-byte processing.

In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D
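The mask steps described above can be sketched in portable C. This is only an illustration under assumptions, not the post's code: it handles a single block of at most 64 bytes, assumes the block starts outside a quoted field, and replaces the CLMUL instruction with an equivalent shift-and-XOR prefix computation. The names `prefix_xor`, `byte_mask` and `last_unquoted_newline` are made up for the example.

```c
#include <stdint.h>

/* Running XOR of all bits up to and including each position.
   Same result as the low 64 bits of a carry-less multiply by all-ones. */
static uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

/* Bit i of the result is set when buf[i] == c (requires n <= 64).
   A SIMD version would get this from a compare plus movemask. */
static uint64_t byte_mask(const char *buf, int n, char c) {
    uint64_t m = 0;
    for (int i = 0; i < n; i++)
        if (buf[i] == c) m |= (uint64_t)1 << i;
    return m;
}

/* Index of the last newline outside quotes in buf[0..n), or -1 if none. */
static int last_unquoted_newline(const char *buf, int n) {
    uint64_t quotes   = byte_mask(buf, n, '"');
    uint64_t newlines = byte_mask(buf, n, '\n');
    /* Set from each opening quote up to just before its closing quote. */
    uint64_t quoted   = prefix_xor(quotes);
    uint64_t outside  = newlines & ~quoted;
    int idx = -1;
    while (outside) { idx++; outside >>= 1; }  /* position of the highest set bit */
    return idx;
}
```

The final loop stands in for the CLZ step; with GCC or Clang it could be replaced by a count-leading-zeros builtin.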

jart4y ago

PMOVMSKB/BSF/POPCNT takes serious wizardry, but instructions like PCLMULLQLQDQ make you feel like Gandalf. It's defined:

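    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } pair;  /* 128-bit product as two halves */

    /* index of the highest set bit; x must be nonzero */
    static int bsr(uint64_t x) { int i = 0; while (x >>= 1) i++; return i; }

    /* carry-less multiply: long multiplication with XOR instead of addition */
    pair clmul(uint64_t a, uint64_t b) {
      uint64_t t, x = 0, y = 0;
      if (a && b) {
        if (bsr(a) < bsr(b)) t = a, a = b, b = t; /* optional: loop over the shorter operand */
        for (t = 0; b; a <<= 1, b >>= 1) {
          if (b & 1) x ^= a, y ^= t;  /* x: low 64 bits, y: high 64 bits */
          t = t << 1 | a >> 63;       /* t collects the bits shifted out of a */
        }
      }
      return (pair){x, y};
    }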

There's a famous paper on how it can perform polynomial division at 40gbps. It's really cool that it has practical applications in things like CSV too. https://www.intel.com/content/dam/www/public/us/en/documents...

zwegner4y ago

CLMUL in general is a bit weird to wrap your head around, but a CLMUL with -1 isn't too tricky: it's like a running 1-bit sum, or in other words, each bit in the result is the parity of all the bits up to that point in the multiplier.

> In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D

I was talking about using bitwise operations with the quote/escape/newline masks already computed (like in the blog post I linked), rather than a byte-by-byte loop. But yeah, CLMUL is better anyways :)
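To make the "running 1-bit sum" remark concrete, here is a small sketch that computes the low 64 bits of a carry-less multiply bit by bit and compares it with an explicit running parity. The function names `clmul_lo` and `running_parity` are invented for this example, and only the low half of the product is modelled.

```c
#include <stdint.h>

/* Low 64 bits of the carry-less product a * b: XOR together a copy of a
   shifted left by i for every set bit i of b. */
static uint64_t clmul_lo(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) r ^= a << i;
    return r;
}

/* Bit i of the result is the parity of bits 0..i of x. */
static uint64_t running_parity(uint64_t x) {
    uint64_t r = 0, p = 0;
    for (int i = 0; i < 64; i++) {
        p ^= (x >> i) & 1;   /* parity of everything seen so far */
        r |= p << i;
    }
    return r;
}
```

For any `x`, `clmul_lo(x, UINT64_MAX)` and `running_parity(x)` should agree. With a quote mask as `x`, the set bits of the result mark the bytes between an opening quote and its closing quote.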

gpderetta4y ago

CLMUL is quite interesting. I learned about it when going in depth on how multiplications help with hashing.

A multiplication is in practice:

- a sum over
- a series (i.e. one for each bit set in the multiplier)
- of shifts (where the shift amount is the index of that bit in the multiplier)

The shifting and the combining are great for hashing as they "distribute" each bit around.

CLMUL simply replaces the addition in step one with xor (which can also be thought of as single-bit carryless addition).
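As a rough illustration of that description, the two functions below differ only in how the shifted copies are combined. They are a teaching sketch with made-up names (`mul_shift_add`, `mul_shift_xor`), keep only the low 64 bits, and are not how the hardware instruction is implemented.

```c
#include <stdint.h>

/* Ordinary multiplication (mod 2^64): add a shifted copy of a
   for every set bit of the multiplier b. */
static uint64_t mul_shift_add(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) r += a << i;
    return r;
}

/* Carry-less multiplication (low 64 bits): same shifts,
   but combined with XOR so no carries propagate. */
static uint64_t mul_shift_xor(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) r ^= a << i;
    return r;
}
```

With `a = b = 3`, the shifted copies are binary 011 and 110: adding them gives 9, while XORing them gives 5.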

mattewong4y ago

Even with the CLMUL trick, CSV parsing does not play nice. It can be made to work for JSON parsing because you can make more assumptions. With CSV, it only works smoothly if you are willing to accept a subset of what most spreadsheet programs accept i.e. to assume your CSV is "well-formed". Considering for example that the following three cells:

AA"A,"AA""A","A"A"A

when opened in Excel will all give you the same value, using CLMUL to normalize will require many repeated additional SIMD operations-- probably at least 8 if not more. At some vector size it will be worth it, but not clear at 256. The irony is, if you are stuck with CSV input, then the fact that you couldn't get a better format/encoding also suggests that you can't assume your CSV is "well-formed"

jart4y ago

That's not what Python and Google Sheets do.

    >>> list(csv.reader(['''AA"A,"AA""A","A"A"A'''], dialect='excel'))
    [['AA"A', 'AA"A', 'AA"A']]

Has the CSV format been standardized somewhere?

mattewong4y ago

What I can tell you is that if you save a file with 'AA"A,"AA""A","A"A"A' (excluding the surrounding single-quotes) and then double-click to open in Excel, you get 3 cells with the exact same values. Furthermore, if you run `echo 'AA"A,"AA""A","A"A"A' | xsv select 1,2,3` you again get the same 3 values. For people working with CSV, it's far more likely that the user cares more about consistency with Excel than consistency with some python lib or with Google sheets-- neither of which are used much, compared to Excel, in the worlds where CSV tends to reside (at least in my experience)

mattewong4y ago

Huh? your comment proves exactly what my prior post said, which is that you end up with 3 equal values that were each represented, in the input, in different ways. Looks like csv.reader + dialect=excel is doing exactly that.
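One lenient rule set that would produce the same value for all three cells can be sketched as a small state machine. This is an assumption based only on the output shown above, not on how Excel, xsv or Python's csv module are actually implemented, and `parse_field` is a hypothetical helper: a quote opens quoted mode only as the first character of a field, `""` inside quoted mode is an escaped quote, and after the closing quote any remaining characters (quotes included) are taken literally.

```c
/* Parse one field starting at s, write the unescaped value to out
   (NUL-terminated), and return a pointer to the comma, newline or NUL
   that ended the field. out must have room for strlen(s) + 1 bytes. */
static const char *parse_field(const char *s, char *out) {
    int quoted = (*s == '"');   /* only a leading quote opens quoted mode */
    if (quoted) s++;
    while (*s) {
        if (quoted) {
            if (*s == '"') {
                if (s[1] == '"') { *out++ = '"'; s += 2; }  /* "" is an escaped quote */
                else { quoted = 0; s++; }                    /* closing quote */
            } else {
                *out++ = *s++;   /* commas and newlines are data here */
            }
        } else {
            if (*s == ',' || *s == '\n') break;  /* end of field */
            *out++ = *s++;       /* quotes here are ordinary characters */
        }
    }
    *out = '\0';
    return s;
}
```

The data-dependent branches in this loop are what make the lenient dialect awkward to express as a fixed sequence of mask operations.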

zwegner4y ago

Good points all around. Not sure what the OP's requirements are, but judging by their current code, CLMUL should do nicely (or they have a bug).

And also, thanks for that example. Clearly I don't know CSV well enough--are quotes in fields that don't start with a quote not special?

mattewong4y ago

If their data comes from a controlled bubble and they need not assume "real-world" data, then CLMUL might do nicely but best case it would likely only be a marginal improvement (and even then I would be willing to bet, not at anything less than 512 bit vector sizes). Best case, it still needs additional vector calls to support quoting. Obviously, if no quoting will be supported, it's even simpler, but then you also cannot support commas or newlines inside of cell values and are getting so far from "CSV" that you might as well just say you have pipe-delimited data which happens to use comma instead of pipe in which case you don't need CLMUL. If quoting needs to be supported, CLMUL will still require a number of repeated passes, shifts etc to deal with the various cases including an escaped first quote char, last quote char, non-first-or-last quote char and embedded commas/newlines.

pclmulqdq4y ago

The carryless multiplication instructions are amazing and people should use them more often. They are just so poorly explained that they feel like magic.

jagrsw4y ago

Not sure how the author of this entry on HN managed to change the original title from

gigabytes per second

to

gigabits per siemens

HPsquared4y ago

Staying with Physics, "Gb/S" is Gigabarns per Siemens. Some relation of electrical conductance with cross-sectional area.

The barn is a unit of cross-sectional area, based on the Uranium nucleus (area 1 barn). Uranium is pretty large in atomic terms; the name is from the idiom "couldn't hit the broad side of a barn".

jagrsw4y ago

And since 1/S = 1Ω, it'd be Gigabarnohm.

robertlagrant4y ago

Hence the old joke: how many Gigabarnohms does it take to start a circus?

Answer, of course, one million.

__exit__OP4y ago

Autocorrector issues + fast fingers to click on submit without double checking. Sorry for that.

Whoever fixed the title, thank you :D

CodesInChaos4y ago

> fixed the title

It still shows as "3Gb/S" for me, instead of "3GB/s"

jeltz4y ago

Isn't it easier to write 3GbΩ instead of 3Gb/S?

Sebb7674y ago

Probably auto-capitalization gone wrong. Or some very new code ;)

MisterTea4y ago

Genetic code most likely.

mattewong4y ago

Stay tuned for a SIMD-powered CSV parser library and standalone utility about to drop this weekend. It's alpha, but tests show it to be faster than anything else we could get our hands on.

mattewong4y ago

As promised: https://github.com/liquidaty/zsv

liuliu4y ago

Splitting a CSV file into chunks and processing them independently won't necessarily be wrong (although there are implementations out there, which I won't name, that would be, because they guess). The trick, however, requires scanning twice: https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...

Nice article otherwise!

michaelg7x4y ago

Presumably solving the same kind of delimiter-finding issues as Hyperscan? https://news.ycombinator.com/item?id=19270199

michaelg7x4y ago

I'm sorry, I don't mean Hyperscan, I mean simdjson [0]. I think I got confused by my recollection of Lemire/Langdale.

[0] https://github.com/simdjson/simdjson

Tuna-Fish4y ago

Why is the unit expression in topic messed up?

rwmj4y ago

Nice, but I'm afraid real world CSVs are a lot more complicated than described so don't use this code in production.

mschuster914y ago

If you're doing user-supplied CSVs, definitely... but if you are ingesting CSVs from a known source with known format (<insert audible sigh here>) it can definitely make sense to use a high-speed optimized ingester.

One might wonder if it might be worth the time to look into optimising the runtimes of various languages. I took a look, all operate on naive byte-by-byte scanning, and all sans PHP are written in the respective language which means any form of SIMD optimization is right off the table (okay, maybe something could be done in Java, but it seems incredibly complex, see https://www.morling.dev/blog/fizzbuzz-simd-style/):

- PHP isn't optimized anywhere, but at least it's C: https://github.com/php/php-src/blob/1c0e613cf1a24cdc159861e4...

- Python's C implementation is the same: https://github.com/python/cpython/blob/main/Modules/_csv.c

- Java doesn't have a "standard" way at all (https://www.baeldung.com/java-csv-file-array), and OpenCSV seems the usual object-oriented hell (https://sourceforge.net/p/opencsv/source/ci/master/tree/src/...).

- Ruby's CSV is native Ruby: https://github.com/ruby/ruby/blob/bd65757f394255ceeb2c958e87...

__s4y ago

Python's csv imports _csv for core functionality, which is C: https://github.com/python/cpython/blob/main/Modules/_csv.c

mschuster914y ago

Thanks! Updated accordingly.

clscott4y ago

Perl's best known library Text::CSV has both a pure-perl and a C implementation.

Here is the C version

https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.xs

nickpeterson4y ago

It’s funny, csv files are so common and yet many mainstream languages don’t even attempt a decent parser baked in. I think dotnet has 3-4 different ones and as I recall they’re all pretty slow.

jwandborg4y ago

There's multiple dialects of CSV. Besides the more standardish dialect there are some weird ones that prevent some types of optimization. I remember Apple's "Enterprise Partner Feed" had a dialect I've never seen elsewhere so far. Columns were separated by 0x01, rows were separated by 0x02 0x0A.

The row separator being two bytes throws a wrench in most parsers.
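A minimal sketch of scanning that kind of dialect, assuming only what is stated above (columns separated by 0x01, rows by the two-byte sequence 0x02 0x0A) and ignoring any quoting or escaping the real feed may have. `find_row_end` and `count_columns` are hypothetical helpers.

```c
#include <stddef.h>

/* Offset of the next row separator (0x02 0x0A) in buf[0..n), or n if none.
   A lone 0x0A that does not follow 0x02 is treated as data. */
static size_t find_row_end(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        if (buf[i] == 0x02 && buf[i + 1] == 0x0A) return i;
    return n;
}

/* Number of 0x01-separated columns in one row of length len. */
static size_t count_columns(const unsigned char *row, size_t len) {
    size_t cols = 1;
    for (size_t i = 0; i < len; i++)
        if (row[i] == 0x01) cols++;
    return cols;
}
```

The wrench is visible in `find_row_end`: a parser that splits rows on any single 0x0A byte would cut a row wherever a field contains a newline.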
