Meanwhile, RFC4180 takes less time to read than this entire article.
So true about RFC 4180. Admittedly, this post got out a little early; support for the format was slated for the first of next month...
CSV is a simple storage format for data. Its simplicity, readability, and portability make it popular. I think that any attempt to improve it will be a failure.
I must say that CSV generally suffices for tabular data. The only annoyance is that conventions differ internationally, as the comma is often used as the decimal separator. I think CSV should always be implemented with a comma as the column separator and a dot as the decimal separator, regardless of country. But applications such as Excel do not accept this format internationally.
It's a problem solved decades ago with solutions we've failed to adopt. Weird, buggy, poorly parsable CSV is still somehow the norm.
Not saying you should, but if you want to change, the answer is already there. Change has to start somewhere...
> We tried using the control characters, and also tried configuring various editors to show the control characters by rendering the control picture characters.
> First, we encountered many difficulties with editor configurations, attempting to make each editor treat the invisible zero-width characters by rendering with the visible letter-width characters.
> Second, we encountered problems with copy/paste functionality, where it often didn't work because the editor implementations and terminal implementations copied visible letter-width characters, not the underlying invisible zero-width characters.
> Third, users were unable to distinguish between the rendered control picture characters (e.g. the editor saw ASCII 31 and rendered Unicode Unit Separator) versus the control picture characters being in the data content (e.g. someone actually typed Unicode Unit Separator into the data content).
https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
CSO is a stormwater industry term for "Combined Sewer Overflow." They happen in older cities where storm runoff and raw sewage (poop) go into the same sewer system. When there is a lot of rain, the wastewater treatment plants overflow, and then raw sewage runs into waterways.
https://en.wikipedia.org/wiki/Combined_sewer#Combined_sewer_...
If you're typing in CSV manually, escape with \
If you're exporting to CSV, the program already knows which part is data and which part is the next cell, so again the program can escape with \
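A minimal sketch of the backslash-escaping idea above (note this is not standard RFC 4180 CSV; the field and function names here are hypothetical):

```python
# Backslash-escaped separated values: escape the separator and the
# backslash itself with "\". Not RFC 4180; just the scheme proposed above.
def escape_field(field: str, sep: str = ",") -> str:
    return field.replace("\\", "\\\\").replace(sep, "\\" + sep)

def parse_line(line: str, sep: str = ",") -> list[str]:
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # take the escaped char literally
            i += 2
        elif ch == sep:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

line = ",".join(escape_field(f) for f in ["a,b", "c\\d", "plain"])
print(parse_line(line))  # round-trips the original fields
```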
Most good implementations are flexible enough that they might be configurable to your proposed pseudo CSV. (Or even DSV. Or USV. Etc.) But I'd rather just not need to, and the sanest default for any CSV library is the standard format.
(Or even better … just emit newline-terminated JSON. Richer format, less craziness than CSV, parsers still abound.)
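For illustration, a sketch of the newline-delimited JSON approach, where each record is one JSON value per line and the format's own string escaping handles commas, quotes, and newlines in the data:

```python
import json

# Newline-delimited JSON (NDJSON): one JSON object per line.
# json.dumps escapes embedded newlines as \n, so each record
# stays on a single line with no CSV-style quoting rules needed.
rows = [
    {"name": "Ada, Countess of Lovelace", "note": 'said "hello"\nand left'},
    {"name": "plain", "note": "nothing special"},
]
lines = [json.dumps(row) for row in rows]
print("\n".join(lines))

# Reading back is one json.loads per line.
parsed = [json.loads(line) for line in lines]
assert parsed == rows
```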
¹(RFC 4180: "," is the field separator, CRLF is the row separator. A comma or a CRLF can be escaped by surrounding the entire field in double quotes, and a double quote itself is escaped by quoting the field and doubling the internal double quote.)
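Python's standard-library `csv` module follows these quoting rules, so a quick sketch can show them in action:

```python
import csv
import io

# With QUOTE_MINIMAL, only fields containing the delimiter, the quote
# character, or a line terminator get double-quoted; embedded double
# quotes are doubled, per RFC 4180.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(["plain", "has,comma", 'has "quotes"'])
print(buf.getvalue())
# plain,"has,comma","has ""quotes"""
```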
And why would you "highly prefer you just emit standard CSV"? What is the benefit of insisting on adherence to the original standard, especially if the modification fixes something that is broken?
n.b. not worth your time. tl;dr: let's replace the comma with the poop emoji because commas occur in data.
There's already a solution to that (obviously). The best argument a contrarian could make is that you'll "learn about Unicode," by which they'd mean the words "basic multilingual plane" are included at one point.