I work with large volumes of data from many different sources. I’m lucky to get them to send csv. Of course there are better formats, but all these sources aren’t able to agree on some successful format.
Csv that’s zipped is producible and readable by everyone. And that makes it more efficient.
I’ve been reading these “everyone is stupid, why don’t they just do the simple, right thing and I don’t understand the real reason for success” articles for so long it just makes me think the author doesn’t have a mentor or an editor with deep experience.
It’s like arguing how much mp3 sucks and how we should all just use flac.
The author means well, I’m sure. Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation.
All that said (and perhaps counter to what I said), I do hope "Unicode Separated Values" takes off. It's essentially just a slight tweak to CSV where the delimiters are special unicode characters, so you don't have to have complicated quoting/escaping logic, and it also supports multiple sheets (i.e. a workbook) in a single file.
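A simplified sketch of the idea, assuming the draft USV delimiters (the Unicode "control picture" characters ␟ for units, ␞ for records, ␝ for groups); the real spec has more detail, but the point is that parsing needs no quoting logic at all:

```python
# Simplified sketch of "Unicode Separated Values" (USV), assuming the draft's
# control-picture delimiters. Because the delimiters are characters that never
# appear in normal data, no quoting or escaping logic is needed.
UNIT = "\u241F"    # ␟ separates fields within a record
RECORD = "\u241E"  # ␞ separates records (rows)
GROUP = "\u241D"   # ␝ separates groups (sheets), giving a workbook in one file

def parse_usv(text):
    """Return a list of sheets, each a list of rows, each a list of fields."""
    return [
        [record.split(UNIT) for record in group.split(RECORD) if record]
        for group in text.split(GROUP)
    ]

data = "a" + UNIT + "b,with,commas" + RECORD + "c" + UNIT + "d"
print(parse_usv(data))  # [[['a', 'b,with,commas'], ['c', 'd']]]
```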
And the answer to that is always no. You will think it's yes because it works for you, but when you send it to someone who has a different Excel version or simply different regional settings, it won't work. The recipient will first have to figure out what dialect you used to export.
If Excel had a standardised "Save as USV" option it would solve so many issues for me.
I get so many broken CSVs from third-parties
Commas can be typed by anyone on any keyboard and readable by anyone.
Special Unicode Characters(tm) can't be typed by anyone on any keyboard and can be read by no one.
Convenience is a virtue.
https://www.iana.org/assignments/media-types/text/tab-separa...
Besides all the downsides CSV has, as soon as it's not only machine-to-machine communication and a human is involved, CSV is just simple enough.
And really is in search of a problem to solve.
If you have important data being shuffled around systems, pick something with a specification instead.
I also think we shouldn't blindly dismiss the performance drawbacks of CSV when working with data pipelines. At even modest scales it becomes hard to work with very large CSVs because the data often doesn't fit into memory, a problem easily solved by Parquet and other formats assuming you only need a subset.
Csv is not very performant, but it doesn’t matter for these use cases.
I’ll also add that I’m not working with the csvs, they are just I/o. So any memory issues are handled by the load process. I certainly don’t use csvs for my internal processes. Just for when someone sends me data or I have to send it back to them.
That being said my workstation is pretty big and can handle 10s of gigs of csv before I care. But that’s usually just for dev or debugging and anything that sticks around will be working with data in some proper store (usually parquet distributed across nodes).
Parquet and Avro are widely supported in backend languages and also in data analysis. I don't think the article is talking about exported-like-a-jpeg, but instead exported-like-a-daily-report-run: the data scientist doing the exporting is probably using R or Pandas instead of Excel, and can reasonably be expected to read https://arrow.apache.org/docs/r/reference/read_parquet.html.
"Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation."
Hm. Not saying csv doesn’t have its issues, but I don’t think the author made a convincing argument.
A lot of the issues the author brought up didn’t sound that bad and/or it sounds like he never looked at the source data first.
If you’re doing work with large datasets, I think it’s a good practice to at least go and look at the source data briefly to see what to expect.
This will give you a good idea of the format it outputs, data types, some domain context, etc. or some combination thereof and I don’t think it even takes that long.
Also, it reminds me of the arguments against excel in a way. Most people know what a csv is, more or less how to open it, and don’t need too much context when discussing the file. Someone will quickly understand if you tell them the file isn’t delimited properly right away. These are pros that shouldn’t be taken for granted.
Again, I’m not saying csv doesn’t have issues or that there aren’t better alternatives, simply that I didn’t find this particular argument convincing.
But the same is true for csv, and they are not readable by everyone, since you don't always know how to read them; there is not enough info for that
Also, it's not a good reflection on "deep experience" if it leads to reflexive defense of common stupid things people do, with wrong analogies (e.g., flac is less efficient, so it's more like csv)
"You give up human readable files,..."
I was genuinely interested in some alternative suggestions - but the human readableness of csv is what makes it so sticky imo.
The effervescent suggestion of a brighter, more logical, even obvious solution is often a clear indicator of domain inexperience or ignorance.
CSV created tons of issues regarding encoding, value separation etc.
I started talking to our customers and was able to define interfaces with a better, aligned format. json made my life easier.
In some senses, I think internet culture (maybe modern intellectual culture generally) gets stuck in these repetitive conversations.
Re-prosecuting them, seemingly without knowing about all the previous times the conversation has been had.
I think that's a little unfair, it sounds like the author does have a decent amount of experience working with real-world CSV files:
I remember spending hours trying to identify an issue that caused columns to "shift" around 80% into a 40GB CSV file, and let me tell you, that just isn't fun.
If only CSV were CSV, as opposed to some form that's 80-90% CSV by line count with enough oddities to really make the parser ugly and hard to read.
See, the sweet spot isn't something completely unstructured, because then you feel justified in throwing up (your hands) and declaring defeat. The sweet spot is a file that's sufficiently close to being structured you can almost parse it nicely, but has enough bad lines you can't go in and fix them all by hand in a reasonable timeframe, and you can't tell upstream to get their shit in order because it's only a few lines.
But I’d say the error rate is actually very low, maybe .1-1% and nowhere near 10-20% of data being messed up.
"Of course, we can't conclude that you should never export to CSV. If your users are just going to try to find the quickest way to turn your data into CSV anyway, there's no reason why you shouldn't deliver that. But it's a super fragile file format to use for anything serious like data integration between systems, so stick with something that at the very least has a schema and is more efficient to work with."
It's the same with csv. They come in all kinds of formats because nobody agreed on the standard. Comma separated, semicolon separated, pipe separated, escaped, not escaped.
Every time I have to deal with csv I first have to figure out how to parse it in code.
So I think the author is right, we must agree on a better format because that is what friends do.
You are also right, because it's an illusion to think that this is going to change anytime soon. But who knows..
Plus, taking the data out of [proprietary software app my client's data is in] in csv is usually easy. Taking the data out in Apache Parquet is...usually impossible, but if it is possible at all you'll need to write the code for it.
Loading the data into [proprietary software app my client wants data put into] using a csv is usually already a feature it has. If it doesn't, I can manipulate csv to put it into their import format with any language's basic tools.
And if it doesn't work, I can look at the csv myself, because it's human readable, to see what the problem is.
90% of real world coding is taking data from a source you don't control, and somehow getting it to a destination you don't control, possibly doing things with it along the way. Your choices are usually csv, xlsx, json, or [shudder] xml. Looking at the pros and cons of those is a reasonable discussion to have.
It's also worth noting that SQLite can ingest CSV files into memory and perform queries on them directly - if the files are not too large, it's possible to bypass the sqlite format entirely.
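A sketch of that workflow using Python's stdlib `sqlite3` and `csv` modules (the CSV content and table name here are illustrative), equivalent in spirit to the sqlite3 CLI's `.import`:

```python
import csv
import io
import sqlite3

# Sketch: load a small CSV into an in-memory SQLite database and query it
# directly, never touching the sqlite file format.
raw = "name,amount\nalice,10\nbob,25\nalice,5\n"
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
columns = ", ".join(f'"{c}"' for c in header)
placeholders = ", ".join("?" * len(header))
con.execute(f"CREATE TABLE t ({columns})")
con.executemany(f"INSERT INTO t VALUES ({placeholders})", body)

total = con.execute(
    "SELECT SUM(CAST(amount AS REAL)) FROM t WHERE name = 'alice'"
).fetchone()[0]
print(total)  # 15.0
```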
I work with a provider who offers CSV exports as the only way to access data. Recently, we found they were including unsanitized user input directly in fields. They weren't even quoting these fields.
The platform "notified their quality assurance team ASAP" (like every other issue, we never heard back), but we had a deadline. This, of course, was a mess, but being able to quickly open the file and fix quotes was all it took. I shudder at the thought of trying to salvage a corrupted binary with zero help from the holder of the data.
The issue comes from CSV files looking easy to generate by hand, when in fact they are not.
In the French locale, the decimal point is the comma, so "121.5" is written "121,5". It means, of course, that the comma can't be used as a separator, so the semicolon is used instead.
It means that depending whether or not the tool that exports the CSV is localized or not, you get commas or you get semicolons. If you are lucky, the tool that imports it speaks the same language. If you are unlucky, it doesn't, but you can still convert it. If you are really unlucky, then you get commas for both decimal numbers and separators, making the file completely unusable.
There is a CSV standard, RFC 4180, but no one seems to care.
It’s unfortunate that there isn’t a single CSV format, but for historical reasons it is what it is. It’s effectively more like a family of formats that share the same file extension.
Excel actually has a convention where it understands when there is a line
sep=;
at the start of the file. By the way, in addition to differing separators, you can also get different character encodings.
Excel does understand a BOM to indicate UTF-8, but some versions of Excel unfortunately ignore it when the “sep=“ line is present…
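A sketch of emitting that hint from code; note that `sep=` is an Excel convention, not part of RFC 4180, so non-Excel readers may see the hint as a data row:

```python
import csv
import io

# Sketch: emit a semicolon-separated file with Excel's non-standard "sep=" hint
# on the first line. This is an Excel convention, not part of RFC 4180, and
# (per the caveat above) some Excel versions ignore a UTF-8 BOM when it is used.
buf = io.StringIO()
buf.write("sep=;\n")
writer = csv.writer(buf, delimiter=";")
writer.writerow(["id", "price"])
writer.writerow([1, "121,5"])  # a decimal comma needs no quoting here
print(buf.getvalue())
```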
Thank you! This is a game changer. As I was reading through these comments I was thinking how much better it would be if the separator could be specified in the file, but if I were to do that on my own, it would only be useful for my own stuff.
I’ll be trying this first thing tomorrow at work. I don’t do as much with CSVs as I used to, but am currently working with them a lot, and have a single entry that always throws the resulting Excel file off, and I can’t be bothered to figure out how to escape it (I spent too much time in the past messing with that and got nowhere). This sep= line will hopefully solve my issues.
https://en.wikipedia.org/wiki/Decimal_separator
https://commons.wikimedia.org/wiki/File:DecimalSeparator.svg
Overall, semicolon feels like the superior delimiter.
Most sensible people don't export formatted numbers (e.g. 100.000,00), but even those are pretty trivial to import.
i.e.:

```
"1,20","2,3",hello
"2,40","4,6",goodbye
```
If your tool reads CSV by doing `string_split(',', line);`, your tool is doing it wrong. There's a bunch of nuance and shit, which can make CSVs interesting to work with, but storing a comma in a field is a pretty solved issue if the tool in question has more than 5 minutes thought put into it.
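For illustration, Python's stdlib `csv` module handles the embedded-comma and doubled-quote cases that naive splitting gets wrong:

```python
import csv
import io

# The stdlib csv module handles embedded delimiters and doubled quotes,
# exactly the cases that string_split(',', line) gets wrong.
line = '"Smith, John",42,"said ""hi"""\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['Smith, John', '42', 'said "hi"']
```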
You should probably use PSV - Point Separated Variable! Obviously we would need to adjust PSV to account for correct French word order (and actually use French correctly). Britain and USA would use something called PVS2 instead as a data interchange format with France which involves variable @@@ symbols as delimiters, unless it is a Friday which is undefined. The rest of the world would roll its eyes and use PSV with varying success. A few years later France would announce VSP on the back page of Le Monde, enshrine its use in law (in Aquitaine, during the fifteenth century) but not actually release the standard for fifteen years.
The world is odd. We have to work within an odd world.
Interestingly enough, you and I could look at a CSV encoded data set with commas as decimal separators and work out what is going on. You'll need a better parser!
Heh. Ah, HN. Always good for a laugh line.
The advantages of CSV are pretty massive though - if you support CSV you support import and export into a massive variety of business tools, and there is probably some form of OOTB support.
You have a (usually) portable transport format that can get the information into and out of an enormous variety of tools that do not necessarily require a software engineer in the middle.
I'm also struggling with such a quick dismissal of human readable formats. It's a huge feature.
What happens when there's a problem with a single CSV file in some pipeline that's been happily running fine for years? You can edit the thing and move on with your day. If the format isn't human readable, now you may have to make and push a software update to handle it.
Of course, CSV is a terrible format that can be horribly painful. No argument there.
But despite the pain, it's still far better than many alternatives. In many situations.
```
while IFS="$(printf \\a)" read -r field1 field2...
do ...
done
```
This works just as well as anything outside the range of printing characters. Getting records that contain newlines would be a bit trickier.
Parquet is the opposite of simple. Even when good libraries are available (which they usually aren't), it is painful to read a Parquet file. Try reading a Parquet file using Java and the Apache Parquet lib, for example.
Avro is similar. Last I checked there are two Avro libs for C# and each has its own issues.
Until there is a simple format that has ubiquitous libs in every language, CSV will continue to be the best format despite the issues caused by under-specification. Google Protobuf is a lot closer than Parquet or Avro. But Protobuf is not a splitable format, which means it is not Big Data friendly, unlike Parquet, Avro and CSV.
I skimmed their docs a bit: https://parquet.apache.org/docs/
I would not look forward to implementing that.
It all seems rather complex, and even worse: not actually all that well described. I suppose all the information is technically there, but it's really not a well-designed, well-written specification that's easy to implement. The documentation seems like an afterthought.
This is probably why good libraries are rare.
> Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
Nope and nope.
What? When I worked at Google, concatenating protobuf strings was a common way to concatenate protobufs; they are absolutely splittable. People might not know it, but there is a reason they are designed like they are: it is to handle big data, as you say.
If you mean you can't split a single protobuf, sure, but you don't need to do that, it is like trying to split a single csv line, doesn't mean csv isn't splittable.
Edit: Maybe some of that tooling to work with protobufs is internal-only or not part of the external protobuf packages, though.
I laughed. We don't call somebody writing a poor varint serializer for in-house use, then discovering that it doesn't handle negative numbers and floats that well so slapping a couple of hotfixes on top of it that make it necessary to have a protocol specification file, "design".
They support CSV but not your CSV.
For example, how does quoting work? Does quoting work?
They had never had a customer do X on Y field, so they never quoted it nor added code to quote it if needed..
Of course, we did X in one entry. Took me too long to find that which obviously messed up everything after.
Eh, I don't think that's the problem. If that was the problem, there are a zillion ways to chunk files; .tar is probably the most ubiquitous but there are others.
The bigger problem is that Protobuf is way harder to use. Part of the reason CSV is underspecified is it's simple enough it feels, at first glance, like it doesn't need specification. Protobuf has enough dark corners that I definitely don't know all of it, despite having used it pretty extensively.
I think Unicode Separated Values (USV) is a much better alternative and as another poster mentioned, I hope it takes off.
If you're exporting files for machine consumption, please consider using something more robust than CSV.
People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)
There's also an implicit assumption to each use-case about whether the data can/should fit in memory or not, and how much RAM a typical machine would have.
As you mention, it's pretty standard to store and access compressed CSV files as .csv.zip or .csv.gz, which mitigates at least trading off the space issue for a performance overhead when extracting or searching.
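A sketch of the streaming pattern, using Python's stdlib `gzip` and `csv`, so the uncompressed data never has to sit in memory all at once (an in-memory buffer stands in for a real `.csv.gz` file here):

```python
import csv
import gzip
import io

# Sketch: stream-write and stream-read a gzipped CSV row by row, the common
# .csv.gz pattern mentioned above. A BytesIO stands in for a real file.
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(3):
        writer.writerow([i, i * i])

buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # [['id', 'value'], ['0', '0'], ['1', '1'], ['2', '4']]
```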
The historical reason a standard like CSV became so entrenched with business, financial and legal sectors is the same as other enterprise computing; it's not that users are ignorant; it's vendor and OS lock-in. Is there any tool/package that dynamically switches between formats internally? estimates comparative file sizes before writing? ("I see you're trying to write a 50Gb XLSX file...") estimates read time when opening a file? etc. Those sort of things seem worth mentioning.
Something more boring, like "Consider whether other options make more sense for your data exports than CSV". Plenty of people have suggested other good options in comments on this submission, such as for example sqlite. I think the post comes off as if I'm trying to sell a particular file format for all use cases, when what I had in mind when writing it was to discourage using CSV as a default. CSV has a place, certainly, but it offloads a lot of complexity on the people who are going to consume the data; in particular, they need to figure out how to interpret it. This can't necessarily be done by opening the file in an editor and looking at it; beyond a certain size you're going to need programming or great tools to inspect it anyway.
I was given an initial export of ~100 poor-quality CSV files totaling around 3TB (~5-6 different tables, ~50 columns in each) in size a few years back, and had to automate ingestion of those and future exports. We could've saved a lot of work if the source was able to export data in a friendlier format. It happened more than once during that project that we were sent CSVs or Excel sheets that had mangled data, such as zip codes or phone numbers with leading 0s removed. I think it is a good thing to inform people of these problems and encourage the use of formats that don't necessitate guessing data types. :shrug:
> People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)
This might be an interesting thing to follow up later, but would require a lot more work.
Sometimes a click-baity title is what you need to get a decent conversation/debate going. Considering how many comments this thread got, I'd say you achieved that even if sparking a lengthy HN thread had never been your intent.
The problem with parquet is it's complicated and you basically have to remap from parquet to whatever you're importing into because the people on the other side have remapped from whatever to parquet.
There are likely relationships and constraints there that you'll have to hack around - which is harder to do because the parquet tools sort of suck/aren't as flexible.
With CSV you can hack around any problem in the ETL process.
String encoding? Dates? Formatted numbers? Booleans (T/F/Y/N/etc)? Nested quotes? Nested CSV!?
How about intermediate systems that muck things up. String encoding going through a pipeline with a misconfiguration in the middle. Data with US dates pasted into UK Excel and converted back into CSV, so that the data is a mix of m/d/yy and d/m/yy depending on the magnitude of the numbers. Hand-munging of data in Excel generally, so that sometimes the data is misaligned WRT rows and columns.
I've seen things in CSV. I once wrote an expression language to help configure custom CSV import pipelines, because you'd need to iterate a predicate over the data to figure out which columns are which (the misalignment problem above).
- treat it as a join, and unroll by duplicating non-nested CSV data in separate rows for every element in the nested CSV
- treat it as a projection, have an extraction operator to project the cells you want
- treat it as text substitution problem; I've seen CSV files where every line of CSV was quoted like it was a single cell in a larger CSV row
You get nested CSV because upstream systems are often master/detail or XML but need to use CSV because everybody understands CSV because it's such a simple file format. Good stuff.
CSV gives you the freedom of choice how to write the data.
You need the French date format? You can write the French date format. You need # as separator? You can use # as separator.
Sure, you can't read any file without knowing its specification, but that's not the use case of CSV.
CSV is for data transfer between systems which know each other.
Add another munging to the list: ids that 'look like numbers', e.g. `0002345`, will get converted to `2,345`. Better be sure to prepend ', i.e. `'0002345`
Parsing RFC-compliant CSVs and telling clients to go away with non-compliant CSVs is not hard.
Parsing real world CSVs reliably is simply impossible. The best you can do is heuristics.
How do you interpret this row of CSV data?
1,5,The quotation mark "" is used...,2021-1-1
What is the third column? The RFC says that it should just be literally
> The quotation mark "" is used...
But the reality is that some producers of CSVs, which you will be expected to support, will just blindly apply double quote escaping, and expect you to read:
The quotation mark " is used...
Or maybe you find a CSV producer in the wild (let's say... Spark: https://spark.apache.org/docs/latest/sql-data-sources-csv.ht...) that uses backslash escaping instead.
5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
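To make the ambiguity concrete, here is the same field value serialized under both conventions and parsed with Python's `csv` module configured for each dialect (the Spark-style backslash behavior is approximated with `escapechar`):

```python
import csv
import io

# The same field value serialized under RFC 4180 doubled-quote rules and under
# backslash escaping (as e.g. Spark can be configured to emit). A reader must
# know which dialect the producer used; the bytes alone don't say.
rfc_line = '1,5,"The quotation mark "" is used...",2021-1-1\n'
backslash_line = '1,5,"The quotation mark \\" is used...",2021-1-1\n'

rfc_row = next(csv.reader(io.StringIO(rfc_line)))  # default: doubled quotes
bs_row = next(csv.reader(io.StringIO(backslash_line),
                         doublequote=False, escapechar="\\"))
print(rfc_row[2])  # The quotation mark " is used...
print(bs_row[2])   # The quotation mark " is used...
```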
And data type inference is a fun side project. I worked for a while on a fully generic CSV to SQL converter. Basically you end up with regex matches for different formats, and you keep a running tally of errors encountered and then do a best fit for the column. Usually a single CSV file is consistent with itself for weird formatting induced by whatever process the other side used. It actually worked really well, even on multi-gigabyte CSV files from medical companies that one of my clients had to analyze with a standard set of SQL reports.
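A toy sketch of that tally approach (the candidate patterns and error threshold here are illustrative, not what the original converter used):

```python
import re

# Toy sketch of column type inference by tallying regex mismatches.
PATTERNS = {
    "integer": re.compile(r"-?\d+"),
    "decimal": re.compile(r"-?\d+[.,]\d+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # checks shape, not validity
}

def infer_column_type(values, max_error_rate=0.3):
    # Try the more specific formats first; accept the first one whose mismatch
    # rate stays under the threshold, otherwise fall back to plain text.
    for name, pattern in PATTERNS.items():
        errors = sum(pattern.fullmatch(v) is None for v in values)
        if errors / len(values) <= max_error_rate:
            return name
    return "text"

print(infer_column_type(["1", "2", "oops", "4"]))       # integer
print(infer_column_type(["2021-01-01", "2021-02-30"]))  # iso_date
```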
Data exchange via CSV becomes a multiple-party negotiation between software (sometimes the other side is "set in stone", such as a third-party commercial system or Excel that is entrenched), users, and users' local configuration. However, while not well defined, none of these are intractable from a software engineering point of view. And if your system is valuable enough and needs to ingest data from enough places, it is NOT intractable to build auto-detection capabilities for all the edge cases. I have done it. Is it sometimes a challenge? Yes. Do project managers love it when you give 8-point estimates for what seems like it should be a simple thing? No. That is why the coding-kata RFC-compliance solution as a base first makes the most sense, and then you troubleshoot as bug reports come in about data problems with various files.
If you are writing a commercial, enterprise solution that needs to get it right on its own every time, then that becomes a much bigger project.
But do you know what is impossible? Getting the entire universe of other systems that your customers are using to support some new interchange format. Sorry, that system is no longer under active development. No one is around to make code changes. That is not compatible with the internal tools. For better or worse, CSV is the lingua franca of columns of data. As developers, we deal with it.
And yes, do support Excel XLSX format if you can. There are multiple third party libraries to do this in various languages of various quality.
As a developer, I have made a lot of my living dealing with this stuff. It can be a fun challenge, frustrating at time, but in the end as professionals we solve what we need to to get the job done.
A nice alternative I've used often is to construct a table and then give it an .xls extension, which Excel happily accepts and which requires much less user explanation than telling individual users how to get Excel to correctly parse a CSV.
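One common variant of this trick, sketched under the assumption that the "table" is plain HTML: Excel will open an HTML table saved with an .xls extension (recent versions first show a format-mismatch warning), with no CSV import dialog involved.

```python
# Sketch: write a plain HTML table and give it an .xls extension. Excel opens
# it as a spreadsheet (recent versions first warn that the format and
# extension don't match), so no one has to drive a CSV import dialog.
rows = [("Order", "Zip"), ("1001", "00234")]

cells = "".join(
    "<tr>" + "".join(f"<td>{value}</td>" for value in row) + "</tr>"
    for row in rows
)
html = f"<html><body><table>{cells}</table></body></html>"

with open("report.xls", "w", encoding="utf-8") as f:
    f.write(html)
```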
* What does missing data look like? The empty string, NaN, 0, 1/1-1970, null, nil, NULL, \0?
* What date format will you need to parse? What does 5/5/12 mean?
* How has multiline data been written? Does it use quotation marks, properly escape those inside multiline strings, or maybe just expect you to count the delimiters? And by the way, can delimiters occur inside bare strings?
And let me add my own question here:
what is the actual delimiter? Do you support `,`, `;` and `\t`?
What is the decimal delimiter, “.” or “,”?
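One heuristic answer to the delimiter question is Python's `csv.Sniffer`, which guesses the dialect from a sample; it is only a guess, so verify it against known data when it matters:

```python
import csv

# csv.Sniffer guesses the delimiter from a sample of the file. It is a
# heuristic, not a guarantee, but it handles the common cases above.
found = []
for sample in ["a,b,c\n1,2,3\n", "a;b;c\n1;2;3\n", "a\tb\tc\n1\t2\t3\n"]:
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    found.append(dialect.delimiter)
print(found)  # [',', ';', '\t']
```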
Most csv users don’t even know they have to be aware of all of these differences.
I am aware you can import a csv file in excel by manually defining the column types but few people use that.
I'd be fine with an extension of the csv format with one extra top row to define the type of each column.
And what fraction of those users would be able to do anything with another format?
> an extension of the csv format with one extra top row to define the type of each column
If the goal is foolproof export to Excel, use XLSX.
Especially if you work with teams from different countries, csv is hell. I always generate rfc-compliant csv; not once was it accepted from day one. Once, it took us two weeks to make it pass the ingestion process (we didn't have access to the ingest logs and had to request them each day, after the midnight processing), so in the end it was only 10 different tries, but still.
I once had an issue with json (well, not one created by me), and it was clearly a documentation mistake. And I hate json (I'm an XML proponent usually). Csv is terrible.
Parsing the CSV you have in front of you is not hard (usually). Writing a parser that will work for all forms of CSV you might encounter is significantly harder.
This is the problem though. Everyone thinks it is "obvious" and does their own broken implementation where they just concatenate values together with commas, and then outsources dealing with the garbage to whoever ends up with the file on their plate.
If a more complex, non-obvious format was required, instead of "easy I'll just concatenate values" they might actually decide to put engineering into it (or use a library)
And yet you can anyway if you are confident that your CSV won't contain anything that would mess it up.
No.
Of course there are better formats. But for many use cases friends encourage friends to export to CSV.
It's a little harder to load into a spreadsheet, but in my experience, way easier to reliably parse in any programming language.
But outside the data science field, my experience working on software these years is that it won't matter how beautiful your backoffice dashboards and web apps are: many non-technical business users will demand CSV import and/or export capabilities at some point, because it is easier for them to just dump all the data in a system into Excel/Sheets to make reports, or to bulk edit the data via export-excel-import, rather than dealing with the navigation model and maybe tens of browser tabs in your app.
I treat formats such as parquet as I generally do: I try to allow various different inputs, and produce standard outputs. Parquet is something I allow purely as an optimization. CSV is the common default all of my tools have (UTF-8 without BOM, international locale, comma separator, quoting at the start of the value optional, standards-compliant date format or unix timestamps). Users generally don't have any issue with adapting their files to that format if there's any difference.
- different physical types to store timestamps: INT96 vs INT64
- different ways to interpret timestamps before tzdb (current vs earliest tzdb record)
- different ways to handle proleptic Gregorian dates and timestamps
- different ways to handle time zones (since Parquet only has the equivalents of LocalDateTime and Instant, but no OffsetDateTime or ZonedDateTime and earlier versions of Hive 3 were terribly confused which is which)
- decimal data type was written differently, as a byte array in older versions and as int/byte array/binary in the newer ones
- Hadoop ecosystem doesn't support decimals longer than 38 digits, but the file format supports them
This line is emblematic of the paradigm shift LLMs have brought. It’s now easier to build a better tool than change everyone’s behaviour.
> You give up human readable files, but
What are we even doing here.
Because when I'm wrangling data from a human - a human who is stubbornly defending their own little island of business information like their employment depended on it[1] - a CSV is about as good as I am gonna get.
I had a bear of a time just convincing people to put their data in a delimited format, instead of a table inside a powerpoint presentation, or buried in sixty levels of Access joins, or in an SVG. I need data from "what is scroll wheel" sort of users.
If I am working system to system? That's a different requirement, a requirement that is apparently where the OP author is coming from.
[1] Because it kind of does. Having a unique platform is one of those priceless keys to being skipped in the thrice-yearly layoff rituals. Unfortunately, that means anyone approaching saying words like "integration" or "API" are shot on sight.
How do you verify it? What happens next?
Even Excel can handle it.
It is far safer to munge data containing tabs (convert to spaces, etc), than commas (remove? convert to dots? escape?).
The better answer is to use ASCII separators as Lyndon Johnson intended, but that turns out to be asking a lot of data producers. Generating TSV is usually easier than generating CSV.
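For example, with tab as the delimiter (Python's built-in `excel-tab` dialect), a field containing commas needs no quoting at all:

```python
import csv
import io

# With tab as the delimiter, the usual CSV quoting questions mostly disappear:
# a comma in a field is just data.
buf = io.StringIO()
writer = csv.writer(buf, dialect="excel-tab")
writer.writerow(["name", "note"])
writer.writerow(["Smith, John", "commas are just data here"])
print(buf.getvalue())
```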
This may shock some people, but Excel and other MS Office products aren't that common to have around for developers any more. Through the nineties, MS had a de facto monopoly on the desktop. But these days, a lot of developers use Macs, which are also popular with business people. Google Docs seems popular with startups; all companies I've been in for the last 12 years default to that.
Anyway, I've done this on a few teams where it just shortcuts the whole discussion about needing a bespoke UI to edit some table of stuff for managers. I've even replaced a failed project to build such a UI with simple stuff like this. All you need is a bit of validation when importing, and you can keep the last known good export in git and do pull requests to update the file. There's a weird dynamic where giving this level of control to some managers actually makes them feel more involved and engaged.
My preferred format to work with is actually ndjson (newline-delimited json). Mainly because it's easier to deal with nested objects and lists in that. Whenever people start putting lists in table cells or start adding dots or other separators to their column names to indicate some kind of hierarchy, ndjson is the better solution. I've seen all sorts of attempts by business people to stuff non-tabular data into a table.
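A minimal sketch of the ndjson round trip: one JSON object per line, so nested lists and objects need no table-cell contortions, and files can still be streamed line by line.

```python
import io
import json

# Minimal ndjson round trip: each record is one JSON object on its own line.
# Nested lists and objects just work; no delimiters or quoting rules to guess.
records = [
    {"name": "alice", "tags": ["admin", "ops"]},
    {"name": "bob", "tags": [], "address": {"city": "Paris"}},
]

buf = io.StringIO()
for record in records:
    buf.write(json.dumps(record) + "\n")

buf.seek(0)
parsed = [json.loads(line) for line in buf]
print(parsed == records)  # True
```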
With proper escaping the separator doesn't matter.
After that csv generating is pretty easy.
Last time I checked, it can't read it without errors if you just open the file directly, rather than using the import function and explicitly selecting UTF-8.
Commas are, by far, better delimiters than tabs in the grand scheme of things, with both expert and regular users considered.
I like that everyone knows about .csv files, and it's also completely human readable.
So for <100mb I would still use csv.
> 2. The only data type in CSV is a string. There is no null, there are no numbers.

Anything else must be agreed upon between producer and consumer (or, more commonly, a consumer looks at the CSV and decides how the producer formatted it). JSON doesn't include dates either, and you're not going to see people start sending API responses as Apache Parquet. CSV is fiiine.
I feel like USV (Unicode Separated Values) neatly improves CSV while maintaining most of its benefits.
Files in .csv.gz are more compact than anything else, unless you have some very-very specific field of work and a very compressible data. As far as I remember, Parquet files are larger than CSV with the same data.
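The compression claim is easy to sanity-check on toy data (a sketch; real ratios depend entirely on your data):

```python
import gzip

# Toy table: repetitive text like CSV tends to compress very well
rows = "id,name,score\n" + "\n".join(f"{i},item{i},{i % 10}" for i in range(1000))
raw = rows.encode("utf-8")
compressed = gzip.compress(raw)

# Round-trips losslessly, and any tool that reads CSV can read the
# decompressed stream
restored = gzip.decompress(compressed).decode("utf-8")
```
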
Working with the same kind of data in Rust, I see everything saved and loaded in CSV is lightning fast. The only thing you may miss is indexing.
Whereas saving to binary is notably slower. Data in a generic binary format becomes LARGER than in CSV. (Maybe if you define your own format and write a driver for it, you'll be faster, but that means no interoperability at all.)
A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, and GeoParquet are roughly 10x faster than saving to CSV. Additionally, the CSV is much larger than any other format.
[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...
What's the current consensus? Can you link to a summary article?
(Some still say GeoPackage is: https://mapscaping.com/shapefiles-vs-geopackage/ )
Like JSON.
CSV is just too simple and "user-friendly".
I suspect that not everybody here work exclusively with huge datasets and well-defined data pipelines.
On a practical side, if I want to follow suggestions, how do I export to Avro from Numbers/Excel/Google Sheets?
> 1. CSV is underspecified

Okay, so specify it for your use case and you're done? E.g. use RFC 3339 instead of the straw-man 1-1-1970, and define what a missing value looks like, which is usually an empty string.

> 2. CSV files have terrible compression and performance

Okay, who in their right mind uses a plain-text file to export 50 GB of data? Some file systems don't even support files that large. When you are at the stage of REGULARLY shipping around files this big, you should think about a database, not another file type to send via mail. Performance may be a point, but again, using CSV for gigantic files is wrong in the first place.

> 3. There's a better way

(Insert presentation of a file type I have never heard of.) There are lots of better ways to do this, but: CSV is extremely fast to implement, it is universally known unlike Apache Parquet (or Pickle or ORC or Avro or Feather...), and it is human-readable.

So in the end: use it for small data exports where you can specify everything you want, or anywhere you can import data, because most software takes CSV as input anyway.
For lots of data use something else.
Friends don't let friends write one-sided articles.
Like the NHS with their COVID data due to Excel limitations, or the gene-name conversion problems in scientific journals.
The same happens with a stupid number of laboratory-management systems and bioinformatics tools.
Honestly, the article is obviously biased, but we should at least think about moving away from CSV on non-customer-facing fronts.
Small files? JSON. Big files? SQLite or Parquet.
https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
install.packages("arrow")
and then I did ?read_parquet
to get an example. I tried the example, and got the error message below. This sort of error is really quite uncommon in R. So my answer to the "with little effort" is "no", at least for R.

> tf <- tempfile()
> write_parquet(mtcars, tf)
Error in parquet___WriterProperties___Builder__create() :
  Cannot call parquet___WriterProperties___Builder__create(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.

Nor can you control the format others want. The reason I have to export to CSV is, unfortunately, that the people I ship out to use Excel for everything, and even though Excel does support many different data formats, they either enjoy using .csv (it should be mentioned that the import feature in Excel works pretty damn well), or have some system written in VBA that parses .csv files.
[please kill me!]
Nothing beats CSV in terms of simplicity, minimal friction, and ease of exploring across diverse teams.
CSV does not deserve the hate.
Instead of breaking CSV by adding to it, I recommend augmenting it:
It would be useful to have a good standardized / canonical json format for things like encoding, delimiter, schema and metadata, to accompany a zipped csv file, perhaps packaged in the same archive.
Gradually datasets would become more self-documenting and machine-usable without wrangling.
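A minimal sketch of what such a sidecar could look like; every key name below is invented for illustration and is not from any standard:

```python
import json

# Hypothetical sidecar describing how to parse the accompanying CSV.
# All field names here are made up for illustration only.
sidecar = {
    "file": "orders.csv",
    "encoding": "utf-8",
    "delimiter": ",",
    "header": True,
    "columns": [
        {"name": "order_id", "type": "integer"},
        {"name": "placed_at", "type": "date", "format": "rfc3339"},
        {"name": "total", "type": "decimal"},
    ],
}

metadata_json = json.dumps(sidecar, indent=2)
```

Shipping orders.csv and this JSON in the same zip would cost almost nothing and remove most of the guesswork on the consumer side.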
We already have that. Dan Brickley and others put a lot of thoughtful effort into it <https://www.w3.org/TR/tabular-data-primer/#dialects>:
> A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines. [...] You can provide guidance to processors that are trying to parse those files through the `dialect` property
As is usually the case with standards, it's not that the standard doesn't exist but that people just don't even bother checking (much less caring about what it says or actually trying to follow it).
My use case is that other people can’t or won’t read anything but plain text.
Like I've recently found it much easier to deal with schema evolution in pyspark with a lot of historical CSVs than historical parquets. This is essentially a pyspark problem, but if everything works worse with your data format then maybe it's the format that's the problem. CSV parsing is always and everywhere easy, easier than the problems parquets often throw up.
The only time I'd recommend parquet is if you're setting up a pipeline with file transfer and you control both ends... but that's the easiest possible situation to be in; if your solution only works when it's a very easy problem then it's not a good solution.
- data that has predictable value types (mostly numbers and short labels would be fine), e.g. health data about a school class wouldn't involve random binary fields or unbounded user input
- data that has a predictable, manageable length, e.g. the health data of the school class wouldn't be dramatically longer than the number of students in that class
- data with a long sampling period. If you read that dataset once a week performance and latency become utterly irrelevant
- if the shape of your data is already tabular and not e.g. a graph with many references to other rows
- if the gain in human readability and compatibility for the layperson outweighs potential downsides about the format
- if you use a sane default for encoding (utf8, what else), quoting, escaping, delimiter etc.
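The sane-defaults point can be sketched with Python's stdlib csv module (one reasonable set of defaults, not the only one):

```python
import csv
import io

buf = io.StringIO()
# Boring, explicit defaults: comma delimiter, quote every field,
# plain "\n" line endings; quotes inside fields are doubled ("")
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["name", "comment"])
writer.writerow(["Alice", 'said "hi", then left'])
output = buf.getvalue()
```

Quoting everything is slightly wasteful but removes any ambiguity about embedded commas and quotes.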
Every file format is a choice, often CSV isn't the wrong one (but: very often it is).
Dumping to CSV is built into MySQL and Postgres (though MySQL has better support), is faster on export and much faster on import, doesn't fill up the file with all sorts of unneeded text, can be diffed (and triangulated by git) line by line, is human readable (eg. grepping the CSV file) and overall makes for a better solution than mysqldumping INSERTs.
In Docker, I can import millions of rows in ~3 minutes using CSV; far better than anything else I tried when I need to mock the whole DB.
I realize that the OP is more talking about using CSV as a interchange format or compressed storage, but still would love to hear from others if my love of CSV is misplaced :)
Either that or use the ASCII codes for field and row delimiters on a UTF-8 file without a BOM.
Even then you're still stuck with data-encoding issues for numbers and booleans. And that doesn't even cover all the holes I've seen in CSV in real-world use by banks and govt agencies over the years.
When I've had to deal with varying imports, I push for a scripted (JS/TS or Python) preprocessor that takes the vendor/client format and normalizes it to line-delimited JSON; then that output gets imported. It's far easier than trying to create a flexible importer application.
Edit: I've also advocated for using SQLite3 files for import, export and archival work.
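A toy version of that kind of preprocessor, assuming hypothetical vendor column names:

```python
import csv
import io
import json

def normalize(csv_text, rename=None):
    # Parse a vendor CSV and emit line-delimited JSON, mapping vendor
    # column names to canonical ones; unlisted columns pass through
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        json.dumps({rename.get(k, k): v for k, v in row.items()})
        for row in reader
    )

# "Cust No" and "Amt" are made-up vendor column names
vendor_csv = "Cust No,Amt\n42,19.99\n"
out = normalize(vendor_csv, rename={"Cust No": "customer_id", "Amt": "amount"})
```

Per-vendor quirks (date formats, locale-specific numbers) live in one small script each, instead of in one importer that must handle everything.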
I've never come across this Parquet format. Is it greppable? Gzipped CSV is. Can a regular bean-counter import Parquet into their spreadsheet software? A cursory web search indicates they can't without having a chat with IT, while SQL might be easier and XML seems pretty straightforward.
Yes, CSV is kind of brittle, because the peculiarities of a specific source are like an informal schema, but someone versed in whatever programming language makes this Parquet convenient won't have much trouble figuring out a CSV.
If you want CSV, consider a normalization step. For instance, make sure numbers have no commas and a "." decimal point. Probably quote all strings. Ensure you have a header row.
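A minimal sketch of one such normalization step, assuming US-style number formatting:

```python
def normalize_number(raw):
    # Strip thousands separators so "1,234.50" becomes "1234.50".
    # Assumes "." as the decimal point; locales that use "," as the
    # decimal separator need the opposite treatment.
    return raw.replace(",", "").strip()

cleaned = [normalize_number(v) for v in ["1,234.50", " 7 ", "0.99"]]
```
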
Probably don't reach for a CSV if:
- You have long text blobs with special characters (ie quotes, new lines, etc.)
- You can't normalize the data for some reason (ie some columns have formulas instead of specific data)
- You know that every user will always convert it to another format or import it
It's "Comma Separated Variables"; it doesn't really need any more specification than that.
These files have always imported into M$ and libre office suites without issue.
Normally, I wouldn't nitpick (hold for laughter) but this is just a perfect example of how CSV is not CSV and how it's probably impossible to support everything people call CSV, simply because of the tiny (or not so tiny) differences between formats.
- Supporting "" (single)
- Supporting newlines inside "" (oops, now you can't getline() and instead need getdelim())
- Supporting # comments (why is this even a thing?)
- Supporting multiple "" in a field
- Escaping " with "" or \"
- Length-based CSV, so all fields are seekable
It's a mess, which one's your csv?
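For delimiter confusion specifically, Python's csv.Sniffer can guess a dialect from a sample; a sketch (it won't rescue the other variations listed above):

```python
import csv
import io

sample = "a;b;c\n1;2;3\n"
# Sniffer guesses the delimiter from a sample; restricting the candidate
# set makes the guess much more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
rows = list(csv.reader(io.StringIO(sample), dialect))
```
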
I think Hanlon's razor applies in that situation.
Outside of the context of end user, I don't see any advantages in this compared to xml or json export.
Not as valuable as human-readable files.
And what kind of monstrous CSV files has this dude been working with? Data types? Compression? I just need to export 10,000 names/emails/whatevers so I can re-import them elsewhere.
Like, I guess once you start hitting GBs, an argument can be made, but this article sounds more like "CSV considered harmful", which is just silly to me.
No query params in their single endpoint and only csv exports possible.
Then add to that that Shopify, apparently the leader or whatever in shopping software, can't do better than require exactly the format they say; don't you dare come with configurable fields or mapping.
The industry is stuck in the 00s, if not 90s.
When you hit the inevitable "friends don't let friends" or "considered harmful" type of people, it's time to move quickly past them and let the actual situation dictate the best solution.
Lost me right there. It has to be human readable.
Cf a tweet I wrote in one of my past lives: https://x.com/datao/status/1572226408113389569?s=20
https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_d...?
- the sender can produce it incrementally
- the receiver can begin processing it as soon as the first byte arrives (or, more precisely, the first unescaped newline)
- gzip compression works without breaking the streaming nature
Yeah, it’s a flawed interchange format. But in a closed system over HTTP it’s brilliant.
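A rough sketch of the streaming pattern in Python (framework-agnostic; a web handler would yield these chunks as the response body):

```python
import csv
import io

def stream_csv(rows):
    # Yield one CSV line at a time; a server can send each chunk as
    # soon as it is produced, and the client can parse line by line
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)

chunks = list(stream_csv([["id", "name"], [1, "a"], [2, "b"]]))
```

Wrapping the iterator in a gzip stream preserves the incremental property, since gzip compresses and decompresses as a stream too.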
Yes CSV is a pain in many regards, but many of the difficulties with it arise from the fact that anybody can produce it with very little tool support - which is also the reason it is so widely used.
Recommending a decidedly niche format as an alternative is not going anywhere.
Most of the time you know the source pretty well and can simply ask about the value range.
Obviously there are edges, but they're edges by nature. And like you say, you usually know the source pretty well.
Also is backwards-compatible with most CSV parsers.
Don't come here to work if you don't want to drown in nitpicking and meaningless debates like this.
File formats like CSV will outlast religion.
Also, csv has a very pythonic interface compared to parquet; in most cases, if I can fit the file in memory, I'll go with CSV.
Try telling 45 year old salesman he needs to export his data in parquet. "Why would I need to translate it to French??"
I feel like I'm pretty up to date on stuff, and I've never heard of parquet or seen in as an option, in any software, ever.
Also some UPCs start with 0
Worst is when Excel saves the scientific notation back to the CSV, overwriting the correct number.
2. Enable the legacy Text Import Wizard as per [0]
3. Go to Data -> Get Data -> Legacy Wizards -> From Text (Legacy)
4. Set config based on your CSV file, typically select "Delimited" and "My data has headers" enabled
5. Click Next and pick the delimiter, typically "Comma"
6. Click Next and click the columns with UPCs, select "Text" in the "Column data format" area
7. Click Finish
(I'm not saying this is great, just sharing how to do it in case you don't know)
[0] https://professor-excel.com/import-csv-text-files-excel/
edit: you can also set up a PowerQuery query that will always open some CSV at some path and apply this config, but I don't want to have anything to do with PowerQuery, sorry.
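Outside Excel, the leading-zero problem disappears if you parse with a tool that treats everything as text by default; a sketch with Python's stdlib ("upc" is an illustrative column name):

```python
import csv
import io

# The csv module leaves every field as a string, so leading zeros
# survive; no scientific-notation mangling can happen
data = "upc,qty\n0012345678905,3\n"
rows = list(csv.DictReader(io.StringIO(data)))
upc = rows[0]["upc"]
```
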
I wish TSVs were more popular though. Tabs appear less frequently than commas in data.
My biggest recommendation is to avoid Excel! It will mangle your data if you let it.
Screams the Rustacean from his ivory tower
The alternative path involves parsing csv in order to turn it into a different csv, turning json into yaml and so forth. Parsing "human readable" formats is terrible relative to parsing XML. Go with the unambiguous source format and turn it into whatever is needed in various locations as required.