But if you pick the right parser and generator, then you're OK using it.
e.g. the example from my comment on the last CSV discussion (https://news.ycombinator.com/item?id=28223719)
What variant is this:
1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?

Is this a CSV file without quoting? Then it's

Here is a string "" that does stuff

Or is it a CSV file with double-quote escaping? Then it's

Here is a string " that does stuff

Still probably worth using a library, but it isn't a source of problems in my experience.
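As a minimal illustration of the two readings, Python's csv module happens to treat quotes inside an unquoted field literally, while inside a quoted field it interprets "" as a single escaped quote:

```python
import csv
import io

# Unquoted field: the doubled quote is kept verbatim.
unquoted = '1,5,Here is a string "" that does stuff,2021-1-1'
row = next(csv.reader(io.StringIO(unquoted)))
print(row[2])  # Here is a string "" that does stuff

# Quoted field: "" is interpreted as one escaped quote character.
quoted = '1,5,"Here is a string "" that does stuff",2021-1-1'
row = next(csv.reader(io.StringIO(quoted)))
print(row[2])  # Here is a string " that does stuff
```

Two parsers that make different choices here will silently disagree about the data.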
In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it converts big numeric identifiers to exponents, and more. Even merely opening an RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the same file.
If humans never used Excel for CSV, it would be a viable format. At the same time, in most cases where humans aren't in the loop (machine-to-machine communication), there are better formats. You could spec "RFC4180 CSV" and hope no developer just sees "CSV" and assumes they understand it. Or specify something like a JSON streaming format and avoid a whole lot of headache.
I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.
However, when I tested this for myself a few years back, Excel's output to my surprise was RFC4180 or darn near it (it might use CRLF rather than LF?). It emitted commas and quotes the same way as the RFC for all the test cases I checked.
That said, I agree with you that Excel is problematic as an input source. Usually the problems are the humans who touch the data in Excel, but what I've found is that the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV): exponents, trimming leading zeros, etc., as you say. But if the data is confirmed good in Excel before being emitted, the CSV it emits is decent.
Counterexamples welcome.
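For what it's worth, Python's csv module defaults line up with that RFC 4180-ish behavior too: CRLF line endings, minimal quoting, and doubled quotes. A quick sketch:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: CRLF, minimal quoting
writer.writerow(['plain', 'has,comma', 'has"quote'])

# Fields containing commas or quotes get quoted; quotes are doubled.
print(repr(buf.getvalue()))  # 'plain,"has,comma","has""quote"\r\n'
```

So a file written this way round-trips cleanly through any RFC4180-compliant reader; it's the human-in-Excel step that introduces the mangling.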
Parquet is great, don’t get me wrong.
To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".
I remember dealing with Parquet files for a job a while back, and this same question came up: why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to CSV/JSON or read rows? Is it a limitation of the format itself?
If you choose pipe, OK, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you can't store Unix commands.
If you choose tab, OK, now people will get confused when they try to edit the text file and replace tabs with spaces, and now you have trouble putting code snippets into data fields because they contain tabs.
This is the problem, and it's why XML/JSON exist.
In my particular domain, tab-separated works pretty well, but in the general context of the world at large, I feel like JSON has reasons it exists.
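A quick sketch of the failure mode: a naive tab split breaks as soon as a field contains a tab, while a quoting-aware writer/reader pair survives it:

```python
import csv
import io

# Naive approach: a tab inside a field silently shifts the columns.
naive = 'name\tcode\twith tab'.split('\t')
print(naive)  # 3 fields where only 2 were intended

# Quoting-aware approach: csv quotes the field containing the tab.
buf = io.StringIO()
csv.writer(buf, delimiter='\t').writerow(['name', 'code\twith tab'])
row = next(csv.reader(io.StringIO(buf.getvalue()), delimiter='\t'))
print(row)  # ['name', 'code\twith tab']
```

Of course, once you need quoting you've reinvented the parsing complexity the delimiter swap was supposed to avoid.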
I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...
The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841
Can't you just do this?
{
"columns": ["col1", "col2", "col3"],
"data": [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
}
That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.

The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well use a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error-prone.
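And that column-oriented envelope converts to per-row dicts in one line; a small sketch:

```python
import json

doc = json.loads('''{
  "columns": ["col1", "col2", "col3"],
  "data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
}''')

# Pair each data row with the shared header to get record-style dicts.
records = [dict(zip(doc["columns"], row)) for row in doc["data"]]
print(records[0])  # {'col1': 1, 'col2': 2, 'col3': 3}
```

So you keep CSV's "header once, then rows" compactness while still getting JSON's typed values and escaping for free.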
What I've found to work well is to just percent-encode your delimiter, the newline character, and the '%' character. Basically every language has utilities for this.
Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.
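One way to sketch that scheme with the standard library (urllib's percent-encoding handles the delimiter, newlines, and '%' in one pass; keeping spaces unencoded is a readability choice here):

```python
from urllib.parse import quote, unquote

fields = ['plain', 'has|pipe', 'line\nbreak', '100%']

# Encode '%', the pipe delimiter, and newlines; spaces stay readable.
encoded = '|'.join(quote(f, safe=' ') for f in fields)
print(encoded)  # plain|has%7Cpipe|line%0Abreak|100%25

# Decoding is a plain split followed by unquoting each field.
decoded = [unquote(f) for f in encoded.split('|')]
print(decoded == fields)  # True
```

Since the delimiter can never appear inside an encoded field, splitting is trivial and unambiguous.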
CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts this surely can be a good solution, but definitely not in the general scenario.
From the RFC itself:
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.

Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes). If you really want a super-small and fast serialization, that's what protobuf is for.
`r#csv2md('filename.csv')`
The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle, because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)

Using JSON would be possible, but it's not as easy to convert into a Markdown table.
[0]: https://github.com/DaveJarvis/keenwrite
[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L3...
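The CSV-to-Markdown step itself is tiny; here's a hypothetical helper in Python (not KeenWrite's actual R implementation) that treats the first row as the header:

```python
import csv
import io

def csv_to_markdown(text):
    # Hypothetical converter: first CSV row becomes the table header.
    rows = list(csv.reader(io.StringIO(text)))
    header, *body = rows
    lines = ['| ' + ' | '.join(header) + ' |',
             '|' + '---|' * len(header)]
    lines += ['| ' + ' | '.join(r) + ' |' for r in body]
    return '\n'.join(lines)

print(csv_to_markdown('a,b\n1,2'))
# | a | b |
# |---|---|
# | 1 | 2 |
```

The same trick is much fiddlier with arbitrary JSON, since nesting has no natural tabular shape.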
If you want streaming: https://jsonlines.org/
It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.
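A rough sketch of that idea using the standard library's raw_decode (operating on an in-memory string rather than a true byte stream, for brevity):

```python
import json

def iter_json_array(text):
    # Yield the top-level elements of a JSON array one at a time,
    # without materializing the whole array.
    decoder = json.JSONDecoder()
    pos = text.index('[') + 1
    while True:
        while text[pos] in ' \t\r\n,':
            pos += 1  # skip inter-element whitespace and commas
        if text[pos] == ']':
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value

print(list(iter_json_array('[1, {"a": 2}, [3]]')))  # [1, {'a': 2}, [3]]
```

A real streaming version would refill a buffer from the file as raw_decode consumes it, but the element-at-a-time structure is the same.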
Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!
Excel compatibility really sucks. And Excel is by far the most used one. You really can't double-click a CSV effectively, as everything will be shown as a generic type, so you have to mess with the import wizard, which is also half-baked. I have to create tutorials for that for each service using it.
[
["productId", "quantity", "customerId"],
["5710031efdfe", 1, "8fe96b88"],
["479cd9744e5c", 2, "526ba6f5"]
]
This style also works well with jsonlines, which a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)

It's similar to JSON vs INI files for config files.
On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.
Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{"foo": "10"}` instead of `{"foo": 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letting the incorrect types run through your system until they hit a type error at runtime somewhere?
No idea really, but if you're using a JSON parsing library then it is going to automatically convert the data types. Which, provided you trust the data, saves you a job.
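A minimal sketch of the "validate up front" option (check_types is a hypothetical helper, not any particular schema library):

```python
def check_types(obj, schema):
    # Hypothetical validator: schema maps key -> expected Python type.
    for key, expected in schema.items():
        value = obj.get(key)
        if not isinstance(value, expected):
            raise TypeError(
                f'{key!r}: expected {expected.__name__}, '
                f'got {type(value).__name__}')
    return obj

check_types({'foo': 10}, {'foo': int})      # passes
# check_types({'foo': '10'}, {'foo': int})  # raises TypeError
```

Real-world schema validation (e.g. via a JSON Schema library) adds coercion, nesting, and better error messages, but the principle is the same: the expected types live in one declared place instead of being discovered at some random runtime crash.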
The author didn't think of any examples with even a bit more complexity. If you have 2-level object nesting, now what?
If size is really the issue but you still want schema enforcement, protobuf is the way to go.
As the post acknowledges right about where you stopped skimming.
And you almost always have/need a nested structure, no?
No.
As I noted in my own comment. Ironic to accuse me of skimming the original post when you couldn't even read my two sentences.
Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?
0x1E Record Separator
0x1F Unit Separator

You also have two more, the group separator and the file separator, so you could represent a tree with them.
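A sketch of a two-level "tree" using those ASCII control characters as nested delimiters:

```python
# ASCII control characters reserved for exactly this purpose.
FS, GS, RS, US = '\x1c', '\x1d', '\x1e', '\x1f'

# Two groups; each group holds records; each record holds units.
data = GS.join([
    RS.join([US.join(['a', 'b']), US.join(['c', 'd'])]),
    RS.join([US.join(['e', 'f'])]),
])

# Parsing is plain nested splitting -- no quoting rules needed,
# as long as the payload never contains these control characters.
parsed = [[rec.split(US) for rec in grp.split(RS)]
          for grp in data.split(GS)]
print(parsed)  # [[['a', 'b'], ['c', 'd']], [['e', 'f']]]
```

The catch is the same one as with pipes and tabs: you must guarantee the data itself never contains the separators, since there is no escaping convention.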
The size complaint is overblown, as repeated fields are compressed away.
As other folks rightfully commented, CSV is a minefield. One should assume every CSV file is broken in some way. The post also doesn't enumerate any of the downsides of CSV.
What people should consider is using formats like Avro or Parquet that carry their schema with them, so the data can be loaded and analyzed without having to manually deal with column meaning.
If somebody asked me to support this format after you'd left the company, I'd quit on the spot. This frankenformat is 100% premature optimization: non-standardised, requires custom parsers (which are potentially inefficient and may negate the network performance gains, from having to parse both JSON and CSV), and is potentially very difficult to maintain and debug (no syntax highlighters or REST-client-style posting tools).
Just use either gRPC or JSON with regular network-level gzip encoding.
["hello",5,false,1,2,2.334,null]
["world",12,true,1,2,2.334,null]
Parquet is a columnar format that is compressed, typed, efficient for columnar queries, append-friendly (though not rewritable), and is a natively supported format for Apache Spark. Parquet libraries are now widely available for most languages (didn't used to be the case, but now they are). I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.
At work, the canonical format is TSV (tab-separated values), which, despite being human-readable, is huge, inefficient to query, and does not support data types.
When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller, and less prone to type errors. Because columnar formats like Parquet are indexed, I can do complex operations like joins, window functions, and aggregations on them in a performant way, while any similar operation on TSV files will trigger a table scan each time.
I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.
In my country the decimal separator is a comma instead of a period. This causes problems when importing and exporting with this "format".
Just a few weeks ago I had fun times working with an API returning CSV in an unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah, and I do love it when a CSV is missing headers and someone adds data into the middle.
Of course some of these issues can be avoided by doing things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate the results.
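A sketch of why the decimal comma bites: Python's float() (like most parsers) assumes a period, so comma-decimal data needs an explicit, and fragile, normalization step (normalize_decimal here is a hypothetical helper):

```python
def normalize_decimal(s):
    # Hypothetical helper for comma-decimal locales. Fragile: breaks
    # if the comma is also the field delimiter or a thousands separator.
    return float(s.replace(',', '.'))

print(float('3.14'))              # 3.14
print(normalize_decimal('3,14'))  # 3.14

try:
    float('3,14')  # naive parse of locale-formatted data fails
except ValueError:
    print('ValueError')
```

And in a comma-delimited file, "3,14" is indistinguishable from two separate fields, which is exactly the ambiguity the parent comment is describing.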
Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with BOM characters up front, or some French accents represented as ? because of incorrect encoding when parsing/saving.
At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If humans produce them with Excel/LibreOffice, I never have issues on the ingestion end.
csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.
1. https://github.com/nathants/bsv/tree/55c90797283f5e37f91bbb6...
U+241D, U+241E, U+241F
SOH (U+0001), US (U+001F), RS (U+001E), GS (U+001D), FS (U+001C)
[
{ ... },
{ ... },
{ ... },
...
]

With this simple trick you can stream easily.
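A sketch of a writer and a line-by-line reader for that one-element-per-line layout (both are hypothetical helpers):

```python
import io
import json

def write_streamable(rows, out):
    # One array element per line so consumers can process line-by-line.
    out.write('[\n')
    for i, row in enumerate(rows):
        if i:
            out.write(',\n')
        out.write(json.dumps(row))
    out.write('\n]\n')

def read_streamable(lines):
    # Skip the bracket lines, strip trailing commas, parse each line alone.
    for line in lines:
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue
        yield json.loads(line)

buf = io.StringIO()
write_streamable([{'a': 1}, {'b': 2}], buf)
print(list(read_streamable(buf.getvalue().splitlines())))
# [{'a': 1}, {'b': 2}]
```

The file is still valid JSON for ordinary parsers, while the line-oriented reader never holds more than one element in memory.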