But, if you pick the right parser and generator, then you're ok with using it.
e.g. the example from my comment on the last CSV discussion (https://news.ycombinator.com/item?id=28223719)
What variant is this:
1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?Is this a CSV file without quoting? Then it's
Here is a string "" that does stuff
Or is it a CSV file with double quote escaping? Then it's Here is a string " that does stuffStill probably worth using a library, but it isn't a source of problems in my experience
In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it convers big numeric identifiers to exponents, and more. Even merely opening a RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the de file.
If humans never used Excel for CSV, it would be a viable format. At the same time in most cases where humans aren't in the loop (machine-to-machine communications), there's better formats. You could spec "RFC4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.
I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.
However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.
That said, I agree with you Excel is problematic as an input source. Usually the problems are the humans who touch the data in excel, but what I’ve found is the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV.) Exponents, trimming leading zeros, etc. as you say. But if the data is confirmed good in excel before being emitted, the CSV it emits is decent.
Counterexamples welcome.
Parquet is great, don’t get me wrong.
To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".
I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?
df = pandas.read_parquet(‘foo.parquet’)
df.to_csv(‘foo.csv’)
df.to_json(‘foo.json’)
(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.
> Is it a limitation of the format itself?
I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).
This is the opposite of my experience.
> To read a parquet file in Python, you need Apache Arrow and Pandas.
Or DuckDB.
import duckdb
df = duckdb.query("select * from 'a.parquet'")
Want to look inside a Parquet file? Use Visidata. vd a.parquet
> I remember dealing with Parquet file for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is is a limitation of the format itself?Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.
import pandas as pd
# output to CSV
pd.read_parquet("a.parquet").to_csv("a.csv")
# output to JSON (choose from any number of orientations)
pd.read_parquet("a.parquet").to_json(orient="table")
# read rows
for row in pd.read_parquet("a.parquet").itertuples():
print(row)if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands
if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.
this is the problem and it's why xml/json exist.
in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.
Control characters. Like ctrl-A and stuff. Almost nobody has them in their data.
I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...
The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841
Can't you just do this?
{
"columns": ["col1", "col2", "col3"],
"data": [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
}
That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may be not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well using a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error prone.
What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.
Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.
CSV in general is problematic as there is no standard (RFC 4180 is not). In certain contexts this surely can be good solution but definitelly not good in general scenario.
From the RFC itself:
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.Wouldn't that be impossible, given that parsers have to accept all kind of bizarro CSV flavors? Maybe more importantly, do you know of a single program or single CSV library that doesn't support reading or writing CSV as defined by the RFC?
* Uncategorised
* Historic
* Experimental
* Informational
* Best Current Practice
* Proposed Standard
* Draft Standard
* Internet Standard
Once an RFC reaches "Internet Standard" it is given a special designation, e.g. STD-63 is the standards designation for RFC-3629: UTF-8 < https://www.rfc-editor.org/info/std63 >. See https://www.rfc-editor.org/standardsBeing an "Internet Standard" is kinda special, but not especially so. For example, IMAP4, originally specified in RFC-3501 in March 2003, updated many times since, and revised in RFC-9051 in August 2021, is still a "Proposed Standard" without an STD designation, nearly 20 years and dozens of interoperable implementations later.
"Rough consensus and running code" is how things get done.
RFC-4180 is plenty good enough a "standard" for people to decide to interoperate over. They just have to decide to do so.
(Note also that HTML5 is not an "Internet Standard" according to the IETF et al. The last version to get an RFC was HTML 2 in RFC-1866, designated "Historic". And interoperability was an issue for a while with later versions of HTML during the "Best viewed in Internet Explorer/Netscape Navigator" wars. To get interoperability like we eventually did, you don't need an "Internet Standard"; you just need implementers who want to interoperate, and are willing to favour it over lock-in, and even over strict backwards-compatibility.)
(Also, the "and nothing else" clause in your comment confuses me. Why not support other formats/variants also? "Be liberal in what you accept" is certainly something that you probably want to avoid if you're designing a new format/protocol that no-one else is using yet, but if you're working with a decades-old format that was traditionally poorly-specified, with millions of documents out in the wild, it's probably the best way to allow existing users to move forward.)
Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes.) If you really want a super-small and fast serialization, that's what protobuf is for.
`r#csv2md('filanme.csv')`
The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)Using JSON would be possible, but it's not as easy to convert into a Markdown table.
[0]: https://github.com/DaveJarvis/keenwrite
[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L3...
If you want streaming: https://jsonlines.org/
It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.
Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!
Excel compatibility really sux. And Excel is most used one by large. You really can't double click it effectivelly, as everything will be shown as generic type, so you have to mess up with wizard which is also half baked. I have to create tutorials for that for each service using it.
[
["productId", "quantity", "customerId"],
["5710031efdfe", 1, "8fe96b88"],
["479cd9744e5c", 2, "526ba6f5"]
]
This style also works well with jsonlines a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)It's similar to JSON Vs INI files for config files.
On a different note I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something the split string on that. Much simpler if you're in control of the data.
Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{“foo”: “10”}` instead of `{“foo”: 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letter the incorrect types run through your system until it hits a type error at runtime somewhere?
No idea really, but if you're using a JSON parsing library then that is going to automatically convert for the data types. Which, provided you trust the data, saves you a job.
The author didn't think of any examples with even a bit more complexity. If you have 2-level object nesting, now what?
If size is really the issue but you still want schema enforcement protobuf is the way to go.
(Incidentally, my day job is building a spreadsheet importer.)
As the post acknowledges right about where you stopped skimming.
And you almost always have/need a nested structure, no?
No.
As I noted in my own comment. Ironic to accuse me of skimming the original post when you couldn't even read my two sentences.
Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?
0x1E Record Separator
0x1F Unit SeparatorYou also have 2 more the group separator and the file separator. so you could represent a tree with it.
The size complaint is overblown, as repeated fields are compressed away.
As other folks rightfully commented, csv is a mine field. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.
What people should consider is using formats like Avro or Parquet that carry their schema with them so the data can be loaded and analyzed without have to manually deal with column meaning.
If somebody asked me to support this format after you'd left the company I'd quit on the spot. This frankenformat is 100% premature optimization, non standardised, requires custom parsers (which are potentially inefficient and may negate the network performance from having to parse both json and csv) and is potentially very difficult to maintain and debug (no syntax highlighters or rest like posting tools)
Just either use GRPC or JSON with regular network level gzip encoding.
["hello",5,false,1,2,2.334,null]
["world",12,true,1,2,2.334,null]
Parquet is a columnar format that is compressed, typed, efficient for columnar queries, append-friendly (though not rewritable), and is a natively supported format for Apache Spark. Parquet libraries are now widely available for most languages (didn't used to be the case, but now they are). I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.
At work, the canonical format is TSV (tab separated values) which despite being human-readable, is huge, inefficient to query and does not support data types.
When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller and less prone to type errors. Because columnar formats like Parquets are indexed, I can do complex operations like joins, window functions, aggregations on them in a performant way, while any similar operation on TSV files will trigger a table scan each time.
I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.
In my country the decimal separator is comma, instead of punctuation. This causes problems when importing and exporting with this "format".
Just few weeks ago I had fun times working with API returning CSV in unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah and i do love when CSV is missing headers and someone adds data into middle.
Of course some of these issues can be avoided by doing the things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate results..
Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with the BOM characters up front or some French accents represented as ? because of incorrect encoding parsing / saving.
At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If human produce them with Excel/LibreOffice, I never have issues on the ingestion end.
csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.
1. https://github.com/nathants/bsv/tree/55c90797283f5e37f91bbb6...
U+241D, U+241E, U+241F
SOH (U+01), US (U+1F), RS (U+1E), GS (U+1D), FS (U+1C)
[\n
{ ... },\n
{ ... },\n
{ ... },\n
...\n
]\nWith this simple trick you can stream easily..