Consider Using CSV

(jfhr.me)

69 pointsjfhr3y ago110 comments

Pinus3y ago

CSV looks deceptively simple. It is far too easy to just write(','.join(whatever)), which sort of works, until it doesn’t, and then someone, sometimes I, has to sort out the resulting mess. PLEASE use a proper CSV library (Python comes with a CSV module in the standard library), or at least implement the entire format according to the RFC from the outset, even if you think you won’t need it!
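
A minimal sketch of the difference, using Python's standard `csv` module. The sample row is made up for illustration: its second field contains a comma, a double quote and a line break, which a naive join leaves unescaped.

```python
import csv
import io

# Hypothetical row: the second field contains a comma, a double quote and a newline.
row = ["42", 'He said "hi", then left\nfor the day', "2021-01-01"]

# Naive approach: the embedded comma and newline corrupt the record structure.
naive = ",".join(row)

# Library approach: the field is quoted and the embedded quotes are doubled.
buf = io.StringIO()
csv.writer(buf).writerow(row)
proper = buf.getvalue()

# Reading the library output back recovers the original fields.
parsed = next(csv.reader(io.StringIO(proper, newline="")))
```

The naive line now appears to have four fields spread over two lines, while the library output round-trips to the original three fields.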

chaps3y ago

Oh yes. CSVs are deceptively challenging, especially if your use case is going from Excel files to CSV. Excel will happily convert a worksheet to CSV, but it's a naive conversion. Headers that start on line 3, multi-line headers, inconsistent column counts, etc. It adds up really quickly!

zem3y ago

I've also run into issues where I wrote some code that worked with csv input, and told users they could just export their data from excel. turns out excel doesn't export in utf-8 by default, we had some weird issues until we figured that out.

MrJohz3y ago

Another issue is passing those CSV files across international borders - a CSV file that works in the UK (commas as separators and a decimal point) may not be readable in Germany (semicolons as separators and decimal commas) without some configuration.
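
The "some configuration" part might look like the following sketch with Python's `csv` module. The sample data, the column names and the `read_prices` helper are all hypothetical, and thousands separators are deliberately not handled.

```python
import csv
import io

# Hypothetical export from a German-locale spreadsheet: ';' separates fields
# and ',' is the decimal mark.
german_export = "name;price\r\nWidget;1,50\r\nGadget;12,75\r\n"

def read_prices(text, delimiter=";", decimal=","):
    """Parse rows and convert the price column to float using the given decimal mark."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return {r["name"]: float(r["price"].replace(decimal, ".")) for r in reader}

prices = read_prices(german_export)
```

The catch is that the delimiter and decimal mark have to be known out of band; nothing in the file itself declares them.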

kasajian3y ago

This is a matter of developer education. The correct way to create and parse CSV files is to use a third-party library. CSV files can get complicated. A field in a CSV can contain commas and quotes. In some cases, a single field can contain a line-feed, and you'll need to ensure the parser you use supports that. This would allow an entire CSV file to be embedded inside a single field of another CSV file. At a minimum, a parser must support Excel's default parser logic.

But, if you pick the right parser and generator, then you're ok with using it.

Macha3y ago

Right, but if you're picking CSV, you likely expect to interoperate with a provider that's not yourself. And then there's no way a parser can handle all CSV formats in the wild.

e.g. the example from my comment on the last CSV discussion (https://news.ycombinator.com/item?id=28223719)

What variant is this:

    1,5,Here is a string "" that does stuff,2021-1-1

What is the value of the third column?

Is this a CSV file without quoting? Then it's

    Here is a string "" that does stuff

Or is it a CSV file with double quote escaping? Then it's

    Here is a string " that does stuff
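
For comparison, here is a sketch of what a writer with RFC 4180 style quoting emits for each of the two candidate values, using the default dialect of Python's `csv` module (the `to_csv_line` helper is made up for illustration). Neither value produces the unquoted line above, which is why the reader has to guess.

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one record with the csv module's default (Excel-style) dialect."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue().rstrip("\r\n")

two_quotes = to_csv_line(["1", "5", 'Here is a string "" that does stuff', "2021-1-1"])
one_quote = to_csv_line(["1", "5", 'Here is a string " that does stuff', "2021-1-1"])

# Python's own reader treats quotes in the middle of an unquoted field literally.
ambiguous = '1,5,Here is a string "" that does stuff,2021-1-1'
as_read = next(csv.reader([ambiguous]))[2]
```

Here `two_quotes` contains four consecutive quote characters inside a quoted field and `one_quote` contains two, so a strict producer never leaves the question open. Python's reader happens to pick the "no quoting" interpretation for the ambiguous line; other parsers may not.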

brundolf3y ago

Yeah, but this is less of a problem if it's an internal API. You can stick to a stricter subset of the standard, and/or only handle the types of column values that you actually need

Still probably worth using a library, but it isn't a source of problems in my experience

ndsipa_pomu3y ago

As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems are when processing CSVs produced elsewhere which might not enclose text fields with quotes and thus have issues with data that includes commas and multi-line data.

gregmac3y ago

There is a spec (RFC 4180 [1]) but it's definitely not widely followed. Worse, for a lot of data there are no problems for potentially years, until your numbers get too big or the first time a quote or comma gets in the data.

In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it converts big numeric identifiers to exponents, and more. Even merely opening an RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the file.

If humans never used Excel for CSV, it would be a viable format. At the same time in most cases where humans aren't in the loop (machine-to-machine communications), there's better formats. You could spec "RFC4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.

[1] https://www.ietf.org/rfc/rfc4180.txt

gregw23y ago

Regarding excel and rfc4180…

I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.

However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.

That said, I agree with you Excel is problematic as an input source. Usually the problems are the humans who touch the data in excel, but what I’ve found is the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV.) Exponents, trimming leading zeros, etc. as you say. But if the data is confirmed good in excel before being emitted, the CSV it emits is decent.

Counterexamples welcome.

CharlesW3y ago

Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.

mason553y ago

I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.

Parquet is great, don’t get me wrong.

wenc3y ago

Avro is a row-based format that supports schema evolution, but unfortunately it is not a very popular format in general and there's much less support for it than Parquet. My team evaluated several formats (ORC, Parquet, Avro) but landed on Parquet due to columnar performance and wider support, though you're right, it's less performant for row-wise operations (but not by that much). If you really want row-wise performance, better to go with sqlite than Avro.

scrollaway3y ago

Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".

I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?

jointpdf3y ago

We data scientists are well-known for our exclusive mastery of data wrangling arcana, like…

  import pandas

  df = pandas.read_parquet('foo.parquet')
  df.to_csv('foo.csv')
  df.to_json('foo.json')

(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?

CharlesW3y ago

> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.

> Is it a limitation of the format itself?

I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).

wenc3y ago

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    df = duckdb.query("select * from 'a.parquet'").df()  # .df() materializes a pandas DataFrame

Want to look inside a Parquet file? Use Visidata.

    vd a.parquet

> I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?

Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.

   import pandas as pd

   # output to CSV
   pd.read_parquet("a.parquet").to_csv("a.csv") 

   # output to JSON (choose from any number of orientations)
   pd.read_parquet("a.parquet").to_json(orient="table")

   # read rows
   for row in pd.read_parquet("a.parquet").itertuples():
       print(row)

fbdab1033y ago

I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.

fatneckbeardz3y ago

but which delimiter.

if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands

if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.

this is the problem and it's why xml/json exist.

in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.

elcritch3y ago

Well, the obvious solution would be ASCII 0x1D (Group Separator)! Except, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.

civopsec3y ago

Both pipe and tab are infinitely better for so-called human-readable data compared to comma. Comma doesn’t even work well for numbers since some locales use comma as the decimal separator. And a data format can’t be “human-readable” if you’re not allowed to write numbers in the way that you’re used to write them.

ndsipa_pomu3y ago

Pipes are quite common, but for tricky data, I'd recommend ¬. It's on most keyboards and I can't think of any other use of it.

toast03y ago

> but which delimiter

Control characters. Like ctrl-A and stuff. Almost nobody has them in their data.

fbdab1033y ago

In the context of an API, so long as you are using a "real" CSV library, a lot of those inconsistencies do not appear. Problems happen when you have to interface with humans and desktop software (i.e. Excel), which has its own rules.

hermitcrab3y ago

Yes, CSV is superior to JSON for tabular data, but has its own issues. One issue is that the standard is not consistently applied. Another is that the approach to escaping makes it hard to parse a CSV file with multiple threads. You have to parse the entire file before you can be sure which " characters escape other characters.

I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...

The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841
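
A small sketch of why chunking on line breaks is unsafe, using Python's `csv` module on made-up data in which one quoted field contains a line break:

```python
import csv
import io

# Header plus two records; the first record has a line break inside a quoted field.
data = 'id,comment\r\n1,"line one\nline two"\r\n2,plain\r\n'

# Naive chunking on newlines sees an extra "row" and splits the quoted field.
naive_rows = data.splitlines()

# A real parser tracks quote state and returns the header plus two records.
parsed_rows = list(csv.reader(io.StringIO(data, newline="")))
```

A worker that starts reading at an arbitrary line boundary cannot tell whether it is inside or outside a quoted field without knowing the quote state of everything before it.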

_dain_3y ago

>CSV is superior to JSON for tabular data

Can't you just do this?

  {
    "columns": ["col1", "col2", "col3"],
    "data": [
               [1,      2,      3],
               [4,      5,      6],
               [7,      8,      9]
    ]
  }

That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.

hermitcrab3y ago

You can. But I don't see how that is superior to the equivalent CSV.

II2II3y ago

> The most common problems are when processing CSVs produced elsewhere [...]

The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well use a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error prone.

ndsipa_pomu3y ago

It tends to be a lowest common denominator or a choice between CSV and Excel documents which are trickier to automate.

Kamq3y ago

Which works until you have a quote in your data. It also prevents chunking out the file to process in parallel since you could have new lines between the quotes.

What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.

Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.
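
The percent-encoding idea could be sketched like this in Python. The helper names and the choice of a comma as delimiter are assumptions for illustration; the standard library's `urllib.parse.quote` and `unquote` offer a more general version of the same escaping.

```python
# Hypothetical helpers: escape only '%', the delimiter and line breaks.
# '%' must be encoded first and decoded last so escapes stay unambiguous.
ESCAPES = [("%", "%25"), (",", "%2C"), ("\n", "%0A"), ("\r", "%0D")]

def encode_field(value):
    for raw, escaped in ESCAPES:
        value = value.replace(raw, escaped)
    return value

def decode_field(value):
    for raw, escaped in reversed(ESCAPES):
        value = value.replace(escaped, raw)
    return value

def encode_row(fields):
    return ",".join(encode_field(f) for f in fields)

def decode_row(line):
    return [decode_field(f) for f in line.split(",")]
```

Because an encoded row never contains a raw comma or line break, a file can be split on newlines and each line split on commas without tracking any quote state.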

majkinetor3y ago

With gzip on web server the difference is not important at all.

CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts this can surely be a good solution, but definitely not in the general scenario.

CharlesW3y ago

As Wikipedia puts it, "CSV is widely used to refer to a large family of formats that differ in many ways". If there's a canonical standard, it appears to be RFC4180: https://www.rfc-editor.org/rfc/rfc4180

majkinetor3y ago

It appears so, but it's not. I have not found a single program so far that conforms only to this RFC and nothing else.

From the RFC itself:

   Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

CharlesW3y ago

> I have not found a single program so far that conforms only to this RFC and nothing else.

Wouldn't that be impossible, given that parsers have to accept all kinds of bizarro CSV flavors? Maybe more importantly, do you know of a single program or single CSV library that doesn't support reading or writing CSV as defined by the RFC?

Karellen3y ago

An "Internet Standard" is just a designation that has been given to an RFC that has been blessed in a certain way. See https://www.rfc-editor.org/ for more details, but the set of designations is:

    * Uncategorised
    * Historic
    * Experimental
    * Informational
    * Best Current Practice
    * Proposed Standard
    * Draft Standard
    * Internet Standard

Once an RFC reaches "Internet Standard" it is given a special designation, e.g. STD-63 is the standards designation for RFC-3629: UTF-8 < https://www.rfc-editor.org/info/std63 >. See https://www.rfc-editor.org/standards

Being an "Internet Standard" is kinda special, but not especially so. For example, IMAP4, originally specified in RFC-3501 in March 2003, updated many times since, and revised in RFC-9051 in August 2021, is still a "Proposed Standard" without an STD designation, nearly 20 years and dozens of interoperable implementations later.

"Rough consensus and running code" is how things get done.

RFC-4180 is plenty good enough a "standard" for people to decide to interoperate over. They just have to decide to do so.

(Note also that HTML5 is not an "Internet Standard" according to the IETF et al. The last version to get an RFC was HTML 2 in RFC-1866, designated "Historic". And interoperability was an issue for a while with later versions of HTML during the "Best viewed in Internet Explorer/Netscape Navigator" wars. To get interoperability like we eventually did, you don't need an "Internet Standard"; you just need implementers who want to interoperate, and are willing to favour it over lock-in, and even over strict backwards-compatibility.)

(Also, the "and nothing else" clause in your comment confuses me. Why not support other formats/variants also? "Be liberal in what you accept" is certainly something that you probably want to avoid if you're designing a new format/protocol that no-one else is using yet, but if you're working with a decades-old format that was traditionally poorly-specified, with millions of documents out in the wild, it's probably the best way to allow existing users to move forward.)

briHass3y ago

That was my first thought: JSON is highly amenable to compression due to the repetition this blog post is complaining about. It's a good lesson for junior devs: if you find yourself thinking about saving bits and bytes with custom protocols, you need to pull out of the rabbit hole and find the existing solution to your problem.

Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes.) If you really want a super-small and fast serialization, that's what protobuf is for.

thangalin3y ago

CSV is also great for importing external data into documents. My text editor, KeenWrite[0], includes an R engine and a CSV-to-Markdown function[1]. This means you can write the following in a plain text R Markdown document:

    `r#csv2md('filename.csv')`

The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)

Using JSON would be possible, but it's not as easy to convert into a Markdown table.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L3...

[2]: https://i.ibb.co/6FLXKsD/keenwrite-csv.png

[3]: https://i.ibb.co/47h6zNx/keenwrite-table.png

gugagore3y ago

The only reason, in my eyes, to use CSV is to have easy interoperability with spreadsheet software.

If you want streaming: https://jsonlines.org/

Karellen3y ago

JSON lines looks kinda interesting, but the newline-delimited thing seems weird.

It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.

Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!
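
A rough sketch of the element-at-a-time idea, built on `json.JSONDecoder.raw_decode` from Python's standard library. The function name is made up, and for brevity it walks an in-memory string rather than refilling a buffer from a stream.

```python
import json

def iter_array_elements(text):
    """Yield the elements of a top-level JSON array one at a time.

    Sketch only: works on an in-memory string; a real streaming parser would
    refill the buffer from a file or socket as it goes.
    """
    decoder = json.JSONDecoder()

    def skip_ws(i):
        while i < len(text) and text[i] in " \t\r\n":
            i += 1
        return i

    pos = skip_ws(0)
    if pos >= len(text) or text[pos] != "[":
        raise ValueError("expected a top-level JSON array")
    pos = skip_ws(pos + 1)
    if pos < len(text) and text[pos] == "]":
        return  # empty array
    while True:
        # raw_decode parses one value and reports where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value
        pos = skip_ws(pos)
        if pos < len(text) and text[pos] == ",":
            pos = skip_ws(pos + 1)
        elif pos < len(text) and text[pos] == "]":
            return
        else:
            raise ValueError(f"malformed array near offset {pos}")
```

As the comment notes, a parse error only surfaces once the generator reaches it, so earlier elements have already been handed to the caller by then.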

majkinetor3y ago

Meh.

Excel compatibility really sux. And Excel is the most used one by far. You really can't just double-click a CSV effectively, as everything will be shown as the generic type, so you have to mess with the import wizard, which is also half-baked. I have to create tutorials for that for each service using it.

elcritch3y ago

Sometimes CSV is nicer. Still you can cut down on your JSON by formatting it as a similar header style:

    [
      ["productId", "quantity", "customerId"],
      ["5710031efdfe", 1, "8fe96b88"],
      ["479cd9744e5c", 2, "526ba6f5"]
    ]

This style also works well with jsonlines a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)
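
On the consuming side, expanding that header-style payload back into records takes a few lines. This is a sketch with a made-up `rows_to_records` helper, using only Python's `json` module:

```python
import json

# The compact header-style payload from the comment above.
payload = """
[
  ["productId", "quantity", "customerId"],
  ["5710031efdfe", 1, "8fe96b88"],
  ["479cd9744e5c", 2, "526ba6f5"]
]
"""

def rows_to_records(text):
    """Expand [header, row, row, ...] into a list of dicts keyed by the header."""
    header, *rows = json.loads(text)
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} values, got {len(row)}")
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(payload)
```

Unlike CSV, the values keep their JSON types, so `quantity` arrives as a number rather than a string.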

account-53y ago

I think one of the issues is data types. JSON has them, CSV doesn't, so this means your program needs to be aware of which columns are which data type and do the conversion where needed.

It's similar to JSON Vs INI files for config files.

On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.

ajanuary3y ago

How often is this a concern in practice? It’s a question I’ve been thinking about a bunch and the answer I keep coming back to is that most of the time, encoding the type in the data exchange format isn’t actually very useful. So I would be interested in use cases where it is.

Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{"foo": "10"}` instead of `{"foo": 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letting the incorrect types run through your system until they hit a type error at runtime somewhere?
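
The "validate up front" option might look like the sketch below. The `SCHEMA` mapping and `validate` helper are hypothetical and stand in for whatever schema mechanism is actually in use.

```python
import json

# Hypothetical schema: field name -> expected Python type after json.loads.
SCHEMA = {"foo": int}

def validate(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for key, expected in schema.items():
        if key not in record:
            problems.append(f"missing field {key!r}")
        elif type(record[key]) is not expected:
            problems.append(
                f"{key!r}: expected {expected.__name__}, got {type(record[key]).__name__}"
            )
    return problems

good = validate(json.loads('{"foo": 10}'))
bad = validate(json.loads('{"foo": "10"}'))
```

Either way the consumer has to carry the expected types itself, which is the point being made: the type tag in the wire format does not remove the need for that knowledge.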

account-53y ago

> How often is this a concern in practice?

No idea really, but if you're using a JSON parsing library then that is going to automatically convert for the data types. Which, provided you trust the data, saves you a job.

majkinetor3y ago

JSON also has schema that can be used to verify it.

bufferoverflow3y ago

The author didn't compare gzipped/Brotli-compressed sizes.

The author didn't think of any examples with even a bit more complexity. If you have 2-level object nesting, now what?

ARandomerDude3y ago

> It's only 77 bytes, with 29 for the header and 24 for each line. At 100,000 entries, this list would be 2.4 MB (that's ~63% less than the JSON).

If size is really the issue but you still want schema enforcement protobuf is the way to go.

saulpw3y ago

protobuf is terrible! Now you have to rely on Google-scale tools to generate code for whatever language(s) you want to read or write the data in, and this becomes quite the encumbrance.

out_of_protocol3y ago

I'd go with sqlite instead. Also, there are specialized formats like Parquet

xwowsersx3y ago

I mean point well taken, but, as they acknowledged in the post themselves, CSV isn't suitable when you have a nested structure. And you almost always have/need a nested structure, no?

tremon3y ago

Relational databases have worked fine for decades without nested structures. The simple trick is to take the nested structure out of the entity and into its own table.
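
As a sketch of that trick, here is a nested structure split into two flat tables that could each be written out as CSV. The order data and field names are invented for illustration.

```python
# Hypothetical nested records, as they might arrive in a JSON payload.
orders = [
    {"order_id": 1, "customer": "alice",
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 2, "customer": "bob",
     "items": [{"sku": "A1", "qty": 5}]},
]

def normalize(orders):
    """Split nested orders into two flat tables linked by order_id."""
    order_rows, item_rows = [], []
    for order in orders:
        order_rows.append((order["order_id"], order["customer"]))
        for item in order["items"]:
            # The parent key is repeated on each child row, like a foreign key.
            item_rows.append((order["order_id"], item["sku"], item["qty"]))
    return order_rows, item_rows

order_rows, item_rows = normalize(orders)
```

The cost is that one nested document becomes one file per level of nesting, and the consumer has to join them back together.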

xwowsersx3y ago

Unless I misunderstood something, I'm not sure I understand the relevance here. I assumed we were talking about sending data to clients. In such cases, you do not send database tables. Instead, you send rich, fully hydrated objects which are the result of joining those tables. The serialized representation can be backed by the relational model, but at some point you have to put those together to send something useful to the client. My only point is that CSV is unsuitable for this task in many/most cases.

majkinetor3y ago

That may be a simple trick for the DB, but not when your paradigm involves importing files. Imagine telling users: instead of giving JSON, please give 75 CSV files.

anamexis3y ago

This scenario might be more common than you think -- spreadsheets still reign supreme, and often 75 csv files is how the users have the data to begin with.

(Incidentally, my day job is building a spreadsheet importer.)

sitkack3y ago

Did they? And all the databases I use regularly support nested structures, they are extremely expressive.

akhmatova3y ago

> CSV isn't suitable when you have a nested structure.

As the post acknowledges right about where you stopped skimming.

> And you almost always have/need a nested structure, no?

No.

xwowsersx3y ago

> as they acknowledged in the post themselves

As I noted in my own comment. Ironic to accuse me of skimming the original post when you couldn't even read my two sentences.

akhmatova3y ago

My very bad - please have my sincerest apology.

YmiYugy3y ago

I always thought CSV was just fine, until I had to ingest and export a bunch of CSV in my last project. The big problem is that CSV is not well defined and it's so deceptively simple that many don't bother to adhere to the spec that does exist. Just a few idiosyncrasies I found: inconsistent character encoding. If you open or save a CSV with Excel it will assume a Windows-1252 encoding. Since browsers deal exclusively with UTF-8, this gets really messy. The CSV I got didn't actually use a comma as a delimiter but a semicolon. Everyone seems to have conflicting opinions about whether strings should have quotes and, if so, which ones. The CSV I had to deal with also came with a decimal comma, which screwed up even more stuff. My advice: stay away from CSV as an exchange format. Use something that is well defined.

pcthrowaway3y ago

I'm definitely in the "Just use JSON for most things" camp, but I'm wondering, why would you ever choose CSV for interfacing microservices over protobuf?

Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?

panzerboiler3y ago

I usually prefer a binary encoding. More efficient on the wire, easier to parse and generate, and with no ambiguity. We have 2 control codes given to us by the teletype era that have the perfect meaning for this kind of data:

    0x1E Record Separator
    0x1F Unit Separator
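
A sketch of how those two control codes could be used, kept as text rather than a true binary encoding for simplicity. The `dumps` and `loads` names are made up; values that themselves contain a separator are rejected rather than escaped.

```python
US = "\x1f"  # Unit Separator: between fields
RS = "\x1e"  # Record Separator: between records

def dumps(rows):
    """Join fields with US and records with RS; reject values containing either."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator control code")
    return RS.join(US.join(row) for row in rows)

def loads(text):
    """Inverse of dumps for non-empty input."""
    return [record.split(US) for record in text.split(RS)]
```

Commas, quotes and line breaks pass through untouched because none of them has any meaning in this format.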

nmz3y ago

and because it's a single byte, it's fast, no need to tokenize.

You also have 2 more: the group separator and the file separator. So you could represent a tree with them.

sitkack3y ago

No one uses that format for streamed JSON; see ndjson and jsonl

http://ndjson.org/

The size complaint is overblown, as repeated fields are compressed away.

As other folks rightfully commented, CSV is a minefield. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.

What people should consider is using formats like Avro or Parquet that carry their schema with them so the data can be loaded and analyzed without having to manually deal with column meaning.

majkinetor3y ago

Since this is about CSV, this is an obligatory tool for larger ones:

* https://github.com/antonycourtney/tad

sitkack3y ago

For manipulating CSV from the terminal, check out https://github.com/BurntSushi/xsv

speq3y ago

There's a fork with new features: https://github.com/jqnatividad/qsv

sitkack3y ago

This is amazing work. Thanks for bringing it to my attention. Hopefully it and xsv can be merged at some point in the future.

majkinetor3y ago

Unless you really need ultra performance, PowerShell is certainly a much better option.

SillyUsername3y ago

Holy cow.

If somebody asked me to support this format after you'd left the company I'd quit on the spot. This frankenformat is 100% premature optimization, non standardised, requires custom parsers (which are potentially inefficient and may negate the network performance from having to parse both json and csv) and is potentially very difficult to maintain and debug (no syntax highlighters or rest like posting tools)

Just either use GRPC or JSON with regular network level gzip encoding.

beached_whale3y ago

A constrained format based on JSONL, with each record being a tuple of number/string/bool/null, could be better defined than CSV and looks almost like it. The benefit is that almost any JSON library could work with it, or could be made to, one line at a time, and it can be parallelized, as newlines only exist as the delimiter.

    ["hello",5,false,1,2,2.334,null]
    ["world",12,true,1,2,2.334,null]
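
A sketch of a reader for that constrained format, using Python's `json` module. The `parse_record` helper and its error messages are made up; it accepts one flat array of scalars per line and rejects anything nested.

```python
import json

SCALARS = (str, int, float, bool, type(None))

def parse_record(line):
    """Parse one line as a flat JSON array of scalars, rejecting anything else."""
    value = json.loads(line)
    if not isinstance(value, list):
        raise ValueError("each line must be a JSON array")
    if not all(isinstance(v, SCALARS) for v in value):
        raise ValueError("records may only contain strings, numbers, booleans or null")
    return value

lines = [
    '["hello",5,false,1,2,2.334,null]',
    '["world",12,true,1,2,2.334,null]',
]
records = [parse_record(line) for line in lines]
```

Since JSON strings cannot contain a raw line break, each line can be handed to a separate worker without any knowledge of the lines before it.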

albertopv3y ago

What else do you use if you have to import millions of rows from a client or supplier without direct integration but sftp?

CharlesW3y ago

Often, Parquet. https://parquet.apache.org/

wenc3y ago

It often surprises me that Parquet is not widely known outside of data engineering circles. Most software developers are still mucking around with CSV for large tabular data, which is absolutely the wrong format. Better developers use sqlite, which is less wrong but still wrong. Postgres is closer to the right answer, but for very large, typed tabular data, Parquet is the way to go.

Parquet is a columnar format that is compressed, typed, efficient for columnar queries, append-friendly (though not rewritable), and is a natively supported format for Apache Spark. Parquet libraries are now widely available for most languages (didn't use to be the case, but now they are). I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.

At work, the canonical format is TSV (tab separated values) which despite being human-readable, is huge, inefficient to query and does not support data types.

When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller and less prone to type errors. Because columnar formats like Parquet are indexed, I can do complex operations like joins, window functions, aggregations on them in a performant way, while any similar operation on TSV files will trigger a table scan each time.

I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.

spentu3y ago

I cannot count how many times CSV "format" has caused problems for me..

In my country the decimal separator is a comma instead of a point. This causes problems when importing and exporting with this "format".

Just a few weeks ago I had fun times working with an API returning CSV in an unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah, and I do love it when a CSV is missing headers and someone adds data into the middle.

Of course some of these issues can be avoided by doing the things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate results..

WirelessGigabit3y ago

No. Just no. The amount of times I've had issues with CSVs exported from a non-US locale is insane. They use semi-colon as separator, as for some weird reason they use the comma as the decimal point.

Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with the BOM characters up front or some French accents represented as ? because of incorrect encoding parsing / saving.

At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
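
The BOM part of this can be reproduced in a few lines of Python. The sample bytes below are made up to resemble a UTF-8 export with a byte order mark and a semicolon delimiter.

```python
import csv
import io

# Hypothetical bytes as a spreadsheet export might produce them: UTF-8 with a BOM.
raw = b"\xef\xbb\xbfnom;ville\r\nRen\xc3\xa9e;Orl\xc3\xa9ans\r\n"

# Plain "utf-8" keeps the BOM, so the first header name silently gains U+FEFF.
with_bom = next(
    csv.reader(io.StringIO(raw.decode("utf-8"), newline=""), delimiter=";")
)

# "utf-8-sig" strips a leading BOM if present and is harmless if it is absent.
rows = list(
    csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline=""), delimiter=";")
)
```

The first variant looks fine when printed but breaks any lookup by the column name `nom`, which is a typical way this bug shows up.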

sheeeep863y ago

You could have the advantages of both worlds by having one json object per line. You could stream process, and you could structure more complex objects and have consistent escaping.

gugagore3y ago

https://jsonlines.org/

nmz3y ago

You could do that, but you could also end up with JSON that is not streamable. You can't guarantee how large a JSON object will be, but you can guess how large a CSV row will probably be.

ledauphin3y ago

and the "redundant" headers will get compressed away to nearly nothing over most transports.

whateveracct3y ago

I quite like CSVs. I've used them to great effect at maybe every job I've ever had. xsv, sqlite, and Excel/LibreOffice provide useful tooling on top of them.

I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If humans produce them with Excel/LibreOffice, I never have issues on the ingestion end.

slotrans3y ago

Please don't. CSV is one of the worst file formats ever conceived. Use (compressed) line-delimited JSON if you need a file of records.

cpeterso3y ago

Another alternative is a streaming JSON format like JSONL (newline-delimited JSON). You can parse one record/line at a time, but still have the structure and named fields of JSON.

https://en.m.wikipedia.org/wiki/JSON_streaming

brundolf3y ago

I worked at a company where we did this for some endpoints and it worked great. Our client app had to request enormous time-series datasets and using CSV cut a significant percentage off of the payload size. I recommend it if you have similar constraints

nathants3y ago

i had a lot of fun exploring the performance ceiling of csv and csv like formats. turns out binary encoding of size prefixed byte arrays is fast[1].

csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.

1. https://github.com/nathants/bsv/tree/55c90797283f5e37f91bbb6...

fellowniusmonk3y ago

The performance of delimited formats can be exceptional. They can also be phenomenally terse and avoid the string tarpits of CSV and TSV if you just use these Unicode characters.

U+241D, U+241E, U+241F

teddyh3y ago

Those are not the unit/record/group separator characters! Those are the graphical symbols for the unit/record/group separator codes. The actual unit/record/group separator codes are in ASCII, as 'tremon' writes in a sibling comment.

fellowniusmonk3y ago

I guess I shouldn't post comments when I'm terribly hungover. The fact remains: use those separators. Most developers don't even seem aware of them to any degree.

tremon3y ago

Or these characters, from the ASCII era:

SOH (U+0001), US (U+001F), RS (U+001E), GS (U+001D), FS (U+001C)
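
A rough sketch of what using those control characters could look like in Python. The function names are hypothetical, and the choice to reject fields that already contain a separator (rather than escape them) is an assumption made to keep the example short.

```python
US = "\x1f"  # unit separator: placed between fields
RS = "\x1e"  # record separator: placed between records

def encode(rows):
    """Join rows of string fields using the ASCII unit/record separators."""
    for row in rows:
        for field in row:
            # There is no escaping in this sketch, so a field that contains
            # a separator cannot be represented and is rejected.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)

def decode(blob):
    """Split a blob produced by encode() back into rows of fields."""
    if not blob:
        return []
    return [record.split(US) for record in blob.split(RS)]

rows = [["a,b", 'say "hi"', "two\nlines"], ["x", "", "z"]]
assert decode(encode(rows)) == rows
```

Commas, quotes, and newlines need no special treatment here, which is the appeal; the cost is that the result is awkward to view or edit in most text editors.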

pkstn3y ago

Use gzip for compressing. If you want to stream, use the following syntax:

    [\n
      { ... },\n
      { ... },\n
      { ... },\n
      ...\n
    ]\n

With this simple trick you can stream easily..

pkstn3y ago

On the client side, you can parse a bunch of items by just wrapping them with [ and ].
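
A small sketch of consuming that layout line by line in Python, assuming every object sits on exactly one line as in the syntax above. The `iter_items` name and the sample payload are hypothetical.

```python
import json

# Hypothetical payload in the "one object per line inside an array" layout.
stream = '[\n  {"id": 1},\n  {"id": 2},\n  {"id": 3}\n]\n'

def iter_items(lines):
    """Yield one parsed object per line, ignoring the array brackets."""
    for line in lines:
        # Drop indentation and the trailing comma that separates items.
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue
        yield json.loads(text)

items = list(iter_items(stream.splitlines()))
```

This only works as long as the producer never pretty-prints an object across several lines, which is the same constraint JSON Lines has.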

pkstn3y ago

and well, there's also BSON: https://github.com/mongodb/js-bson

revskill3y ago

Sure! For example, for batch processing, CSV is always the default for me and the teams.

margarina723y ago

You may also simply add a format specification and return either CSV or JSON depending on the need or the context. Most languages have what they need to return either without much trouble.

dsmmcken3y ago

You could also consider Kafka for streaming, and Parquet for batch.


Macha3y ago

Right, but if you're picking CSV, you likely expect to interoperate with a provider that's not yourself. And then there's no way a parser can handle all CSV formats in the wild.

e.g. the example from my comment on the last CSV discussion (https://news.ycombinator.com/item?id=28223719)

What variant is this:

    1,5,Here is a string "" that does stuff,2021-1-1

What is the value of the third column?

Is this a CSV file without quoting? Then it's

    Here is a string "" that does stuff

Or is it a CSV file with double quote escaping? Then it's

    Here is a string " that does stuff
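
To make the ambiguity concrete, here is what Python's standard `csv` module does with that line, next to the other reading. The `unescaped` variable models a hypothetical producer that doubles quotes without wrapping the field; it is an illustration, not something the `csv` module does.

```python
import csv
import io

line = '1,5,Here is a string "" that does stuff,2021-1-1\n'

# With default settings, a quote that is not the first character of an
# unquoted field is kept literally, so the doubled quote survives as-is.
row = next(csv.reader(io.StringIO(line)))
literal = row[2]

# The other interpretation: the producer escaped a single quote by doubling
# it but never quoted the field. Nothing in the file says which is right.
unescaped = literal.replace('""', '"')

print(repr(literal))
print(repr(unescaped))
```

Both values are defensible, so the consumer has to know which convention the producer followed.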

brundolf3y ago

Yeah, but this is less of a problem if it's an internal API. You can stick to a stricter subset of the standard, and/or only handle the types of column values that you actually need

Still probably worth using a library, but it isn't a source of problems in my experience

ndsipa_pomu3y ago

gregmac3y ago

[1] https://www.ietf.org/rfc/rfc4180.txt

gregw23y ago

Regarding excel and rfc4180…

I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.

Counterexamples welcome.

CharlesW3y ago

mason553y ago

I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.

Parquet is great, don’t get me wrong.

wenc3y ago

scrollaway3y ago

Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

jointpdf3y ago

We data scientists are well-known for our exclusive mastery of data wrangling arcana, like…

  import pandas

  df = pandas.read_parquet('foo.parquet')
  df.to_csv('foo.csv')
  df.to_json('foo.json')

(no sarcasm)—how could it be simpler than that? What problems have you encountered that make it unusable?


CharlesW3y ago

> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.

> Is it a limitation of the format itself?

I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).

wenc3y ago

> It's so complex to work with

This is the opposite of my experience.

> To read a parquet file in Python, you need Apache Arrow and Pandas.

Or DuckDB.

    import duckdb
    df = duckdb.query("select * from 'a.parquet'")

Want to look inside a Parquet file? Use Visidata.

    vd a.parquet

    import pandas as pd

    # output to CSV
    pd.read_parquet("a.parquet").to_csv("a.csv")

    # output to JSON (choose from any number of orientations)
    pd.read_parquet("a.parquet").to_json(orient="table")

    # read rows
    for row in pd.read_parquet("a.parquet").itertuples():
        print(row)

fbdab1033y ago


fatneckbeardz3y ago

but which delimiter.

if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands

this is the problem and it's why xml/json exist.

in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.

elcritch3y ago

Well the obvious solution would be ASCII 0x1D (Group Separator)! Except, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.


civopsec3y ago

ndsipa_pomu3y ago

Pipes are quite common, but for tricky data, I'd recommend ¬. It's on most keyboards and I can't think of any other use of it.


toast03y ago

> but which delimiter

Control characters. Like ctrl-A and stuff. Almost nobody has them in their data.

fbdab1033y ago

hermitcrab3y ago

I wrote an article about tabular formats and their strengths and weaknesses here: https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...

The resulting HN discussion is here: https://news.ycombinator.com/item?id=31220841

_dain_3y ago

>CSV is superior to JSON for tabular data

Can't you just do this?

  {
    "columns": ["col1", "col2", "col3"],
    "data": [
               [1,      2,      3],
               [4,      5,      6],
               [7,      8,      9]
    ]
  }

That's valid JSON but it's human-readable and human-editable rows of comma-separated data, just like CSV.

hermitcrab3y ago

You can. But I don't see how that is superior to the equivalent CSV.


II2II3y ago

> The most common problems are when processing CSVs produced elsewhere [...]

ndsipa_pomu3y ago

It tends to be a lowest common denominator or a choice between CSV and Excel documents which are trickier to automate.

Kamq3y ago

Which works until you have a quote in your data. It also prevents chunking out the file to process in parallel since you could have new lines between the quotes.

What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.
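
A minimal sketch of that percent-encoding scheme in Python. The function names are hypothetical, and escaping carriage returns as well is an addition not mentioned above, made so that Windows line endings inside a field cannot split a record either.

```python
from urllib.parse import unquote

DELIMITER = ","

def encode_field(value: str) -> str:
    """Percent-encode the characters that would break the line/field framing."""
    # '%' must be escaped first so the escapes added below stay unambiguous.
    return (
        value.replace("%", "%25")
        .replace(DELIMITER, "%2C")
        .replace("\r", "%0D")
        .replace("\n", "%0A")
    )

def encode_row(fields):
    return DELIMITER.join(encode_field(f) for f in fields)

def decode_row(line: str):
    # After encoding, every remaining delimiter is a real field boundary,
    # so a plain split is safe; unquote reverses the percent escapes.
    return [unquote(f) for f in line.split(DELIMITER)]

row = ["a,b", "50% off", "line1\nline2", "100%2C literal"]
line = encode_row(row)
assert "\n" not in line
assert decode_row(line) == row
```

Since an encoded record never contains a raw newline, a large file can be cut at any line boundary and the chunks processed in parallel.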

Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.

majkinetor3y ago

With gzip on web server the difference is not important at all.

CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts it can surely be a good solution, but it is definitely not a good one in the general case.

CharlesW3y ago

majkinetor3y ago

It appears so, but it's not. I have not found a single program so far that conforms only to this RFC and nothing else.

From the RFC itself:

   Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

CharlesW3y ago

> I have not found a single program so far that conforms only to this RFC and nothing else.


Karellen3y ago

An "Internet Standard" is just a designation that has been given to an RFC that has been blessed in a certain way. See https://www.rfc-editor.org/ for more details, but the set of designations is:

    * Uncategorised
    * Historic
    * Experimental
    * Informational
    * Best Current Practice
    * Proposed Standard
    * Draft Standard
    * Internet Standard

"Rough consensus and running code" is how things get done.

RFC-4180 is plenty good enough a "standard" for people to decide to interoperate over. They just have to decide to do so.

briHass3y ago

thangalin3y ago

    `r#csv2md('filename.csv')`

Using JSON would be possible, but it's not as easy to convert into a Markdown table.

[0]: https://github.com/DaveJarvis/keenwrite

[1]: https://github.com/DaveJarvis/keenwrite/blob/main/R/csv.R#L3...

[2]: https://i.ibb.co/6FLXKsD/keenwrite-csv.png

[3]: https://i.ibb.co/47h6zNx/keenwrite-table.png

gugagore3y ago

The only reason, in my eyes, to use CSV is to have easy interoperability with spreadsheet software.

If you want streaming: https://jsonlines.org/

Karellen3y ago

JSON lines looks kinda interesting, but the newline-delimited thing seems weird.

Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!

majkinetor3y ago

Meh.

elcritch3y ago

Sometimes CSV is nicer. Still you can cut down on your JSON by formatting it as a similar header style:

    [
      ["productId", "quantity", "customerId"],
      ["5710031efdfe", 1, "8fe96b88"],
      ["479cd9744e5c", 2, "526ba6f5"]
    ]

This style also works well with the jsonlines format a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)

account-53y ago

I think one of the issues is data types. JSON has them and CSV doesn't, so your program needs to be aware of which columns are which data type and do the conversion where needed.
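
As a rough illustration of that conversion step, here is a sketch with Python's `csv.DictReader` and a per-column converter table. The `CONVERTERS` mapping, the `price` column, and the `read_typed` helper are hypothetical.

```python
import csv
import io

# Hypothetical schema: the CSV itself carries no types, so the reader
# has to know which converter applies to which column.
CONVERTERS = {"productId": str, "quantity": int, "price": float}

data = "productId,quantity,price\n5710031efdfe,1,9.99\n479cd9744e5c,2,0.5\n"

def read_typed(fh, converters):
    """Yield rows with every column converted by its declared type."""
    for raw in csv.DictReader(fh):
        # A missing converter raises KeyError, which surfaces unexpected
        # columns instead of silently passing them through as strings.
        yield {name: converters[name](value) for name, value in raw.items()}

rows = list(read_typed(io.StringIO(data), CONVERTERS))
```

With JSON the parser would hand back `1` and `9.99` as numbers directly; with CSV this table is the part you have to maintain yourself.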

It's similar to JSON Vs INI files for config files.

On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.

ajanuary3y ago

account-53y ago

> How often is this a concern in practice?

No idea really, but if you're using a JSON parsing library then that is going to automatically convert for the data types. Which, provided you trust the data, saves you a job.

majkinetor3y ago

JSON also has schemas that can be used to validate it.

bufferoverflow3y ago

The author didn't compare gzipped/brotlied sizes.

The author didn't think of any examples with even a bit more complexity. If you have 2-level object nesting, now what?

ARandomerDude3y ago

> It's only 77 bytes, with 29 for the header and 24 for each line. At 100,000 entries, this list would be 2.4 MB (that's ~63% less than the JSON).

If size is really the issue but you still want schema enforcement protobuf is the way to go.

saulpw3y ago

protobuf is terrible! Now you have to rely on Google-scale tools to generate code for whatever language(s) you want to read or write the data in, and this becomes quite the encumbrance.

out_of_protocol3y ago

I'd go with sqlite instead. Also, there are specialized formats like Parquet

xwowsersx3y ago

I mean point well taken, but, as they acknowledged in the post themselves, CSV isn't suitable when you have a nested structure. And you almost always have/need a nested structure, no?

tremon3y ago

Relational databases have worked fine for decades without nested structures. The simple trick is to take the nested structure out of the entity and into its own table.
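
A sketch of that trick applied to CSV output: the nested list moves into a second file keyed by the parent's ID. The order data, the `orderId` key, and the `to_csv` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical nested input: each order carries a list of line items.
orders = [
    {"orderId": 1, "customerId": "8fe96b88",
     "items": [{"productId": "5710031efdfe", "quantity": 1},
               {"productId": "479cd9744e5c", "quantity": 2}]},
    {"orderId": 2, "customerId": "526ba6f5",
     "items": [{"productId": "5710031efdfe", "quantity": 3}]},
]

def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Parent table: one row per order, without the nested list.
orders_csv = to_csv(
    ["orderId", "customerId"],
    [(o["orderId"], o["customerId"]) for o in orders],
)

# Child table: one row per item, pointing back at its order.
items_csv = to_csv(
    ["orderId", "productId", "quantity"],
    [(o["orderId"], i["productId"], i["quantity"])
     for o in orders for i in o["items"]],
)
```

Each additional level of nesting adds another file, which is how the file count grows on the import side.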

xwowsersx3y ago

majkinetor3y ago

That may be a simple trick for the db, but not when your paradigm involves importing files. Imagine telling users: instead of giving JSON, please give 75 CSV files.

anamexis3y ago

This scenario might be more common than you think -- spreadsheets still reign supreme, and often 75 csv files is how the users have the data to begin with.

(Incidentally, my day job is building a spreadsheet importer.)

sitkack3y ago

Did they? And all the databases I use regularly support nested structures, they are extremely expressive.

akhmatova3y ago

CSV isn't suitable when you have a nested structure.

As the post acknowledges right about where you stopped skimming.

And you almost always have/need a nested structure, no?

No.

xwowsersx3y ago

> as they acknowledged in the post themselves

As I noted in my own comment. Ironic to accuse me of skimming the original post when you couldn't even read my two sentences.

akhmatova3y ago

My very bad - please have my sincerest apology.


YmiYugy3y ago

pcthrowaway3y ago

I'm definitely in the "Just use JSON for most things" camp, but I'm wondering, why would you ever choose CSV for interfacing microservices over protobuf?

Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?

panzerboiler3y ago

    0x1E Record Separator
    0x1F Unit Separator

nmz3y ago

and because it's a single byte, it's fast, no need to tokenize.

You also have 2 more, the group separator and the file separator, so you could represent a tree with them.

sitkack3y ago

No one uses that format for streamed JSON; see ndjson and jsonl.

http://ndjson.org/

The size complaint is overblown, as repeated fields are compressed away.

As other folks rightfully commented, CSV is a minefield. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.

What people should consider is using formats like Avro or Parquet that carry their schema with them so the data can be loaded and analyzed without having to manually deal with column meaning.

majkinetor3y ago

Since this is about CSV, this is an obligatory tool for larger ones:

* https://github.com/antonycourtney/tad

sitkack3y ago

For manipulating CSV from the terminal, check out https://github.com/BurntSushi/xsv

speq3y ago

There's a fork with new features: https://github.com/jqnatividad/qsv

sitkack3y ago

This is amazing work. Thanks for bringing it to my attention. Hopefully it and xsv can be merged at some point in the future.

majkinetor3y ago

Unless you really need ultra performance, PowerShell is certainly a much better option.

SillyUsername3y ago

Holy cow.

Just either use gRPC or JSON with regular network-level gzip encoding.

beached_whale3y ago

["hello",5,false,1,2,2.334,null]

["world",12,true,1,2,2.334,null]

albertopv3y ago

What else do you use if you have to import millions of rows from a client or supplier without direct integration but sftp?

CharlesW3y ago

Often, Parquet. https://parquet.apache.org/

wenc3y ago

At work, the canonical format is TSV (tab separated values) which despite being human-readable, is huge, inefficient to query and does not support data types.

I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.

spentu3y ago

I cannot count how many times CSV "format" has caused problems for me..

In my country the decimal separator is a comma instead of a period. This causes problems when importing and exporting with this "format".

Of course some of these issues can be avoided by doing the things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate results..

WirelessGigabit3y ago

No. Just no. The number of times I've had issues with CSVs exported from a non-US locale is insane. They use the semicolon as separator because, for some weird reason, they use the comma as the decimal point.

At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
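
For what it's worth, reading such a file is not hard once you know which convention it uses; the trouble is that the file does not say. A sketch in Python, assuming semicolon delimiters, decimal commas, and no thousands separators. The sample data and the `parse_decimal_comma` helper are hypothetical.

```python
import csv
import io

# Hypothetical export from a locale that uses ';' between fields
# and ',' as the decimal mark.
export = "name;price\nWidget;1,50\nGadget;1234,5\n"

def parse_decimal_comma(text: str) -> float:
    """Convert '1,50' to 1.5; assumes no thousands separators are present."""
    return float(text.replace(",", "."))

# The delimiter has to be passed explicitly; it is not recorded in the file.
reader = csv.DictReader(io.StringIO(export), delimiter=";")
prices = {row["name"]: parse_decimal_comma(row["price"]) for row in reader}
```

Read with the default comma delimiter, the same file would give one column per line for the header and a spurious extra column for every price.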
