I work with large volumes of data from many different sources. I’m lucky to get them to send csv. Of course there are better formats, but all these sources aren’t able to agree on some successful format.
Csv that’s zipped is producible and readable by everyone. And that makes it more efficient.
I’ve been reading these “everyone is stupid, why don’t they just do the simple, right thing and I don’t understand the real reason for success” articles for so long it just makes me think the author doesn’t have a mentor or an editor with deep experience.
It’s like arguing how much mp3 sucks and how we should all just use flac.
The author means well, I’m sure. Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation.
All that said (and perhaps counter to what I said), I do hope "Unicode Separated Values" takes off. It's essentially just a slight tweak to CSV where the delimiters are special unicode characters, so you don't have to have complicated quoting/escaping logic, and it also supports multiple sheets (i.e. a workbook) in a single file.
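A simplified sketch of the idea, assuming the draft USV delimiters (the Unicode "control picture" characters ␟ for units, ␞ for records, ␝ for groups); the real spec has more detail, but the point is that parsing needs no quoting logic at all:

```python
# Simplified sketch of "Unicode Separated Values" (USV), assuming the draft's
# control-picture delimiters. Because the delimiters are characters that never
# appear in normal data, no quoting or escaping logic is needed.
UNIT = "\u241F"    # ␟ separates fields within a record
RECORD = "\u241E"  # ␞ separates records (rows)
GROUP = "\u241D"   # ␝ separates groups (sheets), giving a workbook in one file

def parse_usv(text):
    """Return a list of sheets, each a list of rows, each a list of fields."""
    return [
        [record.split(UNIT) for record in group.split(RECORD) if record]
        for group in text.split(GROUP)
    ]

data = "a" + UNIT + "b,with,commas" + RECORD + "c" + UNIT + "d"
print(parse_usv(data))  # [[['a', 'b,with,commas'], ['c', 'd']]]
```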
And the answer to that is always no. You will think it's yes because it works for you, but when you send it to someone who has a different Excel version or simply different regional settings, it won't work. The recipient will first have to figure out what dialect you used to export.
If Excel had a standardised "Save as USV" option it would solve so many issues for me.
I get so many broken CSVs from third-parties
Commas can be typed by anyone on any keyboard and readable by anyone.
Special Unicode Characters(tm) can't be typed by anyone on any keyboard and can be read by no one.
Convenience is a virtue.
https://www.iana.org/assignments/media-types/text/tab-separa...
Besides all the downsides CSV has, as soon as it's not only machine-to-machine communication and a human is involved, CSV is just simple enough.
And really is in search of a problem to solve.
If you have important data being shuffled around systems, pick something with a specification instead.
I also think we shouldn't blindly dismiss the performance drawbacks of CSV when working with data pipelines. At even modest scales it becomes hard to work with very large CSVs because the data often doesn't fit into memory, a problem easily solved by Parquet and other formats assuming you only need a subset.
Csv is not very performant, but it doesn’t matter for these use cases.
I’ll also add that I’m not working with the csvs, they are just I/o. So any memory issues are handled by the load process. I certainly don’t use csvs for my internal processes. Just for when someone sends me data or I have to send it back to them.
That being said my workstation is pretty big and can handle 10s of gigs of csv before I care. But that’s usually just for dev or debugging and anything that sticks around will be working with data in some proper store (usually parquet distributed across nodes).
Parquet and Avro are widely supported in backend languages and also in data analysis. I don't think the article is talking about exported-like-a-jpeg, but instead exported-like-a-daily-report-run: the data scientist doing the exporting is probably using R or Pandas instead of Excel, and can reasonably be expected to read https://arrow.apache.org/docs/r/reference/read_parquet.html.
"Maybe his next article will be about how airlines should speak Esperanto because English is such a flawed language. That’s a clever and unique observation."
Hm. Not saying csv doesn’t have its issues, but I don’t think the author made a convincing argument.
A lot of the issues the author brought up didn’t sound that bad and/or it sounds like he never looked at the source data first.
If you’re doing work with large datasets, I think it’s a good practice to at least go and look at the source data briefly to see what to expect.
This will give you a good idea of the format it outputs, data types, some domain context, etc. or some combination thereof and I don’t think it even takes that long.
Also, it reminds me of the arguments against excel in a way. Most people know what a csv is, more or less how to open it, and don’t need too much context when discussing the file. Someone will quickly understand if you tell them the file isn’t delimited properly right away. These are pros that shouldn’t be taken for granted.
Again, I’m not saying csv doesn’t have issues or that there aren’t better alternatives, simply that I didn’t find this particular argument convincing.
But the same is true for csv, and they are not readable by everyone, since you don't always know how to read them; there is not enough info for that
Also, it's not a good reflection on "deep experience" if it leads to reflexive defense of common stupid things people do, with wrong analogies (e.g., flac is less efficient, so it's more like csv)
"You give up human readable files,..."
I was genuinely interested in some alternative suggestions - but the human readableness of csv is what makes it so sticky imo.
The effervescent suggestion of a brighter, more logical, even obvious solution is often a clear indicator of domain inexperience or ignorance.
CSV created tons of issues regarding encoding, value separation etc.
I started talking to our customers and was able to define interfaces with a better, aligned format. json made my life easier.
In some senses, I think internet culture (maybe modern intellectual culture generally) gets stuck in these repetitive conversations.
Re-prosecuting them, seemingly without knowing about all the previous times the conversation has been had.
I think that's a little unfair, it sounds like the author does have a decent amount of experience working with real-world CSV files:
I remember spending hours trying to identify an issue that caused columns to "shift" around 80% into a 40GB CSV file, and let me tell you, that just isn't fun.
If only CSV were CSV, as opposed to some form that's 80-90% CSV by line count with enough oddities to really make the parser ugly and hard to read.
See, the sweet spot isn't something completely unstructured, because then you feel justified in throwing up (your hands) and declaring defeat. The sweet spot is a file that's sufficiently close to being structured you can almost parse it nicely, but has enough bad lines you can't go in and fix them all by hand in a reasonable timeframe, and you can't tell upstream to get their shit in order because it's only a few lines.
But I’d say the error rate is actually very low, maybe .1-1% and nowhere near 10-20% of data being messed up.
"Of course, we can't conclude that you should never export to CSV. If your users are just going to try to find the quickest way to turn your data into CSV anyway, there's no reason why you shouldn't deliver that. But it's a super fragile file format to use for anything serious like data integration between systems, so stick with something that at the very least has a schema and is more efficient to work with."
It's the same with csv. They come in all kinds of formats because nobody agreed on the standard. Comma separated, semicolon separated, pipe separated, escaped, not escaped.
Every time I have to deal with csv I first have to figure out how to parse it in code.
So I think the author is right, we must agree on a better format because that is what friends do.
You are also right, because it's an illusion to think that this is going to change anytime soon. But who knows..
Plus, taking the data out of [proprietary software app my client's data is in] in csv is usually easy. Taking the data out in Apache Parquet is...usually impossible, but if it is possible at all you'll need to write the code for it.
Loading the data into [proprietary software app my client wants data put into] using a csv is usually already a feature it has. If it doesn't, I can manipulate csv to put it into their import format with any language's basic tools.
And if it doesn't work, I can look at the csv myself, because it's human readable, to see what the problem is.
90% of real world coding is taking data from a source you don't control, and somehow getting it to a destination you don't control, possibly doing things with it along the way. Your choices are usually csv, xlsx, json, or [shudder] xml. Looking at the pros and cons of those is a reasonable discussion to have.
It's also worth noting that SQLite can ingest CSV files into memory and perform queries on them directly - if the files are not too large, it's possible to bypass the sqlite format entirely.
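A sketch of that workflow using Python's stdlib `sqlite3` and `csv` modules (the CSV content and table name here are illustrative), equivalent in spirit to the sqlite3 CLI's `.import`:

```python
import csv
import io
import sqlite3

# Sketch: load a small CSV into an in-memory SQLite database and query it
# directly, never touching the sqlite file format.
raw = "name,amount\nalice,10\nbob,25\nalice,5\n"
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

con = sqlite3.connect(":memory:")
columns = ", ".join(f'"{c}"' for c in header)
placeholders = ", ".join("?" * len(header))
con.execute(f"CREATE TABLE t ({columns})")
con.executemany(f"INSERT INTO t VALUES ({placeholders})", body)

total = con.execute(
    "SELECT SUM(CAST(amount AS REAL)) FROM t WHERE name = 'alice'"
).fetchone()[0]
print(total)  # 15.0
```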
I work with a provider who offers CSV exports as the only way to access data. Recently, we found they were including unsanitized user input directly in fields. They weren't even quoting these fields.
The platform "notified their quality assurance team ASAP" (like every other issue, we never heard back), but we had a deadline. This, of course, was a mess, but being able to quickly open the file and fix quotes was all it took. I shudder at the thought of trying to salvage a corrupted binary with zero help from the holder of the data.
The issue comes from CSV files looking easy to generate by hand, when in fact they are not.
In the French locale, the decimal point is the comma, so "121.5" is written "121,5". It means, of course, that the comma can't be used as a separator, so the semicolon is used instead.
It means that depending whether or not the tool that exports the CSV is localized or not, you get commas or you get semicolons. If you are lucky, the tool that imports it speaks the same language. If you are unlucky, it doesn't, but you can still convert it. If you are really unlucky, then you get commas for both decimal numbers and separators, making the file completely unusable.
There is a CSV standard, RFC 4180, but no one seems to care.
It’s unfortunate that there isn’t a single CSV format, but for historical reasons it is what it is. It’s effectively more like a family of formats that share the same file extension.
Excel actually has a convention where it understands when there is a line
sep=;
at the start of the file. By the way, in addition to differing separators, you can also get different character encodings.
Excel does understand a BOM to indicate UTF-8, but some versions of Excel unfortunately ignore it when the “sep=“ line is present…
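A sketch of emitting that hint from code; note that `sep=` is an Excel convention, not part of RFC 4180, so non-Excel readers may see the hint as a data row:

```python
import csv
import io

# Sketch: emit a semicolon-separated file with Excel's non-standard "sep=" hint
# on the first line. This is an Excel convention, not part of RFC 4180, and
# (per the caveat above) some Excel versions ignore a UTF-8 BOM when it is used.
buf = io.StringIO()
buf.write("sep=;\n")
writer = csv.writer(buf, delimiter=";")
writer.writerow(["id", "price"])
writer.writerow([1, "121,5"])  # a decimal comma needs no quoting here
print(buf.getvalue())
```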
Thank you! This is a game changer. As I was reading through these comments I was thinking how much better it would be if the separator could be specified in the file, but if I were to do that on my own, it would only be useful for my own stuff.
I’ll be trying this first thing tomorrow at work. I don’t do as much with CSVs as I used to, but am currently working with them a lot, and have a single entry that always throws the resulting Excel file off, and I can’t be bothered to figure out how to escape it (I spent too much time in the past messing with that and got nowhere). This sep= line will hopefully solve my issues.
https://en.wikipedia.org/wiki/Decimal_separator
https://commons.wikimedia.org/wiki/File:DecimalSeparator.svg
Overall, semicolon feels like the superior delimiter.
Most sensible people don't export formatted numbers (e.g. 100.000,00), but even those are pretty trivial to import.
i.e.:

```
"1,20","2,3",hello
"2,40","4,6",goodbye
```
If your tool reads CSV by doing `string_split(',', line);`, your tool is doing it wrong. There's a bunch of nuance and shit, which can make CSVs interesting to work with, but storing a comma in a field is a pretty solved issue if the tool in question has more than 5 minutes thought put into it.
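For illustration, Python's stdlib `csv` module handles the embedded-comma and doubled-quote cases that naive splitting gets wrong:

```python
import csv
import io

# The stdlib csv module handles embedded delimiters and doubled quotes,
# exactly the cases that string_split(',', line) gets wrong.
line = '"Smith, John",42,"said ""hi"""\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['Smith, John', '42', 'said "hi"']
```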
You should probably use PSV - Point Separated Variable! Obviously we would need to adjust PSV to account for correct French word order (and actually use French correctly). Britain and USA would use something called PVS2 instead as a data interchange format with France which involves variable @@@ symbols as delimiters, unless it is a Friday which is undefined. The rest of the world would roll its eyes and use PSV with varying success. A few years later France would announce VSP on the back page of Le Monde, enshrine its use in law (in Aquitaine, during the fifteenth century) but not actually release the standard for fifteen years.
The world is odd. We have to work within an odd world.
Interestingly enough, you and I could look at a CSV encoded data set with commas as decimal separators and work out what is going on. You'll need a better parser!
Heh. Ah, HN. Always good for a laugh line.
The advantages of CSV are pretty massive though - if you support CSV you support import and export into a massive variety of business tools, and there is probably some form of OOTB support.
You have a (usually) portable transport format that can get the information into and out of an enormous variety of tools that do not necessarily require a software engineer in the middle.
I'm also struggling with such a quick dismissal of human readable formats. It's a huge feature.
What happens when there's a problem with a single CSV file in some pipeline that's been happily running fine for years? You can edit the thing and move on with your day. If the format isn't human readable, now you may have to make and push a software update to handle it.
Of course, CSV is a terrible format that can be horribly painful. No argument there.
But despite the pain, it's still far better than many alternatives. In many situations.
```
while IFS="$(printf \\a)" read -r field1 field2...
do ...
done
```
This works just as well as anything outside the range of printing characters. Getting records that contain newlines would be a bit trickier.
Parquet is the opposite of simple. Even when good libraries are available (which they usually aren't), it is painful to read a Parquet file. Try reading a Parquet file using Java and the Apache Parquet lib, for example.
Avro is similar. Last I checked there are two Avro libs for C# and each has its own issues.
Until there is a simple format that has ubiquitous libs in every language, CSV will continue to be the best format despite the issues caused by under-specification. Google Protobuf is a lot closer than Parquet or Avro. But Protobuf is not a splitable format, which means it is not Big Data friendly, unlike Parquet, Avro and CSV.
I skimmed their docs a bit: https://parquet.apache.org/docs/
I would not look forward to implementing that.
It all seems rather complex, and even worse: not actually all that well described. I suppose all the information is technically there, but it's really not a well-designed, well-written specification that's easy to implement. The documentation seems like an afterthought.
This is probably why good libraries are rare.
> Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
Nope and nope.
What? When I worked at Google, concatenating protobuf strings was a common way to concatenate protobufs; they are absolutely splittable. People might not know it, but there is a reason they are designed like they are: it is to handle big data, as you say.
If you mean you can't split a single protobuf, sure, but you don't need to do that, it is like trying to split a single csv line, doesn't mean csv isn't splittable.
Edit: Maybe some of that tooling to work with protobufs is internal-only or not part of the external protobuf packages, though.
I laughed. We don't call somebody writing a poor varint serializer for in-house use, then discovering that it doesn't handle negative numbers and floats that well so slapping a couple of hotfixes on top of it that make it necessary to have a protocol specification file, "design".
They support CSV but not your CSV.
For example, how does quoting work? Does quoting work?
They had never had a customer do X on Y field, so they never quoted it nor added code to quote it if needed..
Of course, we did X in one entry. Took me too long to find that which obviously messed up everything after.
Eh, I don't think that's the problem. If that was the problem, there are a zillion ways to chunk files; .tar is probably the most ubiquitous but there are others.
The bigger problem is that Protobuf is way harder to use. Part of the reason CSV is underspecified is it's simple enough it feels, at first glance, like it doesn't need specification. Protobuf has enough dark corners that I definitely don't know all of it, despite having used it pretty extensively.
I think Unicode Separated Values (USV) is a much better alternative and as another poster mentioned, I hope it takes off.
If you're exporting files for machine consumption, please consider using something more robust than CSV.
People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)
There's also an implicit assumption to each use-case about whether the data can/should fit in memory or not, and how much RAM a typical machine would have.
As you mention, it's pretty standard to store and access compressed CSV files as .csv.zip or .csv.gz, which mitigates at least trading off the space issue for a performance overhead when extracting or searching.
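A sketch of the streaming pattern, using Python's stdlib `gzip` and `csv`, so the uncompressed data never has to sit in memory all at once (an in-memory buffer stands in for a real `.csv.gz` file here):

```python
import csv
import gzip
import io

# Sketch: stream-write and stream-read a gzipped CSV row by row, the common
# .csv.gz pattern mentioned above. A BytesIO stands in for a real file.
buf = io.BytesIO()
with gzip.open(buf, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(3):
        writer.writerow([i, i * i])

buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.reader(f))
print(rows)  # [['id', 'value'], ['0', '0'], ['1', '1'], ['2', '4']]
```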
The historical reason a standard like CSV became so entrenched with business, financial and legal sectors is the same as other enterprise computing; it's not that users are ignorant; it's vendor and OS lock-in. Is there any tool/package that dynamically switches between formats internally? estimates comparative file sizes before writing? ("I see you're trying to write a 50Gb XLSX file...") estimates read time when opening a file? etc. Those sort of things seem worth mentioning.
Something more boring, like "Consider whether other options make more sense for your data exports than CSV". Plenty of people have suggested other good options in comments on this submission, such as for example sqlite. I think the post comes off as if I'm trying to sell a particular file format for all use cases, when what I had in mind when writing it was to discourage using CSV as a default. CSV has a place, certainly, but it offloads a lot of complexity on the people who are going to consume the data; in particular, they need to figure out how to interpret it. This can't necessarily be done by opening the file in an editor and looking at it; beyond a certain size you're going to need programming or great tools to inspect it anyway.
I was given an initial export of ~100 poor-quality CSV files totaling around 3TB (~5-6 different tables, ~50 columns in each) in size a few years back, and had to automate ingestion of those and future exports. We could've saved a lot of work if the source was able to export data in a friendlier format. It happened more than once during that project that we were sent CSVs or Excel sheets that had mangled data, such as zip codes or phone numbers with leading 0s removed. I think it is a good thing to inform people of these problems and encourage the use of formats that don't necessitate guessing data types. :shrug:
> People are busy; instead of hinting "something more robust than CSV", mention the alternatives and show a comparison (load time/search time/compression ratio) summary graph. (Where is the knee of the curve?)
This might be an interesting thing to follow up later, but would require a lot more work.
Sometimes a click-baity title is what you need to get a decent conversation/debate going. Considering how many comments this thread got, I'd say you achieved that even if sparking a lengthy HN thread had never been your intent.
The problem with parquet is it's complicated and you basically have to remap from parquet to whatever you're importing into because the people on the other side have remapped from whatever to parquet.
There are likely relationships and constraints there that you'll have to hack around - which is harder to do because the parquet tools sort of suck/aren't as flexible.
With CSV you can hack around any problem in the ETL process.
String encoding? Dates? Formatted numbers? Booleans (T/F/Y/N/etc)? Nested quotes? Nested CSV!?
How about intermediate systems that muck things up. String encoding going through a pipeline with a misconfiguration in the middle. Data with US dates pasted into UK Excel and converted back into CSV, so that the data is a mix of m/d/yy and d/m/yy depending on the magnitude of the numbers. Hand-munging of data in Excel generally, so that sometimes the data is misaligned WRT rows and columns.
I've seen things in CSV. I once wrote an expression language to help configure custom CSV import pipelines, because you'd need to iterate a predicate over the data to figure out which columns are which (the misalignment problem above).
- treat it as a join, and unroll by duplicating non-nested CSV data in separate rows for every element in the nested CSV
- treat it as a projection, have an extraction operator to project the cells you want
- treat it as text substitution problem; I've seen CSV files where every line of CSV was quoted like it was a single cell in a larger CSV row
You get nested CSV because upstream systems are often master/detail or XML but need to use CSV because everybody understands CSV because it's such a simple file format. Good stuff.
CSV gives you the freedom of choice how to write the data.
You need the French date format? You can write the French date format. You need # as separator? You can use # as separator.
Sure, you can't read any file without knowing its specification, but that's not the use case of CSV.
CSV is for data transfer between systems which know each other.
Add another munging to the list: ids that 'look like numbers', e.g. `0002345`, will get converted to `2,345`. Better be sure to prepend ', i.e. `'0002345`
Parsing RFC-compliant CSVs and telling clients to go away with non-compliant CSVs is not hard.
Parsing real world CSVs reliably is simply impossible. The best you can do is heuristics.
How do you interpret this row of CSV data?
1,5,The quotation mark "" is used...,2021-1-1
What is the third column? The RFC says that it should just be literally
> The quotation mark "" is used...
But the reality is that some producers of CSVs, which you will be expected to support, will just blindly apply double quote escaping, and expect you to read:
The quotation mark " is used...
Or maybe you find a CSV producer in the wild (let's say... Spark: https://spark.apache.org/docs/latest/sql-data-sources-csv.ht...) that uses backslash escaping instead.
5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
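To make the ambiguity concrete, here is the same field value serialized under both conventions and parsed with Python's `csv` module configured for each dialect (the Spark-style backslash behavior is approximated with `escapechar`):

```python
import csv
import io

# The same field value serialized under RFC 4180 doubled-quote rules and under
# backslash escaping (as e.g. Spark can be configured to emit). A reader must
# know which dialect the producer used; the bytes alone don't say.
rfc_line = '1,5,"The quotation mark "" is used...",2021-1-1\n'
backslash_line = '1,5,"The quotation mark \\" is used...",2021-1-1\n'

rfc_row = next(csv.reader(io.StringIO(rfc_line)))  # default: doubled quotes
bs_row = next(csv.reader(io.StringIO(backslash_line),
                         doublequote=False, escapechar="\\"))
print(rfc_row[2])  # The quotation mark " is used...
print(bs_row[2])   # The quotation mark " is used...
```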
And data type inference is a fun side project. I worked for a while on a fully generic CSV to SQL converter. Basically you end up with regex matches for different formats, and you keep a running tally of errors encountered and then do a best fit for the column. Usually a single CSV file is consistent with itself for weird formatting induced by whatever process the other side used. It actually worked really well, even on multi-gigabyte CSV files from medical companies that one of my clients had to analyze with a standard set of SQL reports.
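A toy sketch of that tally approach (the candidate patterns and error threshold here are illustrative, not what the original converter used):

```python
import re

# Toy sketch of column type inference by tallying regex mismatches.
PATTERNS = {
    "integer": re.compile(r"-?\d+"),
    "decimal": re.compile(r"-?\d+[.,]\d+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # checks shape, not validity
}

def infer_column_type(values, max_error_rate=0.3):
    # Try the more specific formats first; accept the first one whose mismatch
    # rate stays under the threshold, otherwise fall back to plain text.
    for name, pattern in PATTERNS.items():
        errors = sum(pattern.fullmatch(v) is None for v in values)
        if errors / len(values) <= max_error_rate:
            return name
    return "text"

print(infer_column_type(["1", "2", "oops", "4"]))       # integer
print(infer_column_type(["2021-01-01", "2021-02-30"]))  # iso_date
```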
Data exchange via CSV becomes a multiple-party negotiation between software (sometimes the other side is "set in stone", such as a third-party commercial system or Excel that is entrenched), users, and users' local configuration. However, while not well defined, none of these are intractable from a software engineering point of view. And if your system is valuable enough and needs to ingest data from enough places, it is NOT intractable to build auto-detection capabilities for all the edge cases. I have done it. Is it sometimes a challenge? Yes. Do project managers love it when you give 8-point estimates for what seems like it should be a simple thing? No. That is why the coding-kata RFC-compliance solution as a base first makes the most sense, and then you troubleshoot as bug reports come in about data problems with various files.
If you are writing a commercial, enterprise solution that needs to get it right on its own every time, then that becomes a much bigger project.
But do you know what is impossible? Getting the entire universe of other systems that your customers are using to support some new interchange format. Sorry, that system is no longer under active development. No one is around to make code changes. That is not compatible with the internal tools. For better or worse, CSV is the lingua franca of columns of data. As developers, we deal with it.
And yes, do support Excel XLSX format if you can. There are multiple third party libraries to do this in various languages of various quality.
As a developer, I have made a lot of my living dealing with this stuff. It can be a fun challenge, frustrating at time, but in the end as professionals we solve what we need to to get the job done.
A nice alternative I've used often is to construct a table and then give it an .xls extension, which Excel happily accepts and which requires much less user explanation than telling individual users how to get Excel to correctly parse a CSV.
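One common variant of this trick, sketched under the assumption that the "table" is plain HTML: Excel will open an HTML table saved with an .xls extension (recent versions first show a format-mismatch warning), with no CSV import dialog involved.

```python
# Sketch: write a plain HTML table and give it an .xls extension. Excel opens
# it as a spreadsheet (recent versions first warn that the format and
# extension don't match), so no one has to drive a CSV import dialog.
rows = [("Order", "Zip"), ("1001", "00234")]

cells = "".join(
    "<tr>" + "".join(f"<td>{value}</td>" for value in row) + "</tr>"
    for row in rows
)
html = f"<html><body><table>{cells}</table></body></html>"

with open("report.xls", "w", encoding="utf-8") as f:
    f.write(html)
```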
* What does missing data look like? The empty string, NaN, 0, 1/1-1970, null, nil, NULL, \0?
* What date format will you need to parse? What does 5/5/12 mean?
* How has multiline data been written? Does it use quotation marks, properly escape those inside multiline strings, or maybe just expect you to count the delimiters? And by the way, can delimiters occur inside bare strings?
And let me add my own question here:
what is the actual delimiter? Do you support `,`, `;` and `\t`?
What is the decimal delimiter, “.” or “,”?
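One heuristic answer to the delimiter question is Python's `csv.Sniffer`, which guesses the dialect from a sample; it is only a guess, so verify it against known data when it matters:

```python
import csv

# csv.Sniffer guesses the delimiter from a sample of the file. It is a
# heuristic, not a guarantee, but it handles the common cases above.
found = []
for sample in ["a,b,c\n1,2,3\n", "a;b;c\n1;2;3\n", "a\tb\tc\n1\t2\t3\n"]:
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    found.append(dialect.delimiter)
print(found)  # [',', ';', '\t']
```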
Most csv users don’t even know they have to be aware of all of these differences.
I am aware you can import a csv file in excel by manually defining the column types but few people use that.
I'd be fine with an extension of the csv format with one extra top row to define the type of each column.
And what fraction of those users would be able to do anything with another format?
> an extension of the csv format with one extra top row to define the type of each column
If the goal is foolproof export to Excel, use XLSX.
Especially if you work with teams from different countries, csv is hell. I always generate rfc-compliant csv; not once was it accepted from day one. Once, it took us two weeks to make it pass the ingestion process (we didn't have access to the ingest logs and had to request them each day, after the midnight processing), so in the end it was only 10 different tries, but still.
I once had an issue with json (well, not one created by me), and it was clearly a documentation mistake. And I hate json (I'm an XML proponent usually). Csv is terrible.
Parsing the CSV you have in front of you is not hard (usually). Writing a parser that will work for all forms of CSV you might encounter is significantly harder.
This is the problem though. Everyone thinks it is "obvious" and does their own broken implementation where they just concatenate values together with commas, and then outsources dealing with the garbage to whoever ends up with the file on their plate.
If a more complex, non-obvious format was required, instead of "easy I'll just concatenate values" they might actually decide to put engineering into it (or use a library)
And yet you can anyway if you are confident that your CSV won't contain anything that would mess it up.
No.
Of course there are better formats. But for many use cases friends encourage friends to export to CSV.
It's a little harder to load into a spreadsheet, but in my experience, way easier to reliably parse in any programming language.
But outside the data science field, my experience working on software these years is that it won't matter how beautiful your backoffice dashboards and web apps are: many non-technical business users will demand CSV import and/or export capabilities at some point, because it is easier for them to just dump all the data in a system into Excel/Sheets to make reports, or to bulk edit the data via export-excel-import, rather than dealing with the navigation model and maybe tens of browser tabs in your app.
I treat formats such as parquet as I generally do: I try to allow various different inputs, and produce standard outputs. Parquet is something I allow purely as an optimization. CSV is the common default all of my tools have (UTF-8 without BOM, international locale, comma separator, quoting at the start of the value optional, standards-compliant date format or unix timestamps). Users generally don't have any issue with adapting their files to that format if there's any difference.
- different physical types to store timestamps: INT96 vs INT64
- different ways to interpret timestamps before tzdb (current vs earliest tzdb record)
- different ways to handle proleptic Gregorian dates and timestamps
- different ways to handle time zones (since Parquet only has the equivalents of LocalDateTime and Instant, but no OffsetDateTime or ZonedDateTime and earlier versions of Hive 3 were terribly confused which is which)
- decimal data type was written differently, as a byte array in older versions and as int/byte array/binary in the newer ones
- Hadoop ecosystem doesn't support decimals longer than 38 digits, but the file format supports them
This line is emblematic of the paradigm shift LLMs have brought. It’s now easier to build a better tool than change everyone’s behaviour.
> You give up human readable files, but
What are we even doing here.
Because when I'm wrangling data from a human - a human who is stubbornly defending their own little island of business information like their employment depended on it[1] - a CSV is about as good as I am gonna get.
I had a bear of a time just convincing people to put their data in a delimited format, instead of a table inside a powerpoint presentation, or buried in sixty levels of Access joins, or in an SVG. I need data from "what is scroll wheel" sort of users.
If I am working system to system? That's a different requirement, a requirement that is apparently where the OP author is coming from.
[1] Because it kind of does. Having a unique platform is one of those priceless keys to being skipped in the thrice-yearly layoff rituals. Unfortunately, that means anyone approaching saying words like "integration" or "API" are shot on sight.
How do you verify it? What happens next?
Even Excel can handle it.
It is far safer to munge data containing tabs (convert to spaces, etc), than commas (remove? convert to dots? escape?).
The better answer is to use ASCII separators as Lyndon Johnson intended, but that turns out to be asking a lot of data producers. Generating TSV is usually easier than generating CSV.
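For example, with tab as the delimiter (Python's built-in `excel-tab` dialect), a field containing commas needs no quoting at all:

```python
import csv
import io

# With tab as the delimiter, the usual CSV quoting questions mostly disappear:
# a comma in a field is just data.
buf = io.StringIO()
writer = csv.writer(buf, dialect="excel-tab")
writer.writerow(["name", "note"])
writer.writerow(["Smith, John", "commas are just data here"])
print(buf.getvalue())
```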
This may shock some people, but Excel and other MS Office products aren't that common to have around for developers any more. Through the nineties, MS had a de facto monopoly on the desktop. But these days, a lot of developers use Macs, which are also popular with business people. Google Docs seems popular with startups; all companies I've been in for the last 12 years default to that.
Anyway, I've done this on a few teams where it just shortcuts the whole discussion about needing a bespoke UI to edit some table of stuff for managers. I've even replaced a failed project to build such a UI with simple stuff like this. All you need is a bit of validation when importing, and you can keep the last known good export in git and do pull requests to update the file. There's a weird dynamic where giving this level of control to some managers actually makes them feel more involved and engaged.
My preferred format to work with is actually ndjson (newline-delimited json). Mainly because it's easier to deal with nested objects and lists in that. Whenever people start putting lists in table cells or start adding dots or other separators to their column names to indicate some kind of hierarchy, ndjson is the better solution. I've seen all sorts of attempts by business people to stuff non-tabular data into a table.
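A minimal sketch of the ndjson round trip: one JSON object per line, so nested lists and objects need no table-cell contortions, and files can still be streamed line by line.

```python
import io
import json

# Minimal ndjson round trip: each record is one JSON object on its own line.
# Nested lists and objects just work; no delimiters or quoting rules to guess.
records = [
    {"name": "alice", "tags": ["admin", "ops"]},
    {"name": "bob", "tags": [], "address": {"city": "Paris"}},
]

buf = io.StringIO()
for record in records:
    buf.write(json.dumps(record) + "\n")

buf.seek(0)
parsed = [json.loads(line) for line in buf]
print(parsed == records)  # True
```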
With proper escaping the separator doesn't matter.
After that csv generating is pretty easy.
Last time I checked, it can't read it without errors if you just open the file directly, rather than using the import function and explicitly selecting UTF-8.
Commas are, by far, better delimiters than tabs in the grand scheme of things, with both expert and regular users considered.
I like that everyone knows about .csv files, and it's also completely human readable.
So for <100mb I would still use csv.
> 2. The only data type in CSV is a string. There is no null, there are no numbers.

Anything else must be agreed upon between producer and consumer (or, more commonly, a consumer looks at the CSV and decides how the producer formatted it). JSON doesn't include dates either, and you're not going to see people start sending API responses as Apache Parquet. CSV is fiiine.
I feel like USV (Unicode Separated Values) neatly improves CSV while maintaining most of its benefits.
Files in .csv.gz are more compact than anything else, unless you have some very-very specific field of work and a very compressible data. As far as I remember, Parquet files are larger than CSV with the same data.
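The compression claim is easy to sanity-check on toy data (a sketch; real ratios depend entirely on your data):

```python
import gzip

# Toy table: repetitive text like CSV tends to compress very well
rows = "id,name,score\n" + "\n".join(f"{i},item{i},{i % 10}" for i in range(1000))
raw = rows.encode("utf-8")
compressed = gzip.compress(raw)

# Round-trips losslessly, and any tool that reads CSV can read the
# decompressed stream
restored = gzip.decompress(compressed).decode("utf-8")
```
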
Working with the same kind of data in Rust, I see everything saved and loaded in CSV is lightning fast. The only thing you may miss is indexing.
Whereas saving to binary is notably slower. Data in a generic binary format becomes LARGER than in CSV. (Maybe if you define your own format and write a driver for it, you'll be faster, but that means no interoperability at all.)
A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, and GeoParquet are roughly 10x faster than saving to CSV. Additionally, the CSV is much larger than any other format.
[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...
What's the current consensus? Can you link to a summary article?
(Some still say GeoPackage is: https://mapscaping.com/shapefiles-vs-geopackage/ )
Like JSON.
CSV is just too simple and "user-friendly".
I suspect that not everybody here work exclusively with huge datasets and well-defined data pipelines.
On a practical side, if I want to follow suggestions, how do I export to Avro from Numbers/Excel/Google Sheets?
> 1. CSV is underspecified

Okay, so specify it for your use case and you're done? E.g. use RFC 3339 instead of the straw-man 1-1-1970, and define what a missing value looks like, which is usually an empty string.

> 2. CSV files have terrible compression and performance

Okay, who in their right mind uses a plain-text file to export 50 GB of data? Some file systems don't even support files that large. When you are at the stage of REGULARLY shipping around files this big, you should think about a database, not another file type to send via mail. Performance may be a point, but again, using CSV for gigantic files is wrong in the first place.

> 3. There's a better way

(Insert presentation of a file type I have never heard of.) There are lots of better ways to do this, but: CSV is extremely fast to implement, it is universally known unlike Apache Parquet (or Pickle or ORC or Avro or Feather...), and it is human-readable.

So in the end: use it for small data exports where you can specify everything you want, or anywhere you can import data, because most software takes CSV as input anyway.
For lots of data use something else.
Friends don't let friends write one-sided articles.
Like the NHS with their COVID data due to Excel limitations, or the gene-name conversion problems in scientific journals.
The same happens with a stupid number of laboratory-management systems and bioinformatics tools.
Honestly, the article is obviously biased, but we should at least think about moving away from CSV on non-customer-facing fronts.
Small files? JSON. Big files? SQLite or Parquet.
https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
install.packages("arrow")
and then I did ?read_parquet
to get an example. I tried the example, and got the error message below. This sort of error is really quite uncommon in R. So my answer to the "with little effort" is "no", at least for R.

> tf <- tempfile()
> write_parquet(mtcars, tf)
Error in parquet___WriterProperties___Builder__create() :
  Cannot call parquet___WriterProperties___Builder__create(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.

Nor can you control the format others want. The reason I have to export to CSV is, unfortunately, that the people I ship out to use Excel for everything, and even though Excel does support many different data formats, they either enjoy using .csv (it should be mentioned that the import feature in Excel works pretty damn well), or have some system written in VBA that parses .csv files.
[please kill me!]
Nothing beats CSV in terms of simplicity, minimal friction, and ease of exploring across diverse teams.
CSV does not deserve the hate.
Instead of breaking CSV by adding to it, I recommend augmenting it:
It would be useful to have a good standardized / canonical json format for things like encoding, delimiter, schema and metadata, to accompany a zipped csv file, perhaps packaged in the same archive.
Gradually datasets would become more self-documenting and machine-usable without wrangling.
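A minimal sketch of what such a sidecar could look like; every key name below is invented for illustration and is not from any standard:

```python
import json

# Hypothetical sidecar describing how to parse the accompanying CSV.
# All field names here are made up for illustration only.
sidecar = {
    "file": "orders.csv",
    "encoding": "utf-8",
    "delimiter": ",",
    "header": True,
    "columns": [
        {"name": "order_id", "type": "integer"},
        {"name": "placed_at", "type": "date", "format": "rfc3339"},
        {"name": "total", "type": "decimal"},
    ],
}

metadata_json = json.dumps(sidecar, indent=2)
```

Shipping orders.csv and this JSON in the same zip would cost almost nothing and remove most of the guesswork on the consumer side.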
We already have that. Dan Brickley and others put a lot of thoughtful effort into it <https://www.w3.org/TR/tabular-data-primer/#dialects>:
> A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines. [...] You can provide guidance to processors that are trying to parse those files through the `dialect` property
As is usually the case with standards, it's not that the standard doesn't exist but that people just don't even bother checking (much less caring about what it says or actually trying to follow it).
My use case is that other people can’t or won’t read anything but plain text.
Like I've recently found it much easier to deal with schema evolution in pyspark with a lot of historical CSVs than historical parquets. This is essentially a pyspark problem, but if everything works worse with your data format then maybe it's the format that's the problem. CSV parsing is always and everywhere easy, easier than the problems parquets often throw up.
The only time I'd recommend parquet is if you're setting up a pipeline with file transfer and you control both ends... but that's the easiest possible situation to be in; if your solution only works when it's a very easy problem then it's not a good solution.
- data that has predictable value types (mostly numbers and short labels would be fine), e.g. health data about a school class wouldn't involve random binary fields or unbounded user input
- data that has a predictable, manageable length, e.g. the health data of the school class wouldn't be dramatically longer than the number of students in that class
- data with a long sampling period. If you read that dataset once a week performance and latency become utterly irrelevant
- if the shape of your data is already tabular and not e.g. a graph with many references to other rows
- if the gain in human readability and compatibility for the layperson outweighs potential downsides about the format
- if you use a sane default for encoding (utf8, what else), quoting, escaping, delimiter etc.
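The sane-defaults point can be sketched with Python's stdlib csv module (one reasonable set of defaults, not the only one):

```python
import csv
import io

buf = io.StringIO()
# Boring, explicit defaults: comma delimiter, quote every field,
# plain "\n" line endings; quotes inside fields are doubled ("")
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["name", "comment"])
writer.writerow(["Alice", 'said "hi", then left'])
output = buf.getvalue()
```

Quoting everything is slightly wasteful but removes any ambiguity about embedded commas and quotes.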
Every file format is a choice, often CSV isn't the wrong one (but: very often it is).
Dumping to CSV is built into MySQL and Postgres (though MySQL has better support), is faster on export and much faster on import, doesn't fill up the file with all sorts of unneeded text, can be diffed (and triangulated by git) line by line, is human readable (eg. grepping the CSV file) and overall makes for a better solution than mysqldumping INSERTs.
In Docker, I can import millions of rows in ~3 minutes using CSV; far better than anything else I tried when I need to mock the whole DB.
I realize that the OP is more talking about using CSV as a interchange format or compressed storage, but still would love to hear from others if my love of CSV is misplaced :)
Either that or use the ASCII codes for field and row delimiters on a UTF-8 file without a BOM.
Even then you're still stuck with data-encoding issues for numbers and booleans. And that doesn't even cover all the holes I've seen in CSV in real-world use by banks and govt agencies over the years.
When I've had to deal with varying imports, I push for a scripted (JS/TS or Python) preprocessor that takes the vendor/client format and normalizes it to line-delimited JSON; then that output gets imported. It's far easier than trying to create a flexible importer application.
Edit: I've also advocated for using SQLite3 files for import, export and archival work.
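A toy version of that kind of preprocessor, assuming hypothetical vendor column names:

```python
import csv
import io
import json

def normalize(csv_text, rename=None):
    # Parse a vendor CSV and emit line-delimited JSON, mapping vendor
    # column names to canonical ones; unlisted columns pass through
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        json.dumps({rename.get(k, k): v for k, v in row.items()})
        for row in reader
    )

# "Cust No" and "Amt" are made-up vendor column names
vendor_csv = "Cust No,Amt\n42,19.99\n"
out = normalize(vendor_csv, rename={"Cust No": "customer_id", "Amt": "amount"})
```

Per-vendor quirks (date formats, locale-specific numbers) live in one small script each, instead of in one importer that must handle everything.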
I've never come across this Parquet format. Is it greppable? Gzipped CSV is. Can a regular bean-counter import Parquet into their spreadsheet software? A cursory web search indicates they can't without having a chat with IT, while SQL might be easier and XML seems pretty straightforward.
Yes, CSV is kind of brittle, because the peculiarities of a specific source are like an informal schema, but someone versed in whatever programming language makes this Parquet convenient won't have much trouble figuring out a CSV.
If you want CSV, consider a normalization step. For instance, make sure numbers have no commas and a "." decimal point. Probably quote all strings. Ensure you have a header row.
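A minimal sketch of one such normalization step, assuming US-style number formatting:

```python
def normalize_number(raw):
    # Strip thousands separators so "1,234.50" becomes "1234.50".
    # Assumes "." as the decimal point; locales that use "," as the
    # decimal separator need the opposite treatment.
    return raw.replace(",", "").strip()

cleaned = [normalize_number(v) for v in ["1,234.50", " 7 ", "0.99"]]
```
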
Probably don't reach for a CSV if:
- You have long text blobs with special characters (ie quotes, new lines, etc.)
- You can't normalize the data for some reason (ie some columns have formulas instead of specific data)
- You know that every user will always convert it to another format or import it
It's "Comma Separated Variables"; it doesn't really need any more specification than that.
These files have always imported into M$ and libre office suites without issue.
Normally, I wouldn't nitpick (hold for laughter) but this is just a perfect example of how CSV is not CSV and how it's probably impossible to support everything people call CSV, simply because of the tiny (or not so tiny) differences between formats.
- Supporting "" (single)
- Supporting newlines inside "" (oops, now you can't getline() and instead need getdelim())
- Supporting # comments (why is this even a thing?)
- Supporting multiple "" in a field
- Escaping " with "" or \"
- Length-based CSV, so all fields are seekable
It's a mess, which one's your csv?
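For delimiter confusion specifically, Python's csv.Sniffer can guess a dialect from a sample; a sketch (it won't rescue the other variations listed above):

```python
import csv
import io

sample = "a;b;c\n1;2;3\n"
# Sniffer guesses the delimiter from a sample; restricting the candidate
# set makes the guess much more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
rows = list(csv.reader(io.StringIO(sample), dialect))
```
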
I think Hanlon's razor applies in that situation.
Outside of the context of end user, I don't see any advantages in this compared to xml or json export.
Not as valuable as human-readable files.
And what kind of monstrous CSV files has this dude been working with? Data types? Compression? I just need to export 10,000 names/emails/whatevers so I can re-import them elsewhere.
Like, I guess once you start hitting GBs, an argument can be made, but this article sounds more like "CSV considered harmful", which is just silly to me.
No query params in their single endpoint and only csv exports possible.
Then add to that that Shopify, apparently the leader or whatever in shopping software, can't do better than require exactly the format they say; don't you dare come with configurable fields or mapping.
The industry is stuck in the 00s, if not 90s.
When you hit the inevitable "friends don't let friends" or "considered harmful" type of people, it's time to move quickly past them and let the actual situation dictate the best solution.
Lost me right there. It has to be human readable.
Cf a tweet I wrote in one of my past lives: https://x.com/datao/status/1572226408113389569?s=20
https://www.reddit.com/r/vim/comments/1bo41wk/entering_and_d...?
- the sender can produce it incrementally
- the receiver can begin processing it as soon as the first byte arrives (or, more precisely, the first unescaped newline)
- gzip compression works without breaking the streaming nature
Yeah, it’s a flawed interchange format. But in a closed system over HTTP it’s brilliant.
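A rough sketch of the streaming pattern in Python (framework-agnostic; a web handler would yield these chunks as the response body):

```python
import csv
import io

def stream_csv(rows):
    # Yield one CSV line at a time; a server can send each chunk as
    # soon as it is produced, and the client can parse line by line
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)

chunks = list(stream_csv([["id", "name"], [1, "a"], [2, "b"]]))
```

Wrapping the iterator in a gzip stream preserves the incremental property, since gzip compresses and decompresses as a stream too.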
Yes CSV is a pain in many regards, but many of the difficulties with it arise from the fact that anybody can produce it with very little tool support - which is also the reason it is so widely used.
Recommending a decidedly niche format as an alternative is not going anywhere.
Most of the time you know the source pretty well and can simply ask about the value range.
Obviously there are edges, but they're edges by nature. And like you say, you usually know the source pretty well.
Also is backwards-compatible with most CSV parsers.
Don't come here to work if you don't want to drown in nitpicking and meaningless debates like this.
File formats like CSV will outlast religion.
Also, csv has a very pythonic interface compared to parquet; in most cases, if I can fit the file in memory, I'll go with CSV.
Try telling 45 year old salesman he needs to export his data in parquet. "Why would I need to translate it to French??"
I feel like I'm pretty up to date on stuff, and I've never heard of parquet or seen in as an option, in any software, ever.
Also some UPCs start with 0
Worst is when Excel saves the scientific notation back to the CSV, overwriting the correct number.
2. Enable the legacy Text Import Wizard as per [0]
3. Go to Data -> Get Data -> Legacy Wizards -> From Text (Legacy)
4. Set config based on your CSV file, typically select "Delimited" and "My data has headers" enabled
5. Click Next and pick the delimiter, typically "Comma"
6. Click Next and click the columns with UPCs, select "Text" in the "Column data format" area
7. Click Finish
(I'm not saying this is great, just sharing how to do it in case you don't know)
[0] https://professor-excel.com/import-csv-text-files-excel/
edit: you can also set up a PowerQuery query that will always open some CSV at some path and apply this config, but I don't want to have anything to do with PowerQuery, sorry.
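Outside Excel, the leading-zero problem disappears if you parse with a tool that treats everything as text by default; a sketch with Python's stdlib ("upc" is an illustrative column name):

```python
import csv
import io

# The csv module leaves every field as a string, so leading zeros
# survive; no scientific-notation mangling can happen
data = "upc,qty\n0012345678905,3\n"
rows = list(csv.DictReader(io.StringIO(data)))
upc = rows[0]["upc"]
```
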
I wish TSVs were more popular though. Tabs appear less frequently than commas in data.
My biggest recommendation is to avoid Excel! It will mangle your data if you let it.
Screams the Rustacean from his ivory tower
The alternative path involves parsing csv in order to turn it into a different csv, turning json into yaml and so forth. Parsing "human readable" formats is terrible relative to parsing XML. Go with the unambiguous source format and turn it into whatever is needed in various locations as required.