I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.
I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.
Then there are the TSV and semicolon-separated variants.
Did I mention that field quoting was optional?
And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.
Free database like Postgres? Perfect every time.
Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".
There would have been many better separators... but good idea to bring formatting into it as well...
To be fair, that's not a problem with CSV but with the provider's lack of data literacy.
Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.
Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.
Have you tried RFC 4180?
-different delimiters (comma, semi-colon, tab, pipe etc)
-different encodings (UTF8, UTF16 etc)
-different line ending (CR, LF, CR+LF)
-ragged rows
-splitting and merging columns
And much more besides.
However, if you have either:
-line feeds and/or carriage returns in data values, but no quoting
or
-quoting, but quotes in data values aren't properly handled
Then you are totally screwed and you have my sympathies!
I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.
Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.
But CSV under controlled circumstances is very simple.
And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.
Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping and UTF-8 support are particular problem areas, and also you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).
Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON, I would say JSON is the most convenient to use in client code, but its lack of types also makes it less broadly useful than SGML, which is rightly used by e.g. airlines for technical documentation and by digital humanities researchers to encode/annotate historic documents, for which it is very suitable, though it programmatically puts more burden on developers. You can't have it all...
XML is simpler than SGML, has perhaps the broadest scope and a good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), though I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments about what to use each time, but some people go ideological.
(Waiting for someone to write a love letter to the infamous Windows INI file format...)
jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.
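For what it's worth, the newline-separated flavour (often called JSONL or NDJSON) needs no special tooling at all; a minimal Python sketch using only the stdlib:

```python
import json

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Writing newline-separated JSON: one compact JSON document per line.
jsonl = "\n".join(json.dumps(r) for r in records)

# Reading it back is nothing more than a loop over lines.
parsed = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
assert parsed == records
```

The record-separator variant (RFC 7464, `application/json-seq`, which jq emits with `--seq`) works the same way with `\x1e` prefixes instead of bare newlines.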
Also, encoding things like IEEE NaN/Infinity and raw byte arrays has to be done in nonstandard, tool-specific ways.
“No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”
I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.
I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.
https://github.com/medialab/xan/blob/master/docs/LOVE_LETTER...
Also these people have only ever had to deal with the American Excel localization.
So yeah, with the caveat of "only ever use Excel and only ever the American edition" CSV is pretty nice.
> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more and no less of a problem in JSON as well.
There is such a document: RFC 4180. It may not be a good document, but it does exist.
Haven’t been a full-time Python dev in some time though, and it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built-in INI parser and serializer.
Even though JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs.
When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.
I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.
You would think that e.g. XML-defined WSDL with an XSD schema is well battle-proven. Two years ago I encountered (and am still dealing with) a WSDL from a major banking vendor that is technically valid, yet no open source library in Java (of all languages) was able to parse it successfully or generate binding classes out of the box.
Heck, flat files can end up with extreme cases, just work enough with legacy banking or regulatory systems and you will see some proper shit.
The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.
Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
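A minimal sketch of that scheme in Python (the function names are mine, not the commenter's): commas, newlines and backslashes are the only characters that ever need escaping, and a record always ends at a real newline:

```python
def encode_field(s: str) -> str:
    # Order matters: escape backslashes first, then the delimiters.
    return s.replace("\\", "\\\\").replace(",", "\\,").replace("\n", "\\n")

def split_row(line: str) -> list[str]:
    # Split on unescaped commas, undoing the escapes as we go.
    fields, cur, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            cur.append("\n" if line[i + 1] == "n" else line[i + 1])
            i += 2
        elif c == ",":
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    fields.append("".join(cur))
    return fields

row = ["plain", "with,comma", "with\nnewline", "with\\backslash"]
line = ",".join(encode_field(f) for f in row)
assert split_row(line) == row  # lossless round trip
```

Because embedded newlines are escaped, every physical line is exactly one record, which is what makes seeking and parallel processing safe.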
If you want CSV-ish, enforce an array of strings for each record. Or go further with actual objects and non-string types.
You can even jump to an arbitrary point and then seek till you see an actual new line as it’s always a record boundary.
It’s not that CSV is an invalid format. It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.
It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.
For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions resulting in `\,` producing a comma while `\a` produces `\a`. This makes it a good rule, because it is only one rule with no exceptions.
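A sketch of that single rule, assuming the semantics described: a backslash simply makes the following character literal, with no special cases:

```python
def unescape(s: str) -> str:
    # One rule, no exceptions: "\X" always yields X verbatim.
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

assert unescape("\\,") == ","   # escaped comma is a literal comma
assert unescape("\\a") == "a"   # no surprise behaviour for other chars
assert unescape("plain") == "plain"
```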
CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.
After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.
I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.
And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.
Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!
There's a dearth of good resources about this around the web, this is the best I've come across: https://georgemauer.net/2017/10/07/csv-injection.html
The worst part about CSV: anyone can write a parser in about 30 minutes, which means it's very easy to get incorrect implementations, incorrect data, and other strange undefined behaviors. But to be clear, JSON and YAML also have issues, with everyone trying to reinvent the wheel constantly. XML is rather ugly, but it seems to be the most resilient.
So I cannot trust XML in depth, and depend on using a library that bought the spec and hopefully adheres to it.
Many of the criticisms of CSV I'm reading here boil down to something like: CSV has no authoritative standard, and everyone implements it differently, which makes it bad as a data interchange format.
I agree with those criticisms when I imagine them from the perspective of a user who is not also a programmer. If this user exports a CSV from one program, and then tries to load the CSV into a different program, but it fails, then what good is CSV to them?
But from the perspective of a programmer, CSV is great. If a client gives me data to load into some app I'm building for them, then I am very happy when it is in a CSV format, because I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file.
Parsing CSV is quick and fun if you only care about parsing one specific file. And that's the key: It's so quick and fun, that it enables you to just parse anew each time you have to deal with some CSV file. It just doesn't take very long to look at the file, write a row-processing loop, and debug it against the file.
The beauty of CSV isn't that it's easy to write a General CSV Parser that parses every CSV file in the wild, but rather that it's easy to write specific CSV parsers on the spot.
Going back to our non-programmer user's problem, and revisiting it as a programmer, the situation is now different. If I, a programmer, export a CSV file from one program, and it fails to import into some other program, then as long as I have an example of the CSV format the importing program wants, I can quickly write a translator program to convert between the formats.
There's something so appealing to me about simple-to-parse-by-hand data formats. They are very empowering to a programmer.
This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.
Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.
Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.
They're not without their drawbacks, like no official standards etc, but they do their job well.
I will be bookmarking this like I have the ini critique of toml: https://github.com/madmurphy/libconfini/wiki/An-INI-critique...
I think the first line of the toml critique applies to CSV: it's a federation of dialects.
Until I worked quite a bit with them and realized that there's always schema in the data, otherwise it's just random noise. The question is who maintains the schema, you or a dbms.
Re. formats -- the usefulness comes from features (like format enforcing). E.g. you may skip .ini at all and just go with lines on text files, but somewhere you still need to convert those lines to your data, there's no way around it, the question is who's going to do that (and report sane error messages).
My experience has indicated the exact opposite. CSVs are the only "structured" format nobody can claim to parse 100% (ok probably not true thinking about html etc, just take this as hyperbole.) Just use a well-specified format and save your brain-cells.
Occasionally, we must work with people who can only export to csv. This does not imply csv is a reasonable way to represent data compared to other options.
I can tell an MBA guy to upload a CSV file and i'll take care of it. Imagine i tell him i need everything in a PARQUET file!!! I'm no longer a team player.
Do the same with a .CSV file and you'll have to teach those people how to use the .CSV importer in Excel and also how to set up the data types for each column etc. It's a non trivial problem that forces you down to a few million people.
.CSV is a niche format for inexperienced software developers.
Thanks
On top of all the problems people mention here involving the precise definition of the format and quoting, it's outright shocking how long it takes to parse ASCII numbers into floating point. One thing that stuck with me from grad school was that you could do a huge number of FLOPS on a matrix in the time it would take to serialize and deserialize it to ASCII.
For a long time I was very wary of CSV until I learnt Python and started using its excellent csv standard library module.
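For what it's worth, a small illustration of why the stdlib module earns that praise: it copes with embedded commas, doubled quotes and even newlines inside fields out of the box:

```python
import csv
import io

# A record with an embedded comma, escaped quotes, and a newline in a field.
data = 'name,notes\n"Doe, Jane","He said ""hi""\nand left"\n'

rows = list(csv.reader(io.StringIO(data)))
assert rows == [["name", "notes"], ['Doe, Jane', 'He said "hi"\nand left']]
```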
I've since started looking at parquet files – which turned out to not be friendly to append-only operations. I've ended up implementing writing events into ipc files which then periodically get "flushed" into the parquet files. It works and it's efficient – but man is it non-trivial to implement properly!
My point here is: for a regular developer – CSV (or jsonl) is still the king.
Did you try setting `PRAGMA synchronous=FULL` on your connection? This forces fsync() after writes.
That should be all that's required if you're using an NVMe SSD.
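For reference, a minimal sketch of setting that pragma from Python's stdlib sqlite3 (the file and table names are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.db")
conn = sqlite3.connect(path)

# Force fsync() after each write transaction, as described above.
conn.execute("PRAGMA synchronous=FULL")

conn.execute("CREATE TABLE samples (t REAL, v REAL)")
with conn:  # commit (and fsync) per transaction
    conn.execute("INSERT INTO samples VALUES (?, ?)", (0.0, 1.23))

assert conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0] == 1
```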
But I believe most microSD cards do not even respect fsync() calls properly and so there's technically no way to handle power offs safely, regardless of what software you use.
I use SanDisk High Endurance SD cards because I believe (but have not fully tested) that they handle fsync() properly. But I think you have to buy "industrial" SD cards to get real power fail protection.
SQLite was fine until the realities of that environment hit.
0) I need to save the most data over time and my power budget is unpredictable due to environmentals.
1) When should I commit? SQLite commit-per-insert slows down, impacts battery life, impacts sample rate. Practically you could get away with batching all data for a small period.
2) SQLite is slow to repair databases. A partially written file would often take longer to repair than we had battery to run.
CSV based format filled that niche. First column was line-column count to support firmware upgrades. Last column is line-checksum. Another column indicating if this line was the last for an event. Parser skips corrupted lines/entries.
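The exact column layout isn't spelled out above, so here is a hypothetical sketch of the idea: a per-line CRC32 checksum column that lets the parser drop corrupted lines (fields are assumed not to contain commas):

```python
import zlib

def write_line(fields: list[str]) -> str:
    # Append a CRC32 of the line body as the last column.
    body = ",".join(fields)
    return f"{body},{zlib.crc32(body.encode()):08x}"

def read_line(line: str):
    # Verify the trailing checksum; return None so callers can skip the line.
    body, _, checksum = line.rpartition(",")
    if f"{zlib.crc32(body.encode()):08x}" != checksum:
        return None  # corrupted line: skip it
    return body.split(",")

good = write_line(["2024-01-01T00:00:00", "12.5", "1"])
assert read_line(good) == ["2024-01-01T00:00:00", "12.5", "1"]
assert read_line("X" + good[1:]) is None  # any single-byte flip is caught
```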
I think the industry standard for supporting this is something like iceberg or delta, it's not very lightweight, but if you're doing anything non-trivial, it's the next logical move.
Compare this CSV:

    field1,field2,fieldN
    "value (0,0)","value (0,1)","value (0,n)"
    "value (1,0)","value (1,1)","value (1,n)"
    "value (2,0)","value (2,1)","value (2,n)"

to the directly-equivalent JSON:

    [["field1","field2","fieldN"],
     ["value (0,0)","value (0,1)","value (0,n)"],
     ["value (1,0)","value (1,1)","value (1,n)"],
     ["value (2,0)","value (2,1)","value (2,n)"]]

The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV, ranging from entries using query-string syntax to some entirely custom arrangement:

    person,val2,val3,valN
    fname=john&lname=doe&age=55&children=[jill|jim|joey],v2,v3,vN

And in these cases, JSON's objects are WAY better.

Because CSV is so simple, it's common for people to avoid using a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch:

    ["val1", "val2", "unexpected,comma", "valN"].join(',')
    // outputs `val1,val2,unexpected,comma,valN` which has one too many items

JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).

Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR, LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it UTF-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
[["field1","field2","fieldN"],
["value (0,0)","value (0,1)","value (0,n)"],
["value (1,0)","value (1,1)","value (1,n)"],
["value (2,0)","value (2,1)","value (2,n)"]]
I think it's a big stretch to use that JSON for comparison. In practice, one is much more likely to see this:

    [
      {
        "field1": "value (0,0)",
        "field2": "value (0,1)",
        "fieldN": "value (0,n)"
      },
      {
        "field1": "value (1,0)",
        "field2": "value (1,1)",
        "fieldN": "value (1,n)"
      },
      {
        "field1": "value (2,0)",
        "field2": "value (2,1)",
        "fieldN": "value (2,n)"
      }
    ]

A bit unfair to compare CSV without a parser library to JSON with a library.
Unless you need appendability, but then you should probably just use NDJSON/JSONL for a lot of cases.
"8. Reverse CSV is still valid CSV" is not true if there are header rows for instance.
But really, whether or not CSV is a good format or not comes down to how much control you have over the input you'll be reading. If you have to deal with random CSV from "in the wild", it's pretty rough. If you have some sort of supplier agreement with someone that's providing the data, or you're always parsing data from the same source, it's pretty fine.
(That is, of course, assuming the file doesn't contain newlines or tabs inside of fields. The format should use \t, \n etc. for those. What a missed opportunity.)
And all kinds of other weirdness, right in ascii. Vertical tabs, LOL. Put those in filenames on someone else's computer if you want to fuck with them. Linux and its common file systems are terrifyingly permissive in the character set they allow for file names.
Nobody uses any of that stuff, though.
As part of code review, the developer of the feature must be able to roundtrip export -> import a realistic test dataset using the same program and workflow that they expect a consumer of the data to use. They have up to one business day to accomplish this task, and are allowed to ask an end user for help. If they don't meet that goal, the PR is sent back to the developer.
What's fascinating about the exercise is that I've bounced as many "clever" hand-rolled CSV exporters (due to edge cases) as other more advanced file formats (due to total incompatibility with every COTS consuming program). All without having to say a word of judgment.
Data export is often a task anchored by humans at one end. Sometimes those humans can work with a better alternative, and it's always worth asking!
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too. Otherwise, you can:
zcat data.jsonl.gz | grep ...
Another advantage of JSONL is that it's easier to chunk into smaller files.

This whole thread is an uninformed rehash of bad ideas.
csvtk: https://bioinf.shenwei.me/csvtk/
gawk: https://www.gnu.org/software/gawk/manual/html_node/Comma-Sep...
awk: https://github.com/onetrueawk/awk?tab=readme-ov-file#csv
csvquote: https://github.com/dbro/csvquote
Especially for use with existing shell text processing tools, eg. cut, sort, wc, etc.
But of course, I am partial ;)
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
Like a rapidly mutating virus, yes.
> and easy to understand.
Gotta disagree there.
For example, one of the CSVs my company shovels around is our Azure billing data. There are several columns that I just have absolutely no idea what the data in them is. There are several columns we discovered are essentially nullable¹ The Hard Way, when we got a bill which, e.g., included a charge that I guess Azure doesn't know what day it occurred on? (Or almost anything else about it.)
(If this format is documented anywhere, well, I haven't found the docs.)
Values like "1/1/25" in a "date" column. I mean, I did say it was an Azure-generated CSV, so obviously the bar wasn't exactly high, but then it never is, because anyone wanting to build something with some modicum of reliability, or discoverability, is sending data in some higher-level format, like JSON or Protobuf or almost literally anything but CSV.
If I can never see the format "JSON-in-CSV-(but-we-fucked-up-the-CSV)" ever again, that would spark joy.
(¹after parsing, as CSV obviously lacks "null"; usually, "" is a serialized null.)
I've just checked and strings are escaped using the same mechanism for JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere for CSV (it's used for log exportation and no data for these triggers that code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
The python3 csv module basically does the job.
"You have a REST API? Parquet format available? Delivery via S3? Databricks, you say? No thanks, please send us daily files in zipped CSV format on FTP."
Requires a programmer
> Parquet format
Requires a data engineer
> S3
Requires AWS credentials (api access token and secret key? iam user console login? sso?), AWS SDK, manual text file configuration, custom tooling, etc. I guess with Cyberduck it's easier, but still...
> Databricks
I've never used it but I'm gonna say it's just as proprietary as AWS/S3 but worse.
Anybody with Windows XP can download, extract, and view a zipped CSV file over FTP, with just what comes with Windows. It's familiar, user-friendly, simple to use, portable to any system, compatible with any program. As an almost-normal human being, this is what I want out of computers. Yes the data you have is valuable; why does that mean it should be a pain in the ass?
Financial users live in Excel. If you stick to one locale (unfortunately it will have to be US) then you are OKish.
To me, CSV is one of the best examples of why Postel's Law is scary. Being a liberal recipient means your work never ends because senders will always find fun new ideas for interpreting the format creatively and keeping you on your toes.
More Microsoft BS: they should default to the ENG locale in CSV and do the translation in the background, and let the user choose if they want to save with a different separator. Excel in every part of the world should produce the same CSV by default. Bunch of idiots.
Personally I think we missed a trick by not using the ASCII US and RS characters:
Columns separated by \u001F (ASCII unit separator).
Rows separated by \u001E (ASCII record separator).
No escaping needed.
More about this at:
https://successfulsoftware.net/2022/04/30/why-isnt-there-a-d...
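A quick sketch of what that buys you: with US/RS as separators, serialising and parsing are just join and split, with no escaping logic at all (assuming the control characters never occur in the data, which is the whole point):

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator, record separator

rows = [["name", "notes"], ["Jane Doe", 'likes, commas and "quotes"']]

# Serialise: no quoting or escaping needed.
blob = RS.join(US.join(fields) for fields in rows)

# Parse: two splits and you're done.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```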
Unless you control the producer of the data you are stuck trying to infer the character encoding and transcoding to your destination, and there's no foolproof way of doing that.
Code editors may convert tabs to spaces but are you really editing and saving TSV data files in your code editor?
There are some notes I put together about TSV a few years ago that expand on these points: https://github.com/solidsnack/tsv?tab=readme-ov-file#motivat...
Because of this in order to read a plain simple TSV (fields separated by tabs, nothing more) with the Python csv module [2] you need to set the quote character to an improbable value, say € (using the euro sign because HN won't let me use U+1F40D), or just parse it by hand, e.g. row.split('\t').
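For what it's worth, the csv module also has `quoting=csv.QUOTE_NONE`, which tells the reader to do no special processing of quote characters; combined with `delimiter='\t'` it reads plain TSV without the improbable-quote-character trick:

```python
import csv
import io

data = 'a\tb "quoted" c\td\n'

# QUOTE_NONE: quotes are ordinary text, only tabs separate fields.
rows = list(csv.reader(io.StringIO(data), delimiter="\t",
                       quoting=csv.QUOTE_NONE))
assert rows == [["a", 'b "quoted" c', "d"]]
```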
For other use cases I would use newline separated JSON. Is has most of the benefits as written in the article, except the uncompressed file size.
It has a downside though: wherever JSON itself is used, it tends to be a few kilobytes at least (from an API response, for example). If you collect those in a JSONL file the lines tend to get verrrry long and difficult to edit. CSV files are more compact.
JSONL files are a lot easier to work with though. Less headaches.
I wonder how much we have been hindered ourselves by reinventing plain text human-readable formats over the years. CSV -> XML -> JSON -> YAML and that's just the top-level lineage, not counting all the branches everywhere out from these. And the unix folks will be able to name plenty of formats predating all of this.
I don't know what they were thinking, using a UI setting for parsing an interchange format.
There's a way around it, IIRC, with the "From Text/CSV" command, but that loses a lot of the convenience of double-clicking a CSV file in Explorer or whatever to open it in Excel.
That's it, but it's really bad.
It works better if you click to "import the data" instead of just opening the csv file with it, and if you then choose the right data types. But having to do this every time to make it work is really annoying, especially when you have a lot of columns, plus people can easily get confused with the data types. I have never seen that much confusion e.g. with macOS's Numbers.
You can use CSV for interchange, but a duck db import script with the schema should accompany it.
That being said I noticed .parquet as an export format option on Shopify recently and an hopeful more providers offer the choice.
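A minimal sketch of such an accompanying DuckDB import script; the table name, columns and file name are hypothetical, and the schema is the part that removes the guesswork from the CSV:

```sql
-- Hypothetical schema: names and types are illustrative only.
CREATE TABLE orders (
    order_id   BIGINT,
    ordered_at DATE,
    amount     DECIMAL(12, 2),
    currency   VARCHAR
);

-- Load the CSV against the declared schema instead of inferring types.
COPY orders FROM 'orders.csv' (HEADER, DELIMITER ',');
```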
"Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files." — UK Government Digital Service[2][3]
[1] https://www.w3.org/TR/tabular-data-primer/
[2] https://www.gov.uk/government/publications/recommended-open-...
1. https://github.com/manifold-systems/manifold/tree/master/man...
One of the biggest challenges with CSV files is the lack of data types on the header line that could help determine the schema for the table.
For example a file containing customer data might have a column for a Zip Code. Do you make the column type a number or a string? The first thousand rows might have just 5 digit numbers (e.g. 90210) but suddenly get to rows with the expanded format (e.g. 12345-1234) which can't be stored in an integer column.
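A tiny illustration of how inferring the type from the first rows backfires (the ZIP codes are made-up examples):

```python
def looks_like_int(v: str) -> bool:
    try:
        int(v)
        return True
    except ValueError:
        return False

values = ["90210", "10001", "12345-1234"]

# The first thousand rows suggest an integer column...
assert looks_like_int(values[0])
# ...until a ZIP+4 value shows up and breaks the guess.
assert not looks_like_int(values[2])
# And even "integer" ZIPs lose leading zeros: int("02134") == 2134.
```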
If CSV is indeed so horrible - and I do not deny that there can be an improvement - how about the clever data people spec out a format that
Does not require a bizarre C++ RPC struct definition library _both_ to write and to read
Does not invent a clever number encoding scheme that requires native code to decode at any normal speed
Does not use a fancy compression algorithm (or several!) that you need - again - native libraries to decompress
Does not, basically, require you be using C++, Java or Python to be able to do any meaningful work with it
It is not that hard, really - but CSV is better (even though it's terrible) exactly because it does not have all of these clever dependency requirements for clever features piled onto it. I do understand the utility of RLE, number encoding etc. I do not, and will not, understand the utility of Thrift/Avro, zstandard and brotli and whatnot over standard deflate, and custom integer encoding which requires you download half of Apache Commons and libboost to decode. Yes, those help the 5% to 10% of the use cases where massive savings can be realised. It absolutely ruins the experience for the other 90 to 95.
But they also give Parquet and its ilk a very high barrier of entry.
The second worst thing is that the escape character cannot be determined safely from the document itself.
It makes working with CSV/TSV files super simple.
It's as easy as this:
# Import tidyverse after installing it with install.packages("tidyverse")
library(tidyverse)
# Import TSV
dataframe_tsv <- read_tsv("data/FileFullOfDataToBeRead.tsv")
# Import CSV
dataframe_csv <- read_csv("data/FileFullOfDataToBeRead.csv")
# Mangle your data with dplyr, regular expressions, search and replace, drop NA's, you name it.
<code to sanitize all your data>
Multiple libraries exist for R to move data around, change the names of entire columns, change values in every single row with regular expressions, drop any values that have no assigned value; it's the Swiss Army knife of data. There are also all sorts of things you can do with data in R, from mapping with GPS coordinates to complex scientific graphing with ggplot2 and others.
Here's an example for reading iButton temperature sensor data: https://github.com/hominidae/ibutton_tempsensors/
Notice that in the code you can skip leading lines by passing skip = 18 as an argument:
cf1h <- read_csv("data/Coldframe_01_High.csv", skip = 18)
Reminds me of how I built a simple query language which does not require quotation marks around strings, this means that you don't need to escape strings in user input anymore and it prevents a whole bunch of security vulnerabilities such as query injections. The only cost was to demand that each token in the query language be separated by a single space. Because if I type 2 spaces after an operator, then the second one will be treated as part of the string; meaning that the string begins with a space. If I see a quotation mark, it's just a normal quotation mark character which is part of the string; no need to escape. If you constrain user input based on its token position within a rigid query structure, you don't need special escape characters. It's amazing how much security has been sacrificed just to have programming languages which collapse space characters between tokens...
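A minimal sketch of that idea in Python (hypothetical grammar: FIELD OP VALUE, with single spaces between tokens): the value token is taken verbatim, so quotes and extra spaces never need escaping.

```python
def parse_query(q: str):
    # Assumed rigid structure: exactly FIELD OP VALUE.
    # Splitting on at most two single spaces leaves the value untouched.
    field, op, value = q.split(" ", 2)
    return field, op, value

print(parse_query('name = say "hi" there'))  # ('name', '=', 'say "hi" there')
print(parse_query("name =  x"))              # ('name', '=', ' x')
# The second space after the operator becomes part of the string, as described.
```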
It's kind of crazy that we decided that quotation marks are OK to use as special characters within strings, but commas are totally out of bounds... That said, I think Tab-Separated Values (TSV) are even more broadly applicable.
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to Excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
It inserts an extra row at the top for its pivot table, with entries "Column1, Column2, ...".
So if you export to CSV again, you now have 2 header rows.
So Excel can't roundtrip CSVs, and the more often you roundtrip, the more header rows you get.
You need to remember to manually delete the added header row each time, otherwise software you export back to can't read it.
Now if they would just also allow pasting CSV data as “source” it would be great.
For one thing, it talks about only needing to quote fields containing commas and newlines... quotes are usually fine... until they appear at the start or end of the value. Then you NEED to quote the field as well.
Then there is the question about what exactly "text" is; with all the complications around Unicode, BOM markers, and LTR/RTL text.
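Python's csv module illustrates the quote edge case: with its default minimal quoting it wraps (and doubles) any field that contains a quote character at all, precisely because a quote at the start of an otherwise-unquoted field would be ambiguous.

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL (the default) quotes any field containing the quotechar:
csv.writer(buf).writerow(['he said "hi"', '"leading quote', 'plain'])

print(buf.getvalue())
# "he said ""hi""","""leading quote",plain
```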
This is what keeps me coming back.
This is a double-edged sword. The "you might even invent it yourself" simplicity means that in practice lots of different people do end up just inventing their own version rather than standardizing on RFC 4180 or whatever when it comes to quoting values containing commas, values containing quotes, values containing newlines, etc. And the simplicity means these kinds of non-standard implementations can go completely undetected until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards, and quickly surfaces a divergence from those standards, is helpful.
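A tiny illustration of how silently this fails (Python stdlib, made-up data): a homegrown writer that never quotes produces output that parses without any error at all, just wrongly.

```python
import csv
import io

# Homegrown "CSV" writer that forgot about quoting:
row = ["Acme, Inc.", "NY"]
text = ",".join(row)

parsed = next(csv.reader(io.StringIO(text)))
print(parsed)  # ['Acme', ' Inc.', 'NY'] -- three fields, zero errors raised
```

Nothing flags the mismatch; you only notice when a downstream join or count comes out wrong.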
Aside from that: yes, if CSV were an intentional, defined format, most of us would do something different here and there. But it is not; it is more of a convention that came upon us. CSV "happened", so to say. No need to defend it more passionately than the fact that we walk on two legs. It could have been much worse, and it has surprising advantages over other things that were well thought out before we did them.
Way easier to parse
I also asked W3C on their GitHub if there was any spec for CSV headers and they said there isn't [1]. Kind of defeats the point of the spec in my opinion.
I looked at the RFC. What is controversial about it?
Of these 3 options SAS datasets are my preference, but I'll immediately convert to CSV or Excel. CSV is a close 2nd; once you confirm the quoting / separator conventions it's very easy to parse. I understand why someone may find the CSV format disagreeable, but in my experience the alternatives can be so much worse that I don't worry too much about CSV files.
It’s sort of like, the computing dream when I was growing up.
+1 to ini files. I like that you can mess around with them yourself in notepad. Wish there was a general outline / structure to those, though.
The first part of converting data to CSV works fine with the help of an AI coding assistant.
The reverse part, CSV to database, is getting challenging, and even Claude Sonnet 3.7 is not able to escape newlines correctly.
I am now implementing the data format in JSON, which is much simpler.
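For reference, the RFC 4180 treatment of embedded newlines is mechanical: quote the field, and a compliant reader reassembles the record across physical lines. A stdlib sketch:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["id1", "line one\nline two"])
print(repr(buf.getvalue()))  # 'id1,"line one\nline two"\r\n'

# A compliant reader stitches the logical record back together:
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['id1', 'line one\nline two']
```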
For numerical data, nothing beats packing floats into blobs.
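For example, with Python's struct module (a sketch of the blob idea): fixed-width little-endian doubles, exact round-trip, no parsing, no locale issues.

```python
import struct

values = [1.5, -2.25, 3.0]
blob = struct.pack(f"<{len(values)}d", *values)  # little-endian float64
print(len(blob))  # 24 -- 8 bytes per double, always

restored = list(struct.unpack(f"<{len(blob) // 8}d", blob))
print(restored)   # [1.5, -2.25, 3.0] -- bit-exact, unlike decimal text
```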
But god help you if you have to accept CSVs from random people/places, or there's even minor corruption. Now you need an ELT pipeline and manual fix-ups. A real standard is way better for working with disparate groups.
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
I can depend on parquet. The only real disadvantages with parquet are that they aren't human-readable or mutable, but I can live with that since I can easily load and resave them.
This is a good blog post and Xan is a really neat terminal tool.
9. Excel hates CSV
It clearly means CSV must be doing something right.
This is one area where LibreOffice Calc shines in comparison to Excel. Importing CSVs is much more convenient.

You have to replace the "T" separator with a space, and also strip any trailing "Z" UTC suffix (and I think any other timezone offset as well?), for Excel to be able to parse it as a date/time.
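That fix-up is trivial to script; a minimal Python sketch of the transformation described (assuming plain "Z"-suffixed ISO 8601 input):

```python
def excelize(ts: str) -> str:
    # Excel wants "YYYY-MM-DD HH:MM:SS": space instead of "T", no "Z".
    return ts.replace("T", " ").removesuffix("Z")

print(excelize("2024-05-01T13:45:00Z"))  # 2024-05-01 13:45:00
```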
Then I go do it somewhere else and have to set it up all over again. Drives me nuts. How the default behavior isn't to just put them in the way you'd expect is mind-boggling.
No, CSV is dependently typed. Way cooler ;)
I wrote something about this https://github.com/Ericson2314/baccumulation/blob/main/datab...
Use a different separator if you need to.
CSV is the Vim of formats. If you get a CSV from 1970 you can still load it.
It was pretty straightforward (although tedious) to write custom CSV data exports in embedded C, with ZERO dependencies.
I know, I know, only old boomers care about removing pip from their code dev process, but, I'm an old boomer, so it was a great feature for me.
Straight out of libc I was able to dump data in real-time, that everyone on the latest malware OSes was able to import and analyze.
CSV is awesome!
The following are all valid CSV, and they should all mean the same thing, depending on your point of view:
1) foo, bar, foobar
2) "foo", "bar", "foobar"
3) "foo", bar, foobar
4) foo; bar; "foobar"
5) foo<TAB>bar<TAB>"foobar"
6) foo<EOL>bar<EOL>foobar<EOL>
Have fun writing that parser!
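Even within a single dialect the ambiguity bites. With Python's csv module, variant 3 above parses two different ways depending on one reader option, because a quote only counts when it opens the field:

```python
import csv
import io

line = 'foo, bar, "foobar"'

print(next(csv.reader(io.StringIO(line))))
# ['foo', ' bar', ' "foobar"'] -- quotes kept: they don't start the field

print(next(csv.reader(io.StringIO(line), skipinitialspace=True)))
# ['foo', 'bar', 'foobar']     -- now they do
```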
Most reasonably large CSV files will have issues parsing on another system.
it's surely simple but..
- where is the metadata?
- how do you encode binary?
- how do you index?
- where are the relationships? different files?
Does it though? Seems to be importing from and exporting to CSV just fine? Elaborate maybe.
Never want to use sas7bdat again.
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a world of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself that CSV fits, and that is what makes it a legend. Are those the times when I want to parallelise or deal with gigs of data in one sitting? Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format... :)
> CSV is not a binary format, can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can both be read and edited by humans directly, somehow.
This is why you should run screaming when someone says they have to integrate via CSV. It's because they want to do this.
Nobody is "pretending CSV is dead." It'll never die, because some people insist on sending hand-edited, unvalidated data files to your system and not checking for the outcome until mid-morning the next day when they notice that the text selling their product is garbled. Then they will frantically demand that you fix it in the middle of the day, and they will demand that your system be "smarter" about processing their syntactically invalid files.
Seriously. I've worked on systems that took CSV files. I inherited a system in which close to twenty "enhancement requests" had been accepted, implemented, and deployed to production that were requests to ignore and fix up different syntactical errors, because the engineer who owned it was naive enough to take the customer complaints at face value. For one customer, he wrote code that guessed at where to insert a quote to make an invalid line valid. (This turned out to be a popular request, so it was enabled for multiple customers.) For another customer, he added code that ignored quoting on newlines. Seriously, if we encountered a properly quoted newline, we were supposed to ignore the quoting, interpret it as the end of the line, and implicitly append however many commas were required to make the number of fields correct. Since he actually was using a CSV parsing library, he did all of this in code that would pre-process each line, parse the line using the library, look at the error message, attempt to fix up the line, GOTO 10. All of these steps were heavily branched based on the customer id.
The first thing I did when I inherited that work was make it clear to my boss how much time we were spending on CSV parsing bullshit because customers were sending us invalid files and acting like we were responsible, and he started looking at how much revenue we were making from different companies and sending them ultimatums. No surprise, the customers who insisted on sending CSVs were mostly small-time, and the ones who decided to end their contracts rather than get their shit together were the least lucrative of all.
> column-oriented data formats ... are not able to stream files row by row
I'll let this one speak for itself.