This is a plea for sanity in the software development world: while we need to write CSV importers from time to time, please never write CSV exporters. CSV must die.

The problem with CSV is that the comma is very commonly used in data, making it a uniquely stupid choice as a delimiter. That stupidity could be countered with consistent escaping, but no true standard has emerged in the many decades since this productivity abuse began. Different software products use different methods of escaping. For example, Microsoft products have used a variety of escaping methods over the years, sometimes changing between versions of a single product. There simply is no consistency on this.

Making a stupid decision even more stupid, most CSV files quote non-numeric values, as though the programmers did not realize that the quote character is also commonly used in our language and is therefore likely to turn up in the data. So using the quote as an escape means that you must escape the escape. No, really. Sometimes the escape for an in-data double quote is a doubled double quote, which can make it hard to know what to do with empty values shown as "", especially given that in some formats the empty value abuts the adjacent commas, while in others, like MySQL dumps, it abuts only the trailing comma but has a space before the leading one. Other times double quotes are escaped as \", meaning you'll need to escape any in-data backslashes too. For thinking people, about the time you realize that you're escaping the escape that you've escaped to handle your data, it would occur to you to go back and check your original premise. But not so for the creators of CSV.
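To see the ambiguity concretely, here is the same record, with a first field of He said "no", then left, as two different products might write it (a made-up record, shown purely for illustration):

```
"He said ""no"", then left",field two
"He said \"no\", then left",field two
```

A parser written for either convention will silently mangle output produced under the other.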
To appreciate the scope of this insanity, let's look at a popular sample data set used for testing CSV parsers:

```
FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
```

With that example data set, here's an exercise for the reader: using your favorite programming language, write a parser which translates that data set into a simple tab-delimited format.

CSV parsers are available in libraries for most languages, but many of them don't handle the full range of exceptions presented in that data set, exceptions which will inevitably show up in real-world use sooner or later. And when these parsers fail, data is imported incorrectly or not at all, customers become frustrated, and support costs rise. This happens every day across the software world, costing humanity millions of person-hours among programmers and end-users alike.

To illustrate the complexity inherent in such an algorithm, here's a LiveCode function to translate CSV into a simple tab-delimited format, courtesy of Alex Tweedly via Mike Kerner's GitHub repository. LiveCode makes a good example here because it's readable enough that programmers familiar with nearly anything else can probably follow it well enough:

```
function CSVToTab pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
   -- v 3 -> 4 put back in replace TAB with pNewTAB in 'passedquote'
   -- v 4 -> 5 put in the two replace statements in the trailing empty case
   -- fill in defaults
   if pOldLineDelim is empty then put CR into pOldLineDelim
   if pOldItemDelim is empty then put COMMA into pOldItemDelim
   if pNewCR is empty then put numtochar(11) into pNewCR
   -- Use
```

You've probably noticed the horror straight away: since commas are the delimiter and can also appear in data, the only reliable algorithm requires very careful examination of the data, maintaining a flag to note when you're within a field and when you've found a field boundary (sketched below). It's hard to imagine a format which would require greater inefficiency. Even if you get it right, the best you can achieve is a tremendous waste of clock cycles, contributing even further to the time humanity loses to the CSV format.
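For concreteness, here is a bare-bones sketch of that flag-tracking approach. This is my own minimal illustration, not Tweedly's function; it handles only the doubled-quote convention on a single line, with no backslash escapes and no returns inside quoted fields:

```
-- Translate one line of CSV into tab-delimited text by walking it
-- one character at a time, tracking whether we are inside a quoted
-- field, and emitting tabs only for unquoted commas.
function naiveCSVLineToTab pLine
   local tOutput, tInQuotes, tSkipNext, i, tChar
   put false into tInQuotes
   put false into tSkipNext
   repeat with i = 1 to the number of chars in pLine
      if tSkipNext then
         put false into tSkipNext
         next repeat
      end if
      put char i of pLine into tChar
      if tChar is quote then
         if tInQuotes and (char (i + 1) of pLine is quote) then
            -- a doubled quote inside a quoted field is a literal quote
            put quote after tOutput
            put true into tSkipNext
         else
            -- otherwise the quote toggles the in-field flag
            put not tInQuotes into tInQuotes
         end if
      else if tChar is comma and not tInQuotes then
         -- an unquoted comma is a field boundary
         put tab after tOutput
      else
         put tChar after tOutput
      end if
   end repeat
   return tOutput
end naiveCSVLineToTab
```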
Okay, CSV is stupid. So what should we use instead? Tab-delimited or even (in fewer cases) pipe-delimited formats make much saner options. For my delimited exports in Fourth World products I've adopted the convention used in FileMaker Pro and others, with escapes that are unlikely to appear in the data:

- Tabs delimit fields; returns delimit records.
- Returns within field data are converted to ASCII 11 (the vertical tab).
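Here's a minimal sketch of one way to apply that convention in LiveCode. It's my own illustration, with made-up handler names; in particular, converting in-data tabs to spaces is my assumption, not part of the convention described above:

```
-- Escape a single field value for tab-delimited export.
-- The return-to-vertical-tab mapping follows the convention above;
-- the tab-to-space mapping is an assumption for this sketch.
function escapeForTabExport pValue
   replace return with numToChar(11) in pValue -- in-data returns -> ASCII 11
   replace tab with space in pValue -- assumed: in-data tabs -> spaces
   return pValue
end escapeForTabExport

-- Writing a record is then simple concatenation:
--   put escapeForTabExport(tName) & tab & \
--      escapeForTabExport(tAddress) & return after tExport
-- And reading it back needs no state machine at all:
--   set the itemDelimiter to tab
--   put item 2 of line 1 of tExport into tAddress
```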
Simple to write, lightning-fast to parse. When you add up all the programmer- and end-user-hours lost to dealing with the uniquely stupid collection of mish-mashed ad hoc formats that is CSV, it amounts to nothing less than a crime against humanity. Several hundred if not thousands of human lifetimes have been lost, either to bugs in CSV parsers or simply to waiting on the inherently slow parsing of CSV that would have taken mere milliseconds in any saner format.

Good moral upbringing and a sense of responsibility to humanity compel me to note Postel's Law in any discussion of the ridiculously insane inefficiency inherent in parsing CSV:

> Be conservative in what you do, be liberal in what you accept from others.

Accept CSV when you must, but never emit it.
CSV must die. Please help it die: never write CSV exporters.