


CSV Must Die

A Plea to Introduce Sanity to the Software Development World by
Pledging to Never Write Exporters for Comma-Separated Values

Richard Gaskin
Fourth World Systems
ambassador@fourthworld.com

First draft published: 24 July, 2011
Updated 10 June, 2016

Summary:

1. A delimiter character is useful in inverse proportion to the frequency with which it appears in content.
2. Postel's Law: "Be liberal in what you accept, and conservative in what you send."


This is a plea for sanity in the software development world:

While we need to write CSV importers from time to time, please never write CSV exporters.

CSV must die.

The problem with CSV is that the comma is very commonly used in data, making it a uniquely stupid choice as a delimiter.

That stupidity could be countered with consistent escaping, but no true standard has emerged in the many decades since this productivity abuse began. Different software products use different methods of escaping. For example, Microsoft products have used a variety of escaping methods over the years, sometimes changing between versions of a single product. There is simply no consistency.

Making a stupid decision even more stupid, most CSV files quote non-numeric values, as though the programmers did not realize that the quote character is also commonly used in written language and is therefore likely to appear in the data. So using the quote as an escape means that you must escape the escape. No, really.

Sometimes the escape for an in-data double quote is a doubled double quote, which can make it hard to know what to do with empty values shown as "", especially given that in some formats the empty value abuts the adjacent commas, while in others, like MySQL dumps, it abuts only the trailing comma but has a space before the leading comma. Other times double quotes are escaped with \", which means you must also escape any in-data backslashes.
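To see how incompatible these conventions are in practice, here is a small sketch in Python (used here rather than LiveCode purely for brevity) showing that the same logical record, escaped in the two styles just described, needs two different parser configurations before it reads correctly:

```python
import csv
import io

# The same logical record, escaped two different ways.
doubled = '"John ""Da Man""",Repici'        # quote escaped by doubling
backslashed = '"John \\"Da Man\\"",Repici'  # quote escaped with a backslash

# Python's csv module must be told which convention to expect;
# feed it the wrong one and the fields come out mangled.
row_doubled = next(csv.reader(io.StringIO(doubled)))  # default: doubled quotes
row_backslashed = next(csv.reader(
    io.StringIO(backslashed), doublequote=False, escapechar="\\"))

print(row_doubled)      # ['John "Da Man"', 'Repici']
print(row_backslashed)  # ['John "Da Man"', 'Repici']
```

A CSV consumer has no reliable way to know in advance which configuration a given file requires; it has to guess, sniff, or ask the user.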

For thinking people, about the time you realize you're escaping the escape you added to handle your data, it would occur to you to go back and question your original premise. But not so, apparently, for the creators of CSV.

To appreciate the scope of this insanity, let's look at a popular sample data set used for testing CSV parsers:

FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123


Handy as it is, it still doesn't represent the full range of allowable variants, as it doesn't include in-data returns, which are acceptable in CSV. This modified example includes one (note "At the Plaza"):

FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At
  the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123

With that example data set, here's an exercise for the reader: using your favorite programming language, write a parser which translates that data set into a simple tab-delimited format.
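If you'd rather check your work against one possible answer, here is a sketch in Python that leans on its standard csv module, which does handle quoted commas, doubled quotes, and in-data returns. The in-data escape characters follow the convention recommended later in this article; the function name is mine:

```python
import csv
import io

def csv_to_tab(text):
    """Translate CSV text into simple tab-delimited lines.

    In-data tabs and returns are replaced with ASCII 29 and ASCII 11,
    so one output line is always exactly one record.
    """
    out = []
    for row in csv.reader(io.StringIO(text)):
        fields = [f.replace("\t", "\x1d").replace("\n", "\x0b") for f in row]
        out.append("\t".join(fields))
    return "\n".join(out)

sample = '''FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At
  the Plaza"" road",SomeTown,SD, 91234
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
'''

print(csv_to_tab(sample))
```

Note that even this short answer only works because someone else already wrote, debugged, and maintains a stateful CSV parser underneath it.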

CSV parsers are available as libraries for most languages, but many of them don't handle the full range of exceptions presented in that data set, exceptions which will inevitably show up in real-world use sooner or later. And when these parsers fail, data is imported incorrectly or not at all, customers become frustrated, and support costs rise. This happens every day across the software world, costing humanity millions of person-hours for programmers and end-users alike.

To illustrate the complexity inherent in such an algorithm, here's a LiveCode function to translate CSV into a simple tab-delimited format, courtesy of Alex Tweedly via Mike Kerner's Github repository. LiveCode makes a good example here, because it's readable enough that programmers familiar with nearly anything else can probably follow it well enough:

function CSVToTab pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
   -- v 3 -> 4   put back in replace TAB with pNewTAB in 'passedquote'
   -- v 4 -> 5   put in the two replace statements in the trailing empty case
      -- fill in defaults
      if pOldLineDelim is empty then put CR into pOldLineDelim
      if pOldItemDelim is empty then put COMMA into pOldItemDelim
      if pNewCR is empty then put numtochar(11) into pNewCR   -- use ASCII 11 (vertical tab) for quoted CRs
      if pNewTAB is empty then put numtochar(29) into pNewTAB      -- use ASCII 29 (group separator) for quoted TABs
      --
      local tNuData                         -- contains tabbed copy of data
      local tStatus, theInsideStringSoFar
      --
      put "outside" into tStatus
      set the itemdel to quote
      repeat for each item k in pData
            -- put tStatus && k & CR after msg
            switch tStatus
                  case "inside"
                        put k after theInsideStringSoFar
                        put "passedquote" into tStatus
                        next repeat
                  case "passedquote"
                        -- decide if it was a duplicated escapedQuote or a closing quote
                        if k is empty then   -- it's a duplicated quote
                              put quote after theInsideStringSoFar
                              put "inside" into tStatus
                              next repeat
                        end if
                        -- not empty - so we remain inside the cell, though we have left the quoted section
                        -- NB this allows for quoted sub-strings within the cell content !!
                        replace pOldLineDelim with pNewCR in theInsideStringSoFar
                        replace TAB with pNewTAB in theInsideStringSoFar
                        put theInsideStringSoFar after tNuData
                  case "outside"
                        replace pOldItemDelim with TAB in k
                        -- and deal with the "empty trailing item" issue in Livecode
                        replace (pNewTAB & pOldLineDelim) with pNewTAB & pNewTAB & CR in k
                        put k after tNuData
                        put "inside" into tStatus
                        put empty into theInsideStringSoFar
                        next repeat
                  default
                        put "defaulted"
                        break
            end switch
      end repeat
      --
      -- and finally deal with the trailing item issue in input data
      -- i.e. the very last char is a quote, so there is no trigger to flush the
      --      last item
      if the last char of pData = quote then
                  replace pOldLineDelim with pNewCR in theInsideStringSoFar
                  replace TAB with pNewTAB in theInsideStringSoFar
            put theInsideStringSoFar after tNuData
      end if
      --
      return tNuData
end CSVToTab 

You've probably noticed the horror straight away: since commas are the delimiter and can also appear in data, the only reliable algorithm requires very careful examination of the data, maintaining a flag to note when you're within a field and when you've found a field boundary.

It's hard to imagine a format which would require greater inefficiency. Even if you get it right, the best you can achieve is a tremendous waste of clock cycles, contributing even further to the time humanity loses to the CSV format.

Okay, CSV is stupid. So what should we use instead?

Tab-delimited or (less commonly) pipe-delimited formats are much saner options.

For my delimited exports in Fourth World products I've adopted the convention used in FileMaker Pro and others, with escapes that are unlikely to be in the data:

  • records are delimited with returns
  • fields are delimited with tabs
  • quotes are never added, and are included only when they are part of the data
  • any in-data returns are escaped with ASCII 11
  • any in-data tabs are escaped with ASCII 29

Simple to write, lightning-fast to parse.
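As a sketch of how little code that convention needs, here is a hypothetical Python round-trip (the function names are mine, not from any Fourth World product). Notice that neither direction needs a state machine, quoting rules, or escaped escapes:

```python
# ASCII 11 (vertical tab) and ASCII 29 (group separator) almost never
# occur in real text data, which is what makes them safe in-data escapes.
IN_DATA_CR = "\x0b"   # stands in for an in-data return
IN_DATA_TAB = "\x1d"  # stands in for an in-data tab

def export_rows(rows):
    """Rows of strings -> return-delimited records of tab-delimited fields."""
    return "\n".join(
        "\t".join(f.replace("\n", IN_DATA_CR).replace("\t", IN_DATA_TAB)
                  for f in row)
        for row in rows)

def import_rows(text):
    """Inverse of export_rows: two splits and two replaces, nothing more."""
    return [[f.replace(IN_DATA_CR, "\n").replace(IN_DATA_TAB, "\t")
             for f in line.split("\t")]
            for line in text.split("\n")]

rows = [["Name", "Notes"], ["Tyler", 'Terrace "At\nthe Plaza" road']]
assert import_rows(export_rows(rows)) == rows
```

Because every record is exactly one line and every field boundary is exactly one tab, the parser never has to look ahead or track whether it is inside a quoted region.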

When you add up all the programmer- and end-user-hours lost to dealing with the uniquely stupid collection of mish-mashed ad-hoc formats that is CSV, it amounts to nothing less than a crime against humanity. Hundreds if not thousands of human lifetimes have been lost either to dealing with bugs in CSV parsers, or simply to waiting on the inherently slow parsing of CSV that could have taken mere milliseconds in any saner format.

Good moral upbringing and a sense of responsibility to humanity compel me to note Postel's Law in any discussion of the ridiculously insane inefficiency inherent in parsing CSV:

"Be liberal in what you accept, and conservative in what you send."

CSV must die.

Please help it die: never write CSV exporters.