Choose data formats that can be used by many tools

· ddrake's blog

the value of data is proportional to the number of tools that can process that data

# The value of your data is proportional to the number of tools with which it can be processed.

I previously wrote about choosing tools that can't be taken away from you. But tools -- software, apps, and so on -- act on data. A good tool is useless without data it can work on, and vice versa.

So: if your data is in a format that's legible to many tools, it's just more likely that one of them will do what you need.

# Actual life-and-death consequences: Excel spreadsheets, COVID-19 data

During the early part of the COVID-19 pandemic, the UK government was tracking test results and doing contact tracing -- before the vaccines and treatments like Paxlovid, this was our vital, first-line public health response.

Various labs reported their data, and at some point this got into an Excel spreadsheet.

At first glance, this seems like a great idea, especially at that time: everyone was scrambling and doing something fast and effective was crucial. An Excel spreadsheet is easy to make, send to others, and everyone has Excel -- so putting your data in that format is good, right?

Ah, but the devil is in the details: there are different versions of Excel and its underlying file format, and at one point, someone used an older version of the program, or a file format, that only supported one million rows. (Or, more likely, 2^20 = 1,048,576 rows.) And the program silently truncated the data.

This caused positive test results to be lost, and hence contacts to not be traced, and no doubt, to further COVID-19 infections, and no doubt to deaths.

See the BBC and The Guardian, and also a more general cautionary tale on data and spreadsheets from, well, this Cautionary Tales episode.

Now, part of the problem was the tool and the user interface which silently ignored user data -- the program could have alerted the user and somehow allowed them to split the too-large file into multiple files, maybe -- but the data format was the key limitation.

In this case, using a "dumber" text-based comma-separated values format would have been better. There's no arbitrary limit to how big such files can be; if you lose data from a CSV file, it's not the data format's fault.

# Data that can be taken away from you

In addition to simply preserving data, you need to be able to read it.

In my previous polemic, I described situations where tools were forcibly taken from me, leaving me with piles of data that was now rendered useless. Imagine Excel spreadsheet if you don't have access to Microsoft software.

(Okay, that's almost not a good example, because for Excel in particular, our reliance on that format is so pervasive that there are countless third-party tools that can read that format, so that Excel spreadsheets are perfectly useful and legible even to someone who never uses any Microsoft software!)

If you have data in a format readable by very few tools, the risk that those tools will change, or somehow make your data unreadable, is larger.

Conversely, choosing data formats legible by many, many tools opens you up to an upside risk...

# Upside risk: more tools may mean more value from your data

You have some data. You have a tool that works with that data, and you have some task you need to do with that data.

Imagine your tool can do the task. Nice!

But what if a different tool could do it better?

As a concrete example, imagine needing to change a date format in a spreadsheet; maybe from something with slashes into something with dashes, or similar.

Sure, you can do that in a spreadsheet program. But if your data is in CSV, or some other text-based format, you could write a Python script to do that. Or use sed, or awk, or any one of the thousands of other tools that can do search-and-replace in text. Maybe you can do the task very quickly and easily with another tool. Even nicer!