Know your Data Formats

October 21, 2013 in Data for CSOs


The emancipatory potential of data lies dormant until data is given life in computational applications. Data visualizations, interactive applications, and even simple analyses of data all require that data be made intelligible to some kind of computational process. Democratic engagement with data depends on data being intelligible to as many forms of computation as possible.

Data should therefore be distributed in a format that is both machine-readable and open. Machine readability ensures that data can be processed with a minimum of human intervention and fuss. The use of an open file format gives users access to information without proprietary or specialized software. Any deviation from machine readability and format openness represents a hindrance to user engagement.

Machine readability

“Machine readability” means making meaningful structure explicit. The most machine-readable formats make their structure completely transparent. Unstructured documents demand that the user create structure from scratch.

Unstructured documents are not fundamentally bad—if they were, the fact that the most popular file formats for documents (PDF, Word) and bitmap images (GIF, JPEG, PNG, BMP) are unstructured would be very strange. Unstructured documents are simply unsuitable as vehicles for data. They are designed to be displayed on a screen or printed rather than to be processed programmatically. Machine-readable data formats, on the other hand, are simple and direct encodings of standard data structures. Since they contain no display information, they are not particularly easy for humans to read. Data, however, is not meant to be simply read in the raw.

Data comes in many structures, but the most common structure is the table. A table represents a set of data points as a series of rows, with a column for each of the data points’ properties. Each property may take on any value that can be represented as a string of letters and numbers. The machine-readable CSV (comma-separated values) or TSV (tab-separated values) formats are excellent encodings of tabular data. CSV and TSV files are simply plain text files in which each line represents a row and, within each line, a comma (for CSV) or a tab character (for TSV) separates columns. All data wrangling systems, great and small, include facilities for working with CSV and TSV files.

image02 Two views of a CSV table: Toronto Mayor Rob Ford’s voting record, from toronto.ca. Above: display view from toronto.ca website. Below: plaintext view in Sublime Text 2 text editor.

Some data includes structure which cannot be explicitly encoded in a table. Say, for example, each data point can be associated with some arbitrarily long list of names. This list could be represented by a string—but as far as the table structure itself is concerned, this list is not a list but just an ordinary string with the same structure as a name or a sentence. Additional structure like lists can be directly encoded with more flexible formats like JSON (JavaScript Object Notation) or XML (eXtensible Markup Language). JSON represents data in terms of JavaScript data types, including arrays for lists and “objects” for key-value maps. XML represents data as a tree of HTML-like “elements”. Both formats are very widely supported.

image00 Some XML data, viewed in Sublime Text 2: the opening words of Ovid’s
Metamorphoses, from the Latin Dependency Treebank.

Open formats

Not all data formats are created equal. Some are created under restrictive licenses or are designed to be used with a particular piece of software. Such formats are not suitable for the distribution of data, as they are only useful to users with access to their implementations and will cease to be useable at all once their implementations are unavailable or unsupported.

Closed and proprietary file formats are encountered with tragic frequency in the world of data distribution. The most common such formats are the output of Microsoft Office software, including Microsoft Word documents (.doc, .docx) and Microsoft Excel spreadsheets (.xls, .xlsx). Many pieces of free and open-source software are able to import Microsoft Office documents, but these document formats were not designed with this in mind and present considerable difficulties.

Using open and free file formats does a great deal to lower the barrier to entry. An open file format is one defined by a published specification that is made publicly available and which can therefore be implemented by anyone who cares to do so. All of the machine-readable formats described in the Machine readability section above are open formats. All of them are therefore eminently suitable as vectors of data distribution.

image01 Some JSON data, viewed in Sublime Text 2: Green P Parking data from toronto.ca

Getting started

Getting started with machine-readable and open data can be as simple as clicking “Save As”. For tabular data, most spreadsheet software—Microsoft Excel, LibreOffice, Google Drive—allows you to choose to save your spreadsheet as a CSV. Making this choice is a first step towards making your data public. For many purposes, this first step is enough.

Going beyond tabular data generally means going beyond spreadsheet software into the realm of programming. JSON and XML are generally created, as well as processed, by specially written code. Learning to read and write structured data with code using standard tools (e.g. Python’s [json (http://docs.python.org/2/library/json.html) and [/xml (http://docs.python.org/2/library/xml.html) libraries) is an excellent way to begin your new life as a civic hacker.

Flattr this!