The Data Pipeline
The Data Pipeline is School of Data’s approach to working with data from beginning to end. Once you understand your action cycle and the stakeholders, it will be time to work with the data, and we have broken this process down into steps. The Data Pipeline is a work in progress: we started out suggesting five steps, but our community is constantly experimenting and tweaking it to reflect the core steps that are present in every kind of data-driven project. The steps are:
* Define: Data-driven projects always have a “define the problem you’re trying to solve” component. It’s at this stage that you start asking questions and identify the issues that will matter in the end.
* Find: We also have to find the data we need. There are a lot of tools and techniques to do that, ranging from a simple question on your social network to a search engine, open data portals, or a Freedom of Information request asking what data is available in a given branch of government.
* Get: You need to get the data! And there are plenty of ways of doing that. You can crowdsource using online forms, you can perform offline data collection, you can use some crazy web scraping skills, or you could simply download the datasets from government sites, using their data portals or through a Freedom of Information request.
* Validate: We got our hands on the data, but that doesn’t mean it’s the data we need. We have to check whether the details are valid, such as the metadata, the methodology of collection, whether we know who organised the dataset and whether it’s a credible source. We heard a joke once, but it’s only funny because it’s true: all data is bad, we just need to find out how bad it is!
* Clean: It’s often the case that the data we get and validate is messy. It might have duplicated rows, column names that don’t match the records, values containing characters that are difficult for a computer to process, and so on. In this step, we need skills and tools that will help us get the data into a machine-readable format, so that we can analyse it. We’re talking about tools like OpenRefine or LibreOffice Calc and concepts like relational databases.
* Analyse: This is it! It’s here that we get insights about the problem we defined in the beginning. We’re gonna use our mad mathematical and statistical skills to interview a dataset like any good journalist. But we won’t be using a recorder and a notebook. We can analyse datasets using many, many skills and tools. We can use visualisations to get insights into different variables, we can use programming tools such as Pandas (a Python package) or simply R, and we can use spreadsheet software such as LibreOffice Calc or even statistical suites like PSPP.
* Present: And, of course, you will need to present your data. Presenting it is all about thinking of your audience, the questions you set out to answer and the medium you select to convey your message or start your conversation. You don’t have to do it all by yourself: it’s good practice to get support from professional designers and storytellers, who are experts at understanding the best ways to present data visually and with words.
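To make the Clean and Analyse steps a bit more concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is made up for illustration: the tiny dataset, the column names `region` and `schools`, and the helper names `clean_rows` and `average_by` are assumptions, not part of the School of Data material. It is one possible way to do it, and tools like OpenRefine or Pandas would handle the same job with less code.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical messy data: stray whitespace, a duplicated row,
# and a value a computer cannot read as a number.
RAW_CSV = """region, schools
North ,12
South,7
North ,12
South,n/a
East,9
"""


def clean_rows(text):
    """Clean step: tidy headers and values, drop duplicates and unparseable rows."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip().lower() for name in next(reader)]
    seen = set()
    cleaned = []
    for raw in reader:
        values = tuple(cell.strip() for cell in raw)
        if len(values) != len(header) or values in seen:
            continue  # skip malformed or duplicated rows
        seen.add(values)
        row = dict(zip(header, values))
        try:
            row["schools"] = int(row["schools"])
        except ValueError:
            continue  # skip rows whose count is not a number
        cleaned.append(row)
    return cleaned


def average_by(rows, key, value):
    """Analyse step: average `value` for each distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: mean(values) for name, values in groups.items()}


rows = clean_rows(RAW_CSV)
averages = average_by(rows, "region", "schools")
print(averages)
```

Dropping the `n/a` row silently is a shortcut taken for brevity. In a real project you would probably want to note which rows were removed and why, since that is exactly the kind of detail the Validate step is meant to surface.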
The Data Expedition
Data Expeditions are quests to map uncharted territory, discover hidden stories and solve unsolved mysteries in the Land of Data. In a team you’ll tackle a problem, answer a question or work on a data project. We help you get started and decide on the direction you want to take, then guide you through the various steps of the expedition. These steps are modelled on the data pipeline and give structure to your work on the project.
Two of our central philosophies are learn by doing and work with real data. Data expeditions are the truest embodiment of these philosophies. You’ll go in circles, get lost, make mistakes, and sometimes not reach your goal. But that’s fine! This is the best way to get familiar with working with data. And you may even make some new friends along the way.