Now that we know what data is and which questions we’re interested in, we’re ready to go out and hunt for data online.
In this tutorial, you will learn where to start looking for data. In the rest of the course, we will look at different ways of getting hold of data, before setting you loose to find data yourselves!
There are three basic ways of getting hold of data:
- Finding data – this involves searching and finding data that has already been released
- Getting hold of more data – asking for ‘new’ data from official sources, e.g. through Freedom of Information requests. Sometimes data is public on a website but there is no download link to get hold of it in bulk. Don’t give up: this data can be liberated with what data wranglers call scraping.
- Collecting data yourself – this means gathering data and entering it into a database or a spreadsheet, whether you work alone or collaboratively.
In this tutorial we’ll focus on finding data that has already been released. We will deal with getting more data and collecting data yourself in future courses.
Step 1: Identify your Data Source
Many sources frequently release data for public use. Some examples:
- Government: In recent years, governments have begun to release some of their data to the public. Many governments host dedicated open government data platforms for the data they create. For example, the UK government started data.gov.uk to release its datasets. Similar data portals exist in the US, Brazil and Kenya, and in many other countries. Does your country have an open data portal? Datacatalogs.org is a good starting point for finding out.
- Organisations: Large organisations are another source of data. The World Bank and the World Health Organization, for example, regularly release reports and data sets.
- Science: Scientific projects and institutions release data to the scientific community and the general public. NASA, for example, produces open data, and many disciplines have their own data repositories, some of which are open. A growing number of initiatives, such as Dryad, aim to provide access to already published data.
To help people find data, projects like the Open Access Directory’s data repository list and the Open Knowledge Foundation’s datahub.io have been started. They aim either to list data sources or to bring together data sets from various sources.
Step 2: Getting data in the format you need
In the “What is Data” course, we talked briefly about the importance of machine-readable data. You’ll save yourself a lot of trouble and time in working with the data if you get hold of data in the correct format initially. Here’s a handy tip for how to tell Google which format you are looking for.
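One common way to tell Google which format you want is its `filetype:` search operator, which can be combined with `site:` to restrict results to a single website. The sketch below only builds such a search URL so you can paste it into a browser. The function name `build_search_url` and the example topics are illustrative assumptions, not part of the original course material.

```python
from urllib.parse import urlencode


def build_search_url(topic, file_format=None, site=None):
    """Build a Google search URL, optionally limited to a file type and a site.

    Hypothetical helper for illustration: it relies only on Google's
    `filetype:` and `site:` search operators.
    """
    terms = [topic]
    if file_format:
        # e.g. "filetype:csv" asks for results that are CSV files
        terms.append(f"filetype:{file_format}")
    if site:
        # e.g. "site:data.gov.uk" limits results to that website
        terms.append(f"site:{site}")
    query = " ".join(terms)
    # urlencode escapes spaces and colons so the URL is safe to share
    return "https://www.google.com/search?" + urlencode({"q": query})


# Example searches (topics chosen purely for illustration)
print(build_search_url("life expectancy", file_format="csv"))
print(build_search_url("health spending", file_format="xls", site="data.gov.uk"))
```

Typing the same query directly into the search box, for example `life expectancy filetype:csv`, has the same effect. Machine-readable formats such as CSV or XLS are usually the most useful ones to ask for.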
Using data to answer your question
Now that you have an overview of some of the key concepts related to data, it’s time to start hunting for your own! Over the next courses in the Data Fundamentals series, we will further explore the question we posed ourselves in the What is Data course: How does healthcare spending influence life expectancy? To get the data for this course, please see our recipe on Getting Data from the World Bank.
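The recipe remains the reference for this course. If you prefer to fetch the numbers programmatically, the following is a minimal sketch that assumes the World Bank's public v2 API and its JSON response layout (a two-element list: paging metadata, then the records). The helper names are hypothetical, and the indicator codes in the comments are assumptions you should confirm in the World Bank's indicator listing before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.worldbank.org/v2"

# Assumed indicator codes; please verify them in the World Bank catalogue:
#   SP.DYN.LE00.IN    - life expectancy at birth, total (years)
#   SH.XPD.CHEX.PC.CD - current health expenditure per capita (current US$)


def indicator_url(indicator, country="all", start=2000, end=2010, per_page=20000):
    """Build the request URL for one indicator over a range of years."""
    params = urlencode(
        {"format": "json", "date": f"{start}:{end}", "per_page": per_page}
    )
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{params}"


def parse_indicator(payload):
    """Turn a decoded API response into (country, year, value) tuples.

    Records without a value are skipped. An error response (which does not
    have the two-element layout) yields an empty list.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    rows = []
    for record in payload[1]:
        if record.get("value") is None:
            continue
        rows.append(
            (record["country"]["value"], int(record["date"]), float(record["value"]))
        )
    return rows


def fetch_indicator(indicator, **kwargs):
    """Download one indicator. Reads only the first page of results."""
    with urlopen(indicator_url(indicator, **kwargs)) as response:
        return parse_indicator(json.load(response))
```

Calling `fetch_indicator("SP.DYN.LE00.IN", country="KEN")` would, under these assumptions, return a list of yearly life expectancy values for Kenya. Running the same call with the health expenditure code gives the second half of the data needed for the question.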
Task: If you found your own alternative data to answer this question, congratulations! Take a moment to upload it to the DataHub – and have a browse to see what other School of Data learners have found.
Extension Task: Explore the web, and see what open data you can find. If you find something really interesting and think of an exciting question it could help to address, tweet it to @SchoolofData – or write a short post for the School of Data blog.
In this tutorial we discussed how we get the data to answer our question. We explored different ways of accessing data sources and introduced several resources listing different data portals and search engines.
At the beginning of Data Fundamentals, we posed ourselves a question: ‘How does healthcare spending influence life expectancy?’ By following the recipe, we have found a dataset from the World Bank that will help us to answer that question.
Last updated on Sep 02, 2013.