Data Sources
There are three basic ways of getting hold of data:
- Finding data – This involves searching for data that has already been released.
- Getting hold of more data – This means asking official sources for ‘new’ data, e.g. through Freedom of Information requests. Sometimes data is public on a website but there is no download link to get hold of it in bulk. Don’t give up! This data can be liberated with what data wranglers call scraping.
- Collecting data yourself – This means gathering data and entering it into a database or a spreadsheet – whether you work alone or collaboratively.
In this tutorial we’ll focus on finding data that has already been released. We will deal with getting more data and collecting data yourself in future courses.
Step 1: Identify your Data Source
Many sources frequently release data for public use. Some examples:
- Government: In recent years governments have begun to release some of their data to the public. Many governments host dedicated open government data platforms for the data they create. For example, the UK government started data.gov.uk to release its datasets. Similar data portals exist in the US, Brazil and Kenya, and in many other countries! Does your country have an open data portal? A good starting point for finding out is datacatalogs.org.
- Organisations: Other sources of data are large organisations. The World Bank and the World Health Organization, for example, regularly release reports and datasets.
- Science: Scientific projects and institutions release data to the scientific community and the general public. NASA, for example, produces open data, and many disciplines have their own data repositories, some of which are open. A growing number of initiatives, such as Dryad, aim to provide access to data that has already been published.
To help people find data, projects like the Open Access Directory’s data repository list and datahub.io have been started. They aim either to list data sources or to bring together datasets from various sources.
Step 2: Getting data in the format you need
In the “What is Data” section, we talked briefly about the importance of machine-readable data. You’ll save yourself a lot of time and trouble in working with the data if you get hold of it in the right format from the start. Here’s a handy tip for telling Google which format you are looking for.
Finding more Data using Google
You can search for CSV files on Google by adding filetype:csv to your search terms. Searching for “South Africa filetype:csv” will return CSV files mentioning South Africa. You can try other file types as well, such as “xls” for Excel spreadsheets or “pdf”.
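If you run this kind of search often, you can build the search address with a few lines of code. The following is only a minimal sketch in Python: the function names build_filetype_query and build_search_url are made up for illustration, it uses the plain filetype: form of the operator, and it assumes that Google accepts the search text in the q parameter of https://www.google.com/search. It only builds the address; it does not contact Google.

```python
from urllib.parse import urlencode


def build_filetype_query(terms, filetype):
    """Combine search terms with a filetype: restriction (hypothetical helper)."""
    return f"{terms} filetype:{filetype}"


def build_search_url(terms, filetype, base="https://www.google.com/search"):
    """Return a search address for the given terms and file type.

    The base address and the 'q' parameter are assumptions about how
    Google's search page is addressed; adjust them for other search engines.
    """
    query = build_filetype_query(terms, filetype)
    # urlencode escapes spaces and the colon so the address is safe to paste
    return base + "?" + urlencode({"q": query})


if __name__ == "__main__":
    # Try the same search terms with several file types
    for filetype in ("csv", "xls", "pdf"):
        print(build_search_url("South Africa", filetype))
```

Paste one of the printed addresses into your browser to run the search. Swapping “csv” for “xls” or “pdf” changes the file type in the same way as typing it in the search bar.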
