Digital Methods Initiative Winter School, University of Amsterdam
Last week I attended the 7th annual Winter School at the University of Amsterdam. Run by the Digital Methods Initiative (DMI), it took the form of a data sprint in which students joined professional developers and designers to answer research questions using social media data.
The DMI group at Amsterdam has developed and collated a suite of easy-to-use tools specifically for this kind of research. They cover a range of techniques, from web scraping to list triangulation, and are well worth checking out for anyone interested in this field. They can be found online here.
I joined a group looking at bias across three APIs through which you can acquire Twitter data: the Search API, the Streaming API and the proprietary Firehose endpoint, which is generally regarded as the most complete source of Twitter data. We had three datasets, one captured from each API, covering a critical period between 7th and 15th October 2014 when the Hong Kong protests were taking place.
Other groups took on a range of tasks, from mapping the open data revolution to tracking the global climate change debate. All projects deployed a range of data wrangling techniques to investigate these complex social, political and cultural phenomena.
A few things I learned:
- Anyone wanting to use social media data to answer research questions about society and culture needs more than just spreadsheet skills. These datasets are generally larger than what Excel can comfortably handle, so basic database skills are a massive help.
- Off-the-shelf tools for data analysis are brilliant, but you often need to tailor your line of enquiry to your specific research question. Having some knowledge of programming means that you can take a much more flexible approach than when relying on the GUI tools.
- Working in such a collaborative, fast-paced environment meant that reproducibility (i.e. different parts of the team being able to re-use scripts and code developed by other parts of the team) was essential, alongside creating documentation on the fly. We found IPython notebooks especially useful for this, whereas analytical steps taken in Excel were harder to reproduce.
- Free Twitter data, such as that available from the Search and Streaming APIs, is still good, and sometimes better than what you get through the proprietary APIs. When investigating online reactions to contentious and controversial events, such as the Hong Kong protests, tweets will inevitably be removed by both users and Twitter. If you want to get the full story, it’s far better to capture data as it comes in through the Streaming API, before anything is removed.
- We’ve written about it before on this blog but the Pandas module for Python is brilliant for data wrangling and analysis and well worth getting to know if you plan on working with big datasets. It’s quick, flexible and powerful.
- Nothing beats hands-on learning when it comes to technical skills. Having a motivating research question and some real-life data is the best way to learn how to use the multitude of tools now at any budding data wrangler’s disposal. I learnt more in a week than I could have in months reading about tools and languages in the abstract!
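To give a flavour of the kind of comparison mentioned above, here is a minimal pandas sketch of checking which tweets turn up in which capture. It is an illustration only, not the code our group wrote: the `tweet_id` column name, the `coverage_table` helper and the toy data are all assumptions made for the example, and real captures would be loaded from files rather than typed in.

```python
import pandas as pd

# Hypothetical toy data standing in for the three captures. The column
# name "tweet_id" is an assumption; real data would be loaded from disk.
search = pd.DataFrame({"tweet_id": [1, 2, 3, 5]})
stream = pd.DataFrame({"tweet_id": [2, 3, 4, 5, 6]})
firehose = pd.DataFrame({"tweet_id": [1, 2, 3, 4, 5]})


def coverage_table(sources):
    """Return one row per tweet ID with a True/False column per source."""
    # Collect every ID seen in any of the captures.
    all_ids = sorted(set().union(*(df["tweet_id"] for df in sources.values())))
    table = pd.DataFrame(index=pd.Index(all_ids, name="tweet_id"))
    # Mark, for each source, whether it contains each ID.
    for name, df in sources.items():
        table[name] = table.index.isin(df["tweet_id"])
    return table


table = coverage_table(
    {"search": search, "stream": stream, "firehose": firehose}
)

# How many tweets each source captured.
counts = table.sum()

# Tweets seen in a free API but absent from the Firehose capture,
# for example because they were deleted before the Firehose export.
missing_from_firehose = table.index[~table["firehose"]].tolist()

print(table)
print(counts)
print(missing_from_firehose)
```

Keeping one boolean column per source makes it easy to filter for any combination, such as tweets present in the stream capture but in neither of the other two, without writing a separate comparison for each pair.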
For those interested in attending a DMI school in the future – take a look at the summer school coming later in 2015.