Knowing how to get started with an exploratory data analysis can often be one of the biggest stumbling blocks if a data set is new to you, or you are new to working with data. I recently came across a powerful example from Al Essa/@malpaso where he illustrates one way in to exploring a new data set – explaining a set of apparent outliers in the data. (Outliers are points that are atypical compared to the rest of data, in this example by virtue of taking on extreme values compared to other data points collected at the same time.)
The case refers to an investigation of life expectancy data obtained from the World Bank (World Bank data sets: life expectancy at birth*), and how Al tried to find what might have caused an apparent crash in life expectancy in Rwanda during the 1990s: The Rwandan Tragedy: Data Analysis with 7 Lines of Simple Python Code
*if you want to download the data yourself, you will need to go into the Databank page for the indicator, then make an Advanced Selection on the Time dimension to select additional years of data.
The environment that Al uses to analyse the data in the case study is iPython Notebook, an interactive environment for editing Python code within the browser. (You can download the necessary iPython application from here (I installed the Anaconda package to try it), and then followed the iPython Notebook instructions here to get it running. It’s all a bit fiddly, and could do with a simpler install and start routine, but if you follow the instructions it should work okay…)
iPython is not the only environment that supports this sort of exploratory data analysis, of course. For example, we can do a similar analysis using the statistical programming language R, and the ggplot2 graphics library to help with the chart plotting. To get the data, I used a special R library called to WDI that provides a convenient way of interrogating the World Bank Indicators API from within R, and makes it easy to download data from the API directly.
I have posted an example of the case study using R, and the WDI library, here: Rwandan Tragedy (R version). The report was generated form a single file written using a markup language called R markdown in the RStudio environment. R markdown provides a really powerful workflow for creating “reproducible reports” that combine analysis scripts with interpretive text (RStudio – Using Markdown). You can find the actual R markdown script used to generate the Rwanda Tragedy report here.
As you have seen, exploratory data analysis can be thought of as having a conversation with data, asking it questions based on what answers it has previously told you, or based on hypotheses you have made using other sources of information or knowledge. If exploratory data analysis is new to you, try walking through the investigation using either iPython or R, and then see if you can take it further… If you do, be sure to let us know how you got on via the comments:-)