An Introduction to SQL Databases for Data Analysis – Part 2

This module is the second part to the introduction to the SQLite database and the SQL database query language. We will discover how we can start analysing a real dataset using SQL.

Module Objectives

  • Understand how you can use SQL to do analysis
  • Learn intermediate-level SQL queries

Prerequisites

Introduction

Let’s now go and do some data analysis using SQL. We’ll ask a series of questions and try to answer it using SQL – the queries we build will get more and more complex.

Opening the database

In your sql prompt type

.open [path-and-name-of-sqlite-file]

  • this will open the database.
    If you have an older version of sqlite that does not have support for the .open command, you need to quit sqlite (.quit) and start sqlite with the name of the file you would like to open:

sqlite3 [path-and-name-of-sqlite-file]

If you look which tables there are (with the command .tables) there is one: ‘companies’ – look at the schema (with the command .schema companies) and see what fields are there.

How many companies do we deal with?

Next, let’s find out how many companies we have. In SQL language, our query will count the entries in a specific column.

tip: working with the ID column is most of the time a safe bet, as it should always be filled. Remember when we talked about “primary keys” earlier? In this example the ID column is our primary key.

In SQL:

SELECT count(id) FROM companies;

This will tell you we have roughly one million companies.

Which type of companies do we have?

In the data you’ll see a column called type – this is the type of the company that is registered. Let’s see which different types we have. We can do this with the SQL

SELECT distinct(type) FROM companies;

To find out how many types there are we can combine expressions such as count and distinct:

SELECT count(distinct(type)) FROM companies;

Now an you find out how many different cities we talk about?

Which are the most common company types?

The next thing we might ask ourselves is which companies are the most common? For this we will need to count the ids in each category. SQL does this with the GROUP BY statement:

SELECT type,count(id) FROM companies GROUP BY type;

Did you notice that we told the computer to select two different variables? The type and the count of numbers by type? You can specify multiple columns you want to get (and what operations you want to perform on them) if you separate them by commas.

Can we sort the results? – Yes:

SELECT type,count(id) AS count FROM companies GROUP BY type ORDER BY count DESC;

We introduced two new concepts here: Naming of the return column (count(id) AS count) and the ordering of results (ORDER BY). The ordering can be descending (DESC) or ascending (ASC).

Now let’s get the top 5:

SELECT type,count(id) AS count FROM companies GROUP BY type ORDER BY count DESC LIMIT 5;

How many companies of each company type is in bankruptcy?

The Company database is nice and gives us information whether or not a company is in bankruptcy (the bankruptcy column). Let’s first look at how this is distributed.

SELECT bankruptcy,count(id) FROM companies GROUP BY bankruptcy;

We see that only about 3700 companies in the register are bankrupt.

Which types are those? We can add another GROUP BY statement.

SELECT type,bankruptcy,count(id) FROM companies GROUP BY type, bankruptcy;

This will give us the number of companies in bankruptcy (and those who are not) by category. If we only want to see one of those – we could of course filter.

SELECT type,count(id) FROM companies WHERE bankruptcy=’J’ GROUP BY type;

Where are the companies?

Let’s find out where the companies are – let’s find the top cities. We already know how to do this by finding the top types:

SELECT city, count(id) FROM companies GROUP BY city;

There are many cities with weird names, certainly because they were errors when entering the data in the database.

Let’s find the Top 10:

SELECT city, count(id) AS count FROM companies GROUP BY city ORDER BY COUNT DESC LIMIT 10;

Fantastic – now we can start doing statistics on how many companies are where!

Which City has the biggest bankruptcy rate?

Although bankruptcies are generally rare – let’s look at which city has the biggest bankruptcy rate…

First we’ll need to find the number of companies per city.

SELECT city, count(id) AS count FROM companies GROUP BY city;

Now let’s get complicated: we want the total of companies and the number of bankrupt ones, for the top20 cities with the most bankruptcies

SELECT companies.city, count(companies.id) AS total,bankrupt.count 
FROM companies 
LEFT OUTER JOIN (
     SELECT city, count(id) AS count 
     FROM companies 
     WHERE bankruptcy='J' 
     GROUP BY city) 
AS bankrupt 
ON bankrupt.city=companies.city 
GROUP BY companies.city
ORDER BY count DESC
LIMIT 20;

OK what happened here? We joined two queries. The one for total counts :

SELECT companies.city, count(companies.id) AS total,bankrupt.count 
FROM companies 

and the one for bankrupt counts:

SELECT city, count(id) AS count 
     FROM companies 
     WHERE bankruptcy='J' 
     GROUP BY city)
AS BANKRUPT

We nested those two queries together by using the JOIN statement:

LEFT OUTER JOIN (...)

Because we needed to link the number of companies per city and the number of bankrupt companies per city, we suddenly had to use the column identifying the city twice, which wouldn’t work. So we artificially created a table corresponding to the number of bankruptcies per city, that we called “bankrupt’:

AS BANKRUPT

Now that we have a table companies and a table bankrupt, we tell the program which column to use for the linking:

ON bankrupt.city=companies.city

So the column “city” of the table bankrupt is identical to the column “city” of the table companies, which allows for joining. This method explains why in the first line we wrote companies.city, companies.id and bankrupt.count.

Lastly, we ordered by the number of bankruptcies and looked at the top20:

ORDER BY count DESC
LIMIT 20

Let’s now calculate the percentage.

SELECT companies.city, count(companies.id) AS total,bankrupt.count, 100.0*bankrupt.count/count(companies.id) AS percentage 
FROM companies 
LEFT OUTER JOIN (
     SELECT city, count(id) AS count 
     FROM companies 
     WHERE bankruptcy='Y' 
     GROUP BY city ) 
AS bankrupt 
ON bankrupt.city=companies.city 
GROUP BY companies.city
ORDER BY count DESC
LIMIT 20;

Here we simply added a new variable, which calculates the percentage of companies bankrupt:

100.0*bankrupt.count/count(companies.id) AS percentage

Nothing exceptional, just basic math. But notice the number 100.0: we added a decimal to make sure that the numbers are not rounded. If they were, all the cities under 1% bankruptcy rate would display 0.

To find the cities with the highest bankruptcy rate is a simple matter of changing ORDER BY count DESC into ORDER BY percentage DESC.

Further Reading

SQL is very powerful for data analysis.
You can train yourself using the website SQLZoo: http://sqlzoo.net/wiki/SQL_Tutorial or read more about it

Quiz

How do you get the number of cities in the dataset used above? Fill the gap

SELECT _______ FROM companies;

  • distinct(city)
  • count(distinct(city)) X
  • count(city)
  • distinct(count(city))

How could we determine the average age in the dataset about people?

(fill out the gap)

SELECT _____ from people;

  • mean(age)
  • average(age)
  • sum(age)/count(age)
  • sum(age)/(count(age)*1.0) X

Which of the following modes is not supported by sqlite? (for exporting data)

  • html
  • csv
  • SQL insert statements
  • json X

How can you aggregate company counts by city? (fill the gap)

SELECT city,count(id) from companies ________;

  • WHERE city = ‘Oslo’
  • ORDER BY count ASC
  • LIMIT 10
  • GROUP BY city X