An Introduction to SQL Databases for Data Analysis – Part 1

This module provides a basic introduction to the SQLite database and the SQL database query language. This first part introduces SQLite and illustrates how to start working with it.

Module Objectives

  • Understand basic SQL
  • Understand importing and exporting of Data

Prerequisites

  • SQLite – a simple, self-contained, serverless relational database. Comes preinstalled on Mac OSX, for Windows Download it here

Introduction

Spreadsheets are great for data analysis but when it comes to larger amounts of data or more complex data structures – databases really outperform them. The most common type of databases used are so-called “SQL” databases. SQL is short for “Structured Query Language” – a special language to send requests to the database and get some results out of it (or data into it). While there is some standard for SQL – every SQL-database uses a slightly different dialect. Don’t be afraid – the basics are the same.

The Basics

As we have already mentioned, many database management systems support SQL. Whilst we could install an industry strength system to learn SQL with, there is also a much more convenient – and cross-platform – route we can get started with: SQLite.

SQLite is a self-contained database system that doesn’t require any sort of server support, which means it’s easy to get up an running. To work through this activity, you will need to install SQLite on your computer.

Download and install SQLite: find the appropriate Precompiled Binaries link for your platform and download and unzip the corresponding file.

Starting SQLite

The L in SQL stands for language – that means you have to speak to your computer – on a command line. Don’t be afraid: the basics of the language are not too hard to learn (and for more complex things there is always internet search).

On Mac

Open the application “Terminal” and type sqlite3.

On Windows

Save the Sqlite executable in the folder you want to work with and double click it.

The command line

Sqlite starts up in a command line. You should now see some information about the version of sqlite you’re using and a prompt saying sqlite. Don’t worry you’re not stuck in here: You can always exit sqlite by typing .quit.

SQLite version 3.7.13 2012-07-17 17:46:21
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

The first thing we want to look at is some built in functions of sqlite. Type .help followed by enter to see some of the commands you have in there – this is useful for reference. Note: These commands are not part of the sql language.

Creating your first table

Data in databases is stored in tables. To use spreadsheets as an analogy: a database is a workbook, a table is a single sheet. As in well designed spreadsheets data is organized in rows and columns. With the single variables in columns and each row a specific data point. Unlike spreadsheets, databases are a lot more stringent when it comes to what is in a column. To store data we have to explicitly define what is to be stored in a table – let’s do this.

We’ll store the following data in the table.

Name Mood Age
Michael Grumpy 33
Anders Happy 35
Zara Happy 28

First, we need to define the columns – we have Name (which is text), Mood (also text) and Age (a number without decimals). We’ll tell the computer: Make a table with the columns: Name Text, Mood Text, Age Number. Let’s translate this to SQL. As with natural language we use words to tell the computer what we want:

CREATE TABLE people (name char(255), mood char(255), age int);

Note several things: First our text has become a “char” – this is because names are reasonably short and we estimate the maximum length for a name at 255 characters. Char actually stands for character – here we’re telling the database we want a field that’s at max 255 characters long. The same for mood. Age has become an “Integer” that is a number without decimals – also see the “;” at the end. This tells SQL this is where our sentence stops. The sentences we’re telling SQL are called “queries”. You will have noticed that there are certain words written in all caps above. This is not required but standard convention in SQL: It makes it easier to read for humans.

Fantastic, we have now created our first table. We can find it using the sqlite command .tables which will return:

people

and look at it’s definition using .schema people, which will return our original query:

CREATE TABLE people (name char(255), mood char(255), age int);

This is a very easy and small table – for larger tables and for interactions across multiple tables we will use one or more columns to uniquely identify each row. This column is specified as the “primary key”. Having a primary key helps you to retrieve data from your database much faster if you use to access it.

Now let’s add some data to it – we will tell the database to insert a record for Michael, who is grumpy and 33 years old. In SQL this is

INSERT INTO people VALUES (’Michael’,’Grumpy’,33);

Let’s do the same for Anders and Zara:

INSERT INTO people VALUES (’Anders’,’Happy’,35);
INSERT INTO people VALUES (’Zara’,’Happy’,28);

Fantastic! We’ve entered some information into the table! Note: You can omit the column names only if you’re entering all values and in the order of the columns. Otherwise you have to specify the column. E.q. if we don’t want to enter a mood we can use

INSERT INTO people (name, age) VALUES ('Tom', 33);

Filtering and Sorting SQL style

Let’s now look at our data. To see the contents of our table we can tell the database: Show me everything in the people table. In SQL:

SELECT * FROM people;

Note the * – it is a “wildcard” character. This means as much as select all the columns.

The database will respond with:

Michael|Grumpy|33
Anders|Happy|35
Zara|Happy|28

Fantastic! We can now retrieve our data. But databases would be pretty useless if we could just get everything we stored (Why not use a file in that case…?) – then excel when it comes to sorting and filtering!

In a giant database, you might want to get all the records that correspond to a particular entry (name = ‘John’, age = 51). This is when databases become very useful, as SQL queries are very powerful filters. Let’s now only get the happy people. We’ll tell the database show me everything in the people table if the mood is happy. in SQL:

SELECT * FROM people WHERE mood=’Happy’;

Databases can do this very very quickly even if the dataset you work with is giant. Note the WHERE clause – this defines our filters (here give me all the rows where the mood is happy). You can combine multiple conditions with AND, OR and NOT.

Another thing we might want to do is sorting (and figuring out the range of values within a column).

Let’s first sort the dataset by age – youngest first. We say: give me everything ordered by age ascending. SQL:

SELECT * FROM people ORDER BY age ASC;

Wasn’t that easy? Of course you can combine filters and sorting. ORDER BY serves as another kind of filter. Note that ASC stands for Ascending (and DESC stands for descending)

Maxima and Minima

Let’s figure out the maximum and minimum age.

we’ll say give us the maximum age of the people in SQL:

SELECT max(age) FROM people;

Can you figure out how to get the minimum? Note this will only return the maximum age – not the entry with the maximum age.

A good thing about SQL is it allows us to use nested queries (queries within queries). E.g. if we want to find out which people have the maximum age we can use:

SELECT * FROM people WHERE age=(SELECT max(age) FROM people);

And see it’s Anders!

Nested Queries

A very powerful feature of SQL is that allows us to use nested queries (queries within queries). For example, if we want to find out which people have the maximum age we can use:

SELECT * FROM people WHERE age=(SELECT MAX(age) FROM people);

And see it’s Anders!

Exporting and Importing Data

As well as querying data contained within a database, we may also need to be able to get data into and out of the database in bulk.

Exporting Data

Often you don’t want to have the data you just analyzed live in the SQL database only – you might want to export it so you can work with it. Sqlite supports exporting very nicely.

Using the command .mode helps you change the format of the result of your queries. By default, it is list, so our table will appear like this:

Michael|Grumpy|33
Anders|Happy|35
Zara|Happy|28

But you can change it to other formats like CSV (comma separated values), so our table will look like this:

Michael,Grumpy,33
Anders,Happy,35
Zara,Happy,28

If you use the command .header on you will also get a header row:

name,mood,age
Michael,Grumpy,33
Anders,Happy,35
Zara,Happy,28

The command .output allows you to choose where the results of your queries will be sent. By default, they are simply saved in memory in displayed on your screen. If you want to save your results in a CSV file, you should write:

.output yourfile.csv

and now any query you will be doing will be written to the file. Note that the file is saved in the folder where you terminal was when you launched sqlite. By default, the terminal operates in your Home folder.

If you don’t want your queries to be written in the file anymore, .output stdout reverts back to printing them on your screen.

If you decide to export your filtered data to a file using .mode, a good practice is to prepare your query until you like it and then execute it with a file it is written to.

Importing Data

Importing data works similarly. However, we have to prepare our table first. If you do have a .csv file, find out what columns you have and what data is in there. Create your table accordingly. Then set your .mode to csv as described above and use .import myfile.csv mytable

Check whether the header row has actually been imported as data and delete if necessary.