Using Redis as a rapid discovery and prototyping tool.

April 11, 2013 in Uncategorized

Inspired by Rufus’ SQL for Data Analysis post, Iain Emsley wrote up his experience of using Redis for quick inspection of data. He writes:

Recently I was looking at some data about tweets and wanted to get a better idea of the number of users, the number of messages per user, and store the texts to run ad-hoc queries and prototype a web interface.

I felt that a relational database would prove too unwieldy for making quick changes, so I thought I would try Redis. I had an idea of what I was looking for, but was not entirely sure that the structures I had in mind would answer the query, and using SQL potentially meant having to create or alter tables. Equally, I wanted something persistent rather than having to recalculate from the raw data each time I wanted to look at it. Having fetched the JSON data from the public search, I saved it to a local directory, which meant that I could comment out the URL handling and re-run the code with the same data each time.

I had used Redis before for other projects and knew that it was more than just a key-value server. It supports sets, sorted sets, lists and hashes out of the box with a simple set of commands. It can also persist data to disk as well as hold it in memory. The Redis website is an invaluable and well-written resource for further Redis commands. Armed with this and some Python, I was ready to begin diving into the data.

First, I created a set called ‘comments’ which contained all the relevant keys to the dataset, such as names, using the SADD command. By doing this, I can query what the names are at a later date, or query the number of members the set has. As each tweet was being parsed, I added the name to the set, which meant that I could capture every name without duplicates. Using the user’s name as a key, I would then build up a set of counts and store the tweets for that correspondent.
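
A minimal sketch of this step, assuming a client object `r` that exposes redis-py's `sadd`, `scard` and `smembers` methods (for example `redis.StrictRedis`). The helper names `remember_user` and `user_summary` are hypothetical and used only for illustration.

```python
def remember_user(r, name):
    """Add a user name to the 'comments' set."""
    # SADD returns how many members were actually added:
    # 1 for a new name, 0 if the name was already in the set.
    return r.sadd('comments', name)


def user_summary(r):
    """Return (number of distinct users, sorted list of user names)."""
    total = r.scard('comments')              # SCARD: size of the set
    names = sorted(r.smembers('comments'))   # SMEMBERS: every member
    return total, names
```

Because a set ignores repeated members, `remember_user` can be called once per tweet without first checking whether the name has been seen.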

As each tweet was being parsed, I created a key for the count, such as count::<name>. Against this I asked Redis to increase the count using INCR. Rather than my having to check whether the key exists and then increment the count, Redis will increase the count, or start it at 1 if the key doesn’t exist yet.
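
The counting step might look like the sketch below. It assumes a client `r` with redis-py's `incr` and `get` methods. The helper names `count_tweet` and `tweet_count` are hypothetical.

```python
def count_tweet(r, name):
    """Increment and return the running tweet count for one user."""
    # INCR treats a missing key as 0, so the first call returns 1
    # and no existence check is needed.
    return r.incr('count::' + name)


def tweet_count(r, name):
    """Read a user's count back as an integer (0 if never counted)."""
    # GET returns the stored value as a string, or None for a missing key.
    value = r.get('count::' + name)
    return int(value) if value is not None else 0
```

Converting the result of GET with `int` matters when the counts are later compared or sorted, since Redis hands them back as strings.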

As each count was updated, I then stored the text as part of a simple list against the correspondent, using the RPUSH command to add the new data to the end of the list. Using a key of the form tweets::<name> meant that I could store the tweets and present them on the web page at a later date. By storing the time in the value, I could run some very basic time queries, but it also meant that I could re-run other queries to look at books mentioned or any mentions of other Twitter users (while doing this work I discovered that Twitter’s internal representation of mentions does not appear to be complete).
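
As a rough illustration of the stored value format, the sketch below packs the time and text into one string, splits them apart again, and pulls @-mentions out of the text. The helper names and the mention pattern are assumptions made for this example. The pattern is deliberately simple and is not Twitter's own definition of a mention.

```python
import re

SEPARATOR = '::'
# Simplistic pattern: an @ followed by letters, digits or underscores.
MENTION = re.compile(r'@(\w+)')


def pack_tweet(created_at, text):
    """Build the value pushed onto the tweets::<name> list."""
    return created_at + SEPARATOR + text


def unpack_tweet(value):
    """Split a stored value back into (created_at, text)."""
    # Split on the first separator only, so a '::' inside the tweet
    # text is kept as part of the text.
    created_at, text = value.split(SEPARATOR, 1)
    return created_at, text


def find_mentions(text):
    """Return the user names mentioned in a tweet's text."""
    return MENTION.findall(text)
```

Scanning the text itself is one way to catch mentions that the `to_user` field in the JSON leaves out.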

  import json
  import redis
  import glob
  import unicodedata 
  from urllib2 import urlopen, URLError
  rawtxt = '/path/to/data/twitter/'
  tag = '' #set this to be the tag to search: %23okfest
  for i in range(1,23):
      mievurl = ''+ tag +'&page='+str(i)
      turl = urlopen(mievurl)
      with open(rawtxt + str(i) + '.txt', 'wb') as out:
          out.write(turl.read()) #save the raw JSON so later runs can use the local copy
  r = redis.StrictRedis(host='localhost', port=6379, db=0)
  txtf = glob.glob(rawtxt +'*.txt')
  for ft in txtf:
      fh = open(ft).read()   
      data = json.loads(fh)
      for d in data['results']:
          if not d.get('to_user',None): d['to_user']  = ''
          #make sure the text is normalised unicode
          d['text'] = unicodedata.normalize('NFKD', d['text'])
          r.sadd('comments', d['from_user']) #add user to the set
          r.incr('count::'+ str(d['from_user'])) #count how many times they occur
          r.rpush('tweets::'+ str(d['from_user']), str(d['created_at']) + '::'+unicode(d['text'])) #store the text and time
          #store any mentions in the JSON
          if d['to_user']: #empty when the tweet is not addressed to anyone
              r.rpush('mentions::'+str(d['from_user']), str(d['to_user']))

  members = r.smembers('comments') #get all the users from the set

  people=[m for m in members]
  counts=[r.get('count::'+member) for member in people]
  #text of each user's first stored tweet; split once so '::' in the text is kept
  tweets=[r.lrange('tweets::'+m,0,-1)[0].split("::", 1)[1] for m in people]

  #dump the raw counts to look at the data
  print counts
  print people
  print tweets
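
Once the `people` and `counts` lists are available, the number of messages per user can be ranked in a few lines. The following is a sketch only, and the helper name `rank_users` is hypothetical. It assumes `counts` holds the string values returned by GET, in the same order as `people`.

```python
def rank_users(people, counts):
    """Pair each user with an integer count and sort, busiest first."""
    pairs = [
        (name, int(count))
        for name, count in zip(people, counts)
        if count is not None          # skip users with no count key
    ]
    # Sort by count descending, then by name so ties have a stable order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
```

The `int` conversion is the important part: as strings, '10' would sort before '2'.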

Having used a simple script to create the data, I then used some of the command line functions in Redis to view the data and also wrote a very simple website to prototype how it might look. As I had used Redis and stored the raw data, I was able to go back and rewrite or alter the queries easily to view more data and improve the results with a minimum of trouble.

By using sets, I could keep track of which keys were relevant to this dataset. The flexibility of keys allows you to slice the data to explore it, query it, or even add to it under different keys or different data structures as needs change. Rather than requiring you to know the schemas or rewrite SQL queries, Redis only really demands that you know the data structures you want to use. Even if you get these wrong, changing them is a very quick job. It also means that when you retrieve the data, you can manipulate and re-present it at the code level rather than potentially having to make large changes to a database each time.
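
To illustrate how the ‘comments’ set acts as an index, the sketch below rebuilds every key belonging to the dataset from the set's members and, optionally, deletes them all so that a different structure can be tried. The helper names are hypothetical. The code assumes a client `r` with redis-py's `smembers` and `delete` methods that returns member names as text strings (on Python 3 that means a client created with `decode_responses=True`).

```python
PREFIXES = ('count::', 'tweets::', 'mentions::')


def dataset_keys(members):
    """List every per-user key that the script may have created."""
    keys = []
    for name in sorted(members):
        for prefix in PREFIXES:
            keys.append(prefix + name)
    return keys


def clear_dataset(r):
    """Delete the per-user keys and the index set itself."""
    members = r.smembers('comments')
    keys = dataset_keys(members) + ['comments']
    # DEL accepts several keys at once and ignores any that do not exist.
    return r.delete(*keys)
```

Working from the index set avoids scanning the whole keyspace and leaves keys from other projects in the same database untouched.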

Due to its simplicity, I was able to “slice and dice” the data as well as create a quick web site to see if the visualisations might work. It has been a huge help in getting some ideas off the page and into code for some future projects. I’ll be keeping these tools in my tool set for the future.

