Irfan Ahmad has recently collected data on Twitter users in Pakistan. He writes:
Twitter has recently become very popular in Pakistan, as lots of journalists, TV anchors, celebrities and politicians have started using it very actively. There are reports of political parties setting up dedicated Social Media Cells to manage their online identities on sites like Twitter and Facebook. We wanted answers to questions like:
- How many Twitter users are there in Pakistan?
- How are Pakistani Twitter users distributed demographically?
- Who are the most influential Twitter users in Pakistan?
Twitter publishes no such data, but its API exposes public profiles. Our assumption was that every Pakistani Twitter user would have some specific keywords or geo coordinates as the location in their profile.
This assumption misses many Pakistanis, since many never filled in their profile in detail and many live abroad, but with it we were able to find almost all of the commonly known popular Twitter users. The Twitter API also provides time zone information with each profile, and we used that as well.
Another problem with this logic was that many keywords, such as Kashmir, Gujrat and Hyderabad, are common to India and Pakistan. So another task was to exclude those users from being crawled. For this, the weight given to these keywords was reduced and some block keywords were introduced. This approach filtered many Indian Twitter users out of the results.
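The keyword logic described above can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the keyword weights, the block keywords, the time zone names and the threshold are all made-up values chosen for the example.

```python
import re

# Hypothetical weights: unambiguous names score high, names shared
# with India score low. None of these values come from the project.
KEYWORD_WEIGHTS = {
    "pakistan": 3, "lahore": 3, "karachi": 3, "islamabad": 3,
    "kashmir": 1, "gujrat": 1, "hyderabad": 1,
}
# Hypothetical block keywords: any match rejects the profile outright.
BLOCK_KEYWORDS = {"india", "andhra", "telangana"}
# Assumed time zone labels; the real values depend on the API response.
PAKISTAN_TIME_ZONES = {"Karachi", "Islamabad"}
THRESHOLD = 3


def looks_pakistani(location, time_zone=None):
    """Return True if the profile location scores at or above THRESHOLD."""
    tokens = set(re.findall(r"[a-z]+", location.lower()))
    if tokens & BLOCK_KEYWORDS:
        return False
    score = sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in tokens)
    if time_zone in PAKISTAN_TIME_ZONES:
        score += 2  # time zone acts as supporting evidence
    return score >= THRESHOLD
```

With these example values, an ambiguous keyword such as "Hyderabad" is not enough on its own, but it passes when the time zone also points to Pakistan.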
Using this logic we started a crawler that runs day and night to find more and more Pakistani tweeps. The crawler is quite slow because of the rate limits Twitter imposes on its API. It has been running for a few months and has so far crawled about 155,000 Pakistani Twitter users into its database, which is a significant dataset for our analysis.
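One simple way to stay under an API rate limit is to space requests evenly. The sketch below is an assumption about how such a throttle could look, not the crawler's real implementation; the `Throttle` class and its parameters are hypothetical, and the limit you pass in must come from the API's own documentation.

```python
import time


class Throttle:
    """Space calls so that at most `max_calls` happen per `per_seconds`."""

    def __init__(self, max_calls, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per_seconds / max_calls
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

A crawler loop would call `throttle.wait()` before each API request. With several worker processes, each worker needs its own share of the limit, or the throttle has to live in one coordinating process.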
The second part of the project was to find the more influential Twitter users. After analysing the top users by follower count, we concluded that follower count is not a very reliable measure of influence; lots of other factors are involved. Fortunately, there is a very nice application, Klout, which measures an influence score. So the next step was to write a crawler that fetches the Klout score of every Pakistani Twitter user. Klout has a very reasonable rate limit, and with the current database size it is possible to update each user's Klout score within a week. This crawler runs day and night as well.
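To get a feel for what "within a week" means for about 155,000 users, here is the arithmetic as a small sketch. It only computes the sustained request rate needed and says nothing about Klout's actual limit.

```python
USERS = 155_000                      # approximate database size from the post
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

per_day = USERS / 7
per_second = USERS / SECONDS_PER_WEEK

print(f"{per_day:.0f} lookups/day, {per_second:.2f} lookups/second")
```

That works out to roughly 22,000 lookups a day, or about one lookup every four seconds, assuming one request per user.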
Both crawlers are written in Python and store their results in Redis. They use Python's multiprocessing. The following Python libraries were used:
Finding Geographical Location:
The first part was to find out where Pakistani Twitter users are located geographically. Our crawler assigns a city or province to each user based on the keyword matched against the location in their Twitter profile.
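A minimal sketch of that assignment step, assuming a small hand-written lookup table. The tables and the `locate` function are hypothetical; the real crawler's keyword list is not shown in this post.

```python
import re

# Illustrative lookup tables only; a real list would be much longer.
CITY_TO_PROVINCE = {
    "lahore": "Punjab",
    "karachi": "Sindh",
    "peshawar": "Khyber Pakhtunkhwa",
    "quetta": "Balochistan",
}
PROVINCES = {"punjab": "Punjab", "sindh": "Sindh", "balochistan": "Balochistan"}


def locate(location):
    """Return (city, province) for a profile location, with None for unknowns."""
    tokens = re.findall(r"[a-z]+", location.lower())
    # Prefer a city match, since it also determines the province.
    for token in tokens:
        if token in CITY_TO_PROVINCE:
            return token.title(), CITY_TO_PROVINCE[token]
    for token in tokens:
        if token in PROVINCES:
            return None, PROVINCES[token]
    return None, None
```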
Twitter does not provide the gender of each user, but it can usually be guessed from the user's name. For this analysis gender.c was used with a custom names database of 5000 Pakistani names.
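The idea of a name lookup can be illustrated in Python as below. This is not how gender.c works internally; it is a simplified stand-in, and the four-entry `NAME_GENDER` table is a hypothetical substitute for the 5000-name database.

```python
# Hypothetical miniature names database: first name -> gender.
NAME_GENDER = {
    "ali": "male",
    "ahmed": "male",
    "fatima": "female",
    "ayesha": "female",
}


def guess_gender(display_name):
    """Guess gender from the first word of a display name."""
    parts = display_name.strip().lower().split()
    if not parts:
        return "unknown"
    return NAME_GENDER.get(parts[0], "unknown")
```

Names that are missing from the table, or profiles that use a brand or nickname, simply come back as "unknown".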
Finding most commonly used names:
This was easy: just a few lines of Python code and the result was compiled.
Finding the most commonly used words in Pakistani tweeps' profiles:
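Those few lines might look something like this. It is a sketch that assumes the first word of the display name is the first name; `most_common_first_names` is a hypothetical helper, not code from the project.

```python
from collections import Counter


def most_common_first_names(display_names, n=10):
    """Count first words of display names, case-insensitively."""
    firsts = (
        name.split()[0].title()
        for name in display_names
        if name.split()  # skip empty or whitespace-only names
    )
    return Counter(firsts).most_common(n)
```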
There were some other results generated using simple queries like:
- Finding verified twitter accounts
- Finding users with the most followers
- When Pakistanis joined twitter?
- Finding most influential twitter users in Pakistan
Now the most important part. As we are also crawling each user's Klout score, this was again simply a query on our data.
Now this is quite interesting data. Based on this, some more facts were collected like:
- Out of the top 100 most influential Pakistani Twitter accounts, one third are related to traditional media (mostly talk show anchors)
- There are about 10K Pakistani Twitter users with no followers and 12K accounts with just 1 follower (most of them were fake accounts created by social media cells of political parties to increase follower count)
- On average each Pakistani tweep is followed by 129 tweeps
- There are 24 Pakistani Twitter users with 50K+ followers and only 11 users with 100K+ followers
- Out of 150K Pakistani Twitter users, almost half have fewer than 10 followers and a quarter have 11-50 followers
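Breakdowns like the ones above come down to bucketing follower counts. Here is a minimal sketch; the bucket boundaries are my own assumption, in particular where a user with exactly 10 followers lands.

```python
from collections import Counter


def bucket(followers):
    """Map a follower count to a label (boundaries are illustrative)."""
    if followers == 0:
        return "0"
    if followers == 1:
        return "1"
    if followers <= 10:
        return "2-10"
    if followers <= 50:
        return "11-50"
    return "51+"


def follower_shares(counts):
    """Return the fraction of users that falls in each bucket."""
    if not counts:
        return {}
    totals = Counter(bucket(c) for c in counts)
    return {label: totals[label] / len(counts) for label in totals}
```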
What I learnt from this project
- NoSQL is great at handling large datasets
- Whenever using some API or scraping, use unicode.
- Handling rate limits of APIs with multiprocessing can be challenging.
- NLTK is a great library if used properly.