Scraping more than one webpage: ScraperWiki
Note: Before proceeding into full scraping mode, it’s helpful to understand the flesh and bones of what makes up a webpage. Read the Introduction to HTML recipe in the handbook.
Until now we’ve only scraped data from a single webpage. What if there are more? Or you want to scrape complex databases? You’ll need to learn how to program – at least a bit.
It’s beyond the scope of this course to teach you how to write scrapers; our aim here is to help you understand whether it is worth investing your time to learn, and to point you at some useful resources to help you on your way!
Structure of a scraper
Scrapers consist of three core parts:
- A queue of pages to scrape
- An area for structured data to be stored, such as a database
- A downloader and parser that adds URLs to the queue and/or structured information to the database.
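The three parts above can be sketched in a few lines of Python. This is a minimal illustration, not a real scraper: the pages are hardcoded strings (a real scraper would download each URL), the “database” is just a list, and the page contents and CSS class names are assumptions made up for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical pages, hardcoded so the sketch runs without network access.
# A real scraper would download each URL (e.g. with urllib).
PAGES = {
    "/page1": '<a href="/page2">next</a><span class="name">Alice</span>',
    "/page2": '<span class="name">Bob</span>',
}

class Parser(HTMLParser):
    """Collects links (to feed the queue) and data (to feed the store)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.names = []
        self.in_name = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        if tag == "span" and attrs.get("class") == "name":
            self.in_name = True
    def handle_endtag(self, tag):
        if tag == "span":
            self.in_name = False
    def handle_data(self, data):
        if self.in_name:
            self.names.append(data)

queue = deque(["/page1"])  # 1. the queue of pages to scrape
seen = set(queue)
database = []              # 2. the structured-data store (a list here; SQLite in practice)

while queue:               # 3. the downloader/parser loop
    url = queue.popleft()
    parser = Parser()
    parser.feed(PAGES[url])          # "download" and parse the page
    database.extend(parser.names)    # save structured data
    for link in parser.links:        # add newly discovered pages to the queue
        if link not in seen:
            seen.add(link)
            queue.append(link)

print(database)
```

The loop keeps running until the queue is empty, so data from every reachable page ends up in the store; the `seen` set stops the scraper from visiting the same page twice.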
Fortunately for you there is a good website for programming scrapers: ScraperWiki.com
ScraperWiki has two main functions: you can write scrapers – which can optionally run on a regular schedule, with the data available to everyone who visits – or you can request that scrapers be written for you. The latter costs some money; however, it also helps to contact the ScraperWiki community (Google Group) – someone might get excited about your project and help you!
If you are interested in writing scrapers with ScraperWiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the ScraperWiki documentation: https://scraperwiki.com/docs/python/
When should I make the investment to learn how to scrape?
A few reasons (non-exhaustive list!):
- If you regularly have to extract data where there are numerous tables in one page.
- If your information is spread across numerous pages.
- If you want to run the scraper regularly (e.g. if information is released every week or month).
- If you want things like email alerts if information on a particular webpage changes.
…And you don’t want to pay someone else to do it for you!
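The “email alerts” idea in the last point can be sketched very simply: each time you scrape a page, store a fingerprint (hash) of its content, and raise an alert whenever the fingerprint differs from the previous one. Below is a minimal sketch in Python; the two page snapshots are made-up assumptions, and a real monitor would download the page on a schedule and actually send the email.

```python
import hashlib

def fingerprint(page_text: str) -> str:
    """Hash the page content so two snapshots can be compared cheaply."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

# Hypothetical snapshots of the same page on two different days.
monday = "<p>Budget: 100</p>"
tuesday = "<p>Budget: 120</p>"

changed = fingerprint(monday) != fingerprint(tuesday)
if changed:
    # In a real monitor, this is where you would send the email alert.
    print("Page changed - send alert")
```

Comparing hashes instead of full page text means you only need to keep one short string per page between runs.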
