Scraping websites using the Scraper extension for Chrome

If you use Google Chrome, there is a browser extension for scraping web pages called “Scraper”. It is easy to use: it helps you scrape a website’s content and upload the results to Google Docs.

Walkthrough: Scraping a website with the Scraper extension

  1. Open Google Chrome and click on Chrome Web Store

  2. Search for “Scraper” in extensions

  3. The first search result is the “Scraper” extension

  4. Click the “Add to Chrome” button.

  5. Now let’s go back to the listing of UK MPs

  6. Open http://www.parliament.uk/mps-lords-and-offices/mps/

  7. Now mark the entry for one MP

    http://farm9.staticflickr.com/8490/8264509932_6cc8802992_o_d.png

  8. Right click and select “scrape similar…”

    http://farm9.staticflickr.com/8200/8264509972_f3a9e5d8e8_o_d.png

  9. A new window will appear – the scraper console

    http://farm9.staticflickr.com/8073/8263440961_9b94e63d56_b_d.jpg

  10. In the scraper console you will see the scraped content

  11. Click on “Save to Google Docs…” to save the scraped content as a Google Spreadsheet.
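
For readers curious about what “scrape similar…” amounts to, the idea can be sketched in Python with only the standard library: a single path expression matches every row that resembles the one you marked, and the rows are then written out as a table. This is a minimal sketch, not the extension's actual code. The HTML snippet, the member names and the function names (scrape_similar, to_csv) are invented for illustration and do not reflect the real parliament.uk markup.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a listing page (not the real markup).
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td><a href="/mp/1">Member One</a></td><td>Northtown</td></tr>
  <tr><td><a href="/mp/2">Member Two</a></td><td>Southville</td></tr>
  <tr><td><a href="/mp/3">Member Three</a></td><td>Eastport</td></tr>
</table>
</body></html>
"""


def scrape_similar(html, row_path):
    """Return one list of cell texts for each element matched by row_path."""
    root = ET.fromstring(html)
    rows = []
    for row in root.findall(row_path):
        # itertext() collects the text of a cell including nested tags.
        rows.append(["".join(cell.itertext()).strip() for cell in row.findall("td")])
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, similar to what a spreadsheet would hold."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Marking one row and choosing "scrape similar" corresponds to a path
# that matches all rows of the same table.
rows = scrape_similar(SAMPLE_HTML, ".//table/tr")
print(to_csv(rows))
```

Note that xml.etree.ElementTree only parses well-formed markup and supports a subset of XPath, so a real page would likely need a dedicated HTML parser.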

Walkthrough: Extended scraping with the Scraper extension

Note: Before beginning this recipe, you may find it useful to understand a bit about HTML. Read our HTML primer.

Easy, wasn’t it? Now let’s do something a little more complicated. Let’s say we’re interested in the roles a specific actress played. A good source for this kind of data is the IMDB. (You can also search sites like DBpedia or Freebase for this kind of information; however, we’ll stick to IMDB to show the principle.)

  1. Let’s say we’re interested in creating a timeline with all the movies the Italian actress Asia Argento ever starred in; where do we start?

  2. The IMDB has a fairly comprehensive archive of actors. Asia Argento’s page is: http://www.imdb.com/name/nm0000782/

  3. If you open the page you’ll see all the roles she ever played, together with a title and the year – let’s scrape this information

  4. Try to scrape it like we did above

  5. You’ll see the list comes out garbled – this is because the list here is structured quite differently.

  6. Go to the scraper console. Notice the small box on the upper left, saying XPath?

  7. XPath is a query language for HTML and XML.

  8. XPath can help you find the elements in the page you’re interested in – all you need to do is find the right element and then write the XPath for it.

  9. Now let’s assemble our table.

  10. You’ll see that our current XPath, the one that selects all of the information, is “//div[3]/div[3]/div[2]/div”

    http://farm9.staticflickr.com/8344/8264510130_ae31697fde_o_d.png

  11. XPath is quite simple: this expression tells the computer to look at the HTML document and select the third <div> element, then the third <div> inside that one, then the second <div> inside that, and finally all <div> elements within it (which, if you count your way down the page, is exactly where our list sits).

  12. However, we’d like to have the data separated out.

  13. To do this use the columns part of the scraper console…

  14. Let’s find our title first – look at the title using Inspect Element

    http://farm9.staticflickr.com/8355/8263441157_b4672d01b2_o_d.png

  15. See how the title is within a <b> tag? Let’s add the tag to our XPath.

  16. The expression seems to work well: let’s make this our first column

  17. In the “Columns” section, change the name of the first column to “title”

  18. Now let’s add the XPath for the title to it

  19. The XPaths in the columns section are relative to the element matched by the main XPath, so “./b” will select the <b> element inside it

  20. Add “./b” to the XPath for the title column and click “scrape”

    http://farm9.staticflickr.com/8357/8263441315_42d6a8745d_o_d.png

  21. See how you only get titles?

  22. Now let’s continue with the year. Years are within a <span>

  23. Create a new column by clicking on the small plus next to your “title” column

  24. Now create the “year” column with XPath “./span”

    http://farm9.staticflickr.com/8347/8263441355_89f4315a78_o_d.png

  25. Click on scrape and see how the year is added

  26. See how easily we got information out of a less structured webpage?
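
The same pattern, one XPath that selects the rows plus relative XPaths such as “./b” and “./span” for the columns, can be sketched in Python. This is only an illustration under stated assumptions: the markup below is a made-up stand-in for a filmography list (not IMDB's real HTML), the film titles are placeholders, and the scrape function and COLUMNS mapping are hypothetical names. It uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified filmography markup (not IMDB's real HTML).
SAMPLE_HTML = """
<html><body>
<div id="filmography">
  <div><span>2012</span><b><a href="/title/a">Example Film A</a></b> (role one)</div>
  <div><span>2010</span><b><a href="/title/b">Example Film B</a></b> (role two)</div>
  <div><b><a href="/title/c">Example Film C</a></b> (role three)</div>
</div>
</body></html>
"""


def scrape(html, row_path, columns):
    """Select rows with row_path, then fill each column from a path relative to the row."""
    root = ET.fromstring(html)
    table = []
    for row in root.findall(row_path):
        record = {}
        for name, column_path in columns.items():
            # The column path is evaluated relative to the current row.
            cell = row.find(column_path)
            record[name] = "".join(cell.itertext()).strip() if cell is not None else ""
        table.append(record)
    return table


# Column names and relative paths, mirroring the "Columns" section of the console.
COLUMNS = {"title": "./b", "year": "./span"}

films = scrape(SAMPLE_HTML, ".//div[@id='filmography']/div", COLUMNS)
for film in films:
    print(film["year"], film["title"])
```

A row without a matching element simply yields an empty cell, which is roughly how a blank entry would appear in the resulting table.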

Any questions? Got stuck? Ask School of Data!

Last updated on Sep 02, 2013.

  • Patrick La Salle

    Is there any way to export the scraped data as a file rather than importing it to Google Docs? I am scraping data that has and needs leading zeros; unfortunately, when you export to Google Docs, the default format deletes the leading zeros.

    • Michael Bauer

      Unfortunately not with the Scraper extension – it only offers Google Docs exports.

      You can try other ways of getting the data, such as using Google Docs (for simple tables) or convextra.com, a bookmarklet to scrape websites. Or Scraperwiki (which gives you the most control, but you’ll need to program your own scraper).

      • Nancy

        tried it. it does not work. it gives me ”some error”

        • Michael Bauer

          Convextra or the scraper extension? The scraper extension does have some issues with windows I haven’t been able to figure out :/

    • http://import.io/ Dan Cave

      If you want something that will export to Excel, HTML JSON or CSV and still free try http://import.io (Disclaimer: I do work for import.io).

  • Peter Olsen

    How do I stop it converting all dates to US format when I Export to Google Docs?

    eg. I have a date 25 Feb 1827. It converts it to 2/25/1827. When I then copy the file to Excel it cannot handle any dates earlier than 1/1/1900, so I am left with (thousands) of dates in a useless format.

  • owenWatson

    Are you sure about the extension? The only ones I can find in Chrome Store are ScreenScraper and Regex Scraper. Installed both of them and right-clicking on the selected text doesn’t bring up the Scrape item.

  • http://twitter.com/smichaelgriffin Michael Griffin

    Scraper is one of my fave extensions – I can’t believe it still isn’t appearing in Chrome store search!

  • miffysmiffy1979

    When I try and export to Google Docs, I get the message “We are unable to verify the name associated with this application because it runs on your computer, as opposed to on a website. We recommend you do not allow access unless you trust the application.” When I click “Allow”, it just hangs and never exports… any ideas?

    • sandesh

      Even I am facing the same issue.. did u get the solution?

  • http://webminer.avantprime.com Tom Snell

    For those of you who wish to use an independent scraper, check out the Web Miner @ http://webminer.avantprime.com. While this is not built into Chrome, it is definitely one of the best scrapers I have used.

  • Milan Budimkic

    This is one of good solutions…But, is there some scraper for Facebook? I found one tool to sent promo posts on FB (http://post-it-quickly.com/), but what is quickest way to find (scrape) FB groups (maybe even related to promo topic)?

  • Revanth Gulla

    I am unable to export the data to a Google spreadsheet. Can you help me with this?