Scrapy

From ActiveArchives

According to http://scrapy.org/:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Install with pip,

sudo pip install scrapy

I decided to follow the tutorial, but scraping the HISK website (http://hisk.edu) instead of the tutorial's example site.

I used urlparse to derive a local filename from the response URL:

from scrapy.spider import BaseSpider
import urlparse, os
 
class HiskSpider(BaseSpider):
    name = "hisk"
    allowed_domains = ["hisk.edu"]
    start_urls = [
        "http://hisk.edu/lecturers.php?la=en",
    ]
 
    def parse(self, response):
        # derive a local filename from the URL path, e.g. "lecturers.php"
        filename = os.path.split(urlparse.urlparse(response.url).path)[1]
        with open(filename, 'wb') as f:
            f.write(response.body)
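The filename step is plain standard library, so it can be checked on its own. The same extraction on the start URL, shown with Python 3's urllib.parse (where the old urlparse module now lives):

```python
from urllib.parse import urlparse
import os

url = "http://hisk.edu/lecturers.php?la=en"
path = urlparse(url).path          # "/lecturers.php" (the query string is dropped)
filename = os.path.split(path)[1]  # "lecturers.php"
print(filename)
```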
  • A nice tutorial on XPaths
  • Scrapy's "shell mode" is very nice: give it a URL and it drops you into an interactive Python session with the parsed page loaded, ready for trying out XPaths
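The XPath can also be sanity-checked without Scrapy at all. A rough sketch using the standard library's xml.etree on a hypothetical fragment shaped like the lecturer list (ElementTree's limited XPath handles the class predicate; iterating over descendants stands in for the trailing //a):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the HISK lecturer list.
html = """
<div>
  <ul class="lecturers">
    <li><a href="lecturers.php?la=en&amp;id=184">Roel Arkesteijn</a></li>
    <li><a href="lecturers.php?la=en&amp;id=309">Charif Benhelima</a></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# same predicate as in the Scrapy XPath: //ul[@class="lecturers"]
ul = root.find(".//ul[@class='lecturers']")
rows = [{"name": a.text, "url": a.get("href")} for a in ul.iter("a")]
print(rows)
```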
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from hisk.items import HiskLecturer
import urlparse, os
 
class HiskSpider(BaseSpider):
    name = "hisk"
    allowed_domains = ["hisk.edu"]
    start_urls = [
        "http://hisk.edu/lecturers.php?la=en",
    ]
 
    def parse(self, response):
        # filename = os.path.split(urlparse.urlparse(response.url).path)[1]
        # open(filename, 'wb').write(response.body)
        hxs = HtmlXPathSelector(response)
        # each lecturer is a link inside <ul class="lecturers">
        links = hxs.select('//ul[@class="lecturers"]//a')
        lecturers = []
        for link in links:
            lecturer = HiskLecturer()
            lecturer['name'] = link.select('text()')[0].extract()
            lecturer['url'] = link.select('@href')[0].extract()
            lecturers.append(lecturer)
        return lecturers


Running the spider with:

scrapy crawl hisk -o scrape.json -t json

produces:

[{"url": "lecturers.php?la=en&id=184&t=current&y=2011", "name": "Roel Arkesteijn"},
{"url": "lecturers.php?la=en&id=309&t=current&y=2011", "name": "Charif Benhelima"},
{"url": "lecturers.php?la=en&id=262&t=current&y=2011", "name": "Pierre Bismuth"},
{"url": "lecturers.php?la=en&id=329&t=current&y=2011", "name": "Jota Castro"},
{"url": "lecturers.php?la=en&id=131&t=current&y=2011", "name": "Bart De Baere"},
{"url": "lecturers.php?la=en&id=189&t=current&y=2011", "name": "Jan Debbaut"},
{"url": "lecturers.php?la=en&id=313&t=current&y=2011", "name": "Stella Lohaus"},
{"url": "lecturers.php?la=en&id=318&t=current&y=2011", "name": "Mihnea Mircan"},
{"url": "lecturers.php?la=en&id=328&t=current&y=2011", "name": "Vanessa Joan  M\u00fcller"},
{"url": "lecturers.php?la=en&id=117&t=current&y=2011", "name": "Gertrud Sandqvist"},
{"url": "lecturers.php?la=en&id=326&t=current&y=2011", "name": "Nicolaus Schafhausen"},
{"url": "lecturers.php?la=en&id=89&t=current&y=2011", "name": "Anna Tilroe"},
{"url": "lecturers.php?la=en&id=90&t=current&y=2011", "name": "Guy Van Belle"},
{"url": "lecturers.php?la=en&id=203&t=current&y=2011", "name": "Philippe Van Cauteren"},
{"url": "lecturers.php?la=en&id=287&t=current&y=2011", "name": "Jan Van Imschoot"},
{"url": "lecturers.php?la=en&id=208&t=current&y=2011", "name": "Jan Van Woensel"},
{"url": "lecturers.php?la=en&id=327&t=current&y=2011", "name": "Mirjam Varadinis"},
{"url": "lecturers.php?la=en&id=325&t=current&y=2011", "name": "Jonas \u017dakaitis"}]

Nice, but now I wonder how to recursively follow the links, as this is where the actual Lecturer data (text & years) is.

Time to understand the scraping cycle
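Peeking ahead, that cycle can be sketched in plain Python: a callback parses a downloaded page and yields either items or further requests, while the engine works through a queue of pending URLs. A toy version with a hypothetical in-memory "site" standing in for the HTTP downloads (all URLs and page bodies below are made up for illustration):

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": the index body is a list of hrefs,
# a detail body is the lecturer text. Stands in for real HTTP fetches.
SITE = {
    "http://hisk.edu/lecturers.php?la=en": [
        "lecturers.php?la=en&id=184",
        "lecturers.php?la=en&id=309",
    ],
    "http://hisk.edu/lecturers.php?la=en&id=184": "Roel Arkesteijn: bio text",
    "http://hisk.edu/lecturers.php?la=en&id=309": "Charif Benhelima: bio text",
}

def parse_index(url, body):
    # index callback: yields follow-up requests instead of items
    for href in body:
        yield ("request", urljoin(url, href), parse_lecturer)

def parse_lecturer(url, body):
    # detail callback: yields the scraped item
    yield ("item", {"url": url, "text": body}, None)

def crawl(start_url):
    queue = deque([(start_url, parse_index)])
    items = []
    while queue:
        url, callback = queue.popleft()
        body = SITE[url]  # the "download" step
        for kind, value, next_cb in callback(url, body):
            if kind == "request":
                queue.append((value, next_cb))
            else:
                items.append(value)
    return items

print(len(crawl("http://hisk.edu/lecturers.php?la=en")))
```

In real Scrapy the same shape appears as a parse method that returns Request objects with a callback for the detail pages, mixed in with the items.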
