Scraping a website with Python + Scrapy

Logo of Scrapy.

In this tutorial, you will learn how to write a simple web scraper in Python using the Scrapy framework. The Data Blogger website will be used as an example in this article.

Scrapy describes itself as "an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way".

Content + Link extractor

The purpose of Scrapy is to extract content and links from a website by recursively following all the links on the given website.

Installing Scrapy

According to http://scrapy.org/, we just have to execute the following command to install Scrapy:

pip install scrapy

If this does not work, please have a look at the detailed installation instructions.
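
If the installation succeeded, the scrapy command-line tool should be available. A quick way to check this is to ask it for its version:

scrapy version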

Setting up the project

Now we will create the folder structure for the project. For the Data Blogger scraper, the following command is used; you can change datablogger_scraper to the name of your own project.

scrapy startproject datablogger_scraper
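
This command generates a project skeleton. The exact set of files can differ a bit per Scrapy version, but it roughly looks like this:

datablogger_scraper/
    scrapy.cfg                # deploy configuration file
    datablogger_scraper/      # the project's Python module
        __init__.py
        items.py              # item definitions (see the next section)
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/              # the spiders will be placed here
            __init__.py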

Creating an Object

The next thing to do is to create a spider that will crawl the website(s) of interest. The spider needs to know what data it should collect, and this data can be encapsulated in an object. In this tutorial we will crawl the internal links of a website. A link is defined here as an object with a source URL (the page on which the link is found) and a destination URL (the page the link navigates to when clicked). A link is called internal if both the source URL and the destination URL are on the website itself. The object is defined in items.py, and for this project, items.py has the following contents:

import scrapy

class DatabloggerScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()

Notice that in your own project, you can define any object you would like to scrape! For example, you could specify a Game Console object (with properties "vendor", "price" and "release date") when you are scraping a website about game consoles. If you are scraping information about music from multiple websites, you could define an object with properties like "artist", "release date" and "genre".
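
As a hypothetical illustration of the game console example above (the class and field names are only illustrative), such an items.py could look like this:

import scrapy

class GameConsoleItem(scrapy.Item):
    # Manufacturer of the console
    vendor = scrapy.Field()
    # Price of the console
    price = scrapy.Field()
    # Date on which the console was released
    release_date = scrapy.Field()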

Creating the Spider

Now that we have encapsulated the data in an object, we can start creating the spider. First, navigate to the project folder and then execute the following command to create a spider (which can then be found in the spiders/ directory):

scrapy genspider datablogger data-blogger.com 

Now, a spider is created (spiders/datablogger.py). You can customize this file as much as you want. I ended up with the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from datablogger_scraper.items import DatabloggerScraperItem


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "datablogger"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["data-blogger.com"]

    # The URLs to start with
    start_urls = ["https://www.data-blogger.com/"]

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # Check whether the domain of the URL of the link is allowed; so whether it is in one of the allowed domains
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            # If it is allowed, create a new item and add it to the list of found items
            if is_allowed:
                item = DatabloggerScraperItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        # Return all the found items
        return items

A few things are worth mentioning. The spider extends the CrawlSpider class, whose built-in parse method takes care of following links recursively. In the code, one rule is defined which tells the crawler to follow every link it encounters. The unique option makes sure that no link is parsed twice, and the canonicalize option normalizes URLs so that different forms of the same URL are treated as one link.

LinkExtractor

The LinkExtractor is a class whose purpose is to extract links from web pages.
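
As a small sketch of how it can be tuned (the allow and deny patterns below are just hypothetical examples), you can restrict which links are extracted:

from scrapy.linkextractors import LinkExtractor

# Example: only extract category pages and skip tag pages
link_extractor = LinkExtractor(
    allow=(r'/category/',),
    deny=(r'/tag/',),
    canonicalize=True,
    unique=True
)
# Inside a spider callback you could then call:
# links = link_extractor.extract_links(response)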

Executing the Spider

Go to the root folder of your project. Then execute the following command:

scrapy crawl datablogger -o links.csv -t csv

This command runs the spider over the website and generates a CSV file. In my case, I got a CSV file named links.csv with the following content:

url_from,url_to
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/cern/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,http://www.data-blogger.com/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/data-science/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/software-science/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/mathematics/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/projects/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/competition/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/about-me/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/contact/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/hire-me/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/author/admin/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.facebook.com/sharer/sharer.php?t=Monitoring+your+cluster+in+just+a+few+minutes+using+ISA&u=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://twitter.com/intent/tweet?text=Monitoring+your+cluster+in+just+a+few+minutes+using+ISA&url=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://plus.google.com/share?url=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/cluster/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/isa/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/monitoring/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/software/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/17/cern-deel-3-trip-naar-zurich/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/19/project-euler-using-scala-problem-1/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/08/13/apache-flink-the-next-distributed-data-processing-revolution/
https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/24/summing-the-fibonacci-sequence/
https://www.data-blogger.com/2016/07/17/why-scala/,https://www.data-blogger.com/2016/07/17/why-scala/
https://www.data-blogger.com/2016/07/17/why-scala/,https://www.data-blogger.com/
...
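
If you want to process the exported links further, here is a minimal sketch using Python's csv module (assuming the output file is named links.csv as above):

import csv
from collections import Counter

# Count how often each destination URL is linked to
with open('links.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    counts = Counter(row['url_to'] for row in reader)

# Print the ten most linked-to pages
for url, count in counts.most_common(10):
    print(count, url)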

Conclusion

It is relatively easy to write your own spider with Scrapy. You can specify the data you want to scrape in an object and you can specify the behaviour of your crawler. If you have any questions, feel free to ask them in the comments section!

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: mail@kevinjacobs.nl.

  • dattasai vadapalli

    Hi
    I did the same code of yours locally and it worked fine, but when I tried it with another URL it is not working and it is throwing an error

    • Kevin Jacobs

      Hi,

      Please make sure to replace data-blogger.com in the allowed_domains and start_urls variables with your own domain:

      # The domains that are allowed (links to other domains are skipped)
      allowed_domains = ["example.com"]

      # The URLs to start with
      start_urls = ["https://www.example.com/"]

      Otherwise the crawler will start and stay on http://www.data-blogger.com.

  • dattasai vadapalli

    I got the links in the CSV file, but I am still getting some errors in the console. Can you please check and tell what type of errors those are and how to get rid of them?
    https://uploads.disquscdn.com/images/ae9c7ab202d1cefe11dffa1de5cd025ff10ae3ba144b132272b2dc59ca4484c7.png

    thanks in advance

  • dattasai vadapalli

    Hi Kevin,

    Can you tell me how to write a spider which retrieves all the links, along with their respective responses, into different files?
    For example, a 200 response goes into a success file
    and a 403 into another file, like that.

    Thanks in advance.

    • Kevin Jacobs

      Hi,

      You can create different crawlers for that purpose (to keep things simple), say a Crawler403 and a Crawler200. Then, in the parse method you can check the status of your response as follows:


      def parse(self, response):
          if response.status == 403:
              # Your code here

      Scrapy only considers response codes in the 200 range by default. If you also want to allow, for example, a 403 status code, you have to tell Scrapy the following:


      class Crawler403:
          handle_httpstatus_list = [403]

          def parse(self, response):
              # Etcetera

      Good luck with coding! I will not post the complete solution here, but I think this should be a good starting point :-).

  • dattasai vadapalli

    How can I write the links to different files from the same crawler, based on the response status?

    • Kevin Jacobs

      See my last reply and try to change the following code snippet from the blog post in your favour 🙂

      scrapy crawl datablogger -o links.csv -t csv

      You can also create a base class if you’d like to have one crawler and extend this class with different response code logic.
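
      As a minimal, untested sketch of that base-class idea (all class names and the file naming scheme below are made up for illustration):

      import scrapy


      class BaseLinkSpider(scrapy.Spider):
          # Shared crawling logic; subclasses only decide which status codes they handle
          start_urls = ["https://www.example.com/"]

          def parse(self, response):
              # Append the visited URL to a file named after the response status
              with open('links_%d.txt' % response.status, 'a') as output_file:
                  output_file.write(response.url + '\n')


      class OkLinkSpider(BaseLinkSpider):
          name = "links_ok"


      class ForbiddenLinkSpider(BaseLinkSpider):
          name = "links_forbidden"
          # Tell Scrapy to pass 403 responses to the callback instead of dropping them
          handle_httpstatus_list = [403]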

      • dattasai vadapalli

        Not while running; I am asking whether we can write the links into a file in the code itself,
        like
        fwrite(filename, mode) ..

        • Kevin Jacobs

          Yes, that is possible. Try this:

          # Method for parsing items
          def parse_items(self, response):
              # The list of items that are found on the particular page
              items = []
              # Only extract canonicalized and unique links (with respect to the current page)
              links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
              # Now go through all the found links
              for link in links:
                  # Check whether the domain of the URL of the link is allowed; so whether it is in one of the allowed domains
                  is_allowed = False
                  for allowed_domain in self.allowed_domains:
                      if allowed_domain in link.url:
                          is_allowed = True
                  # If it is allowed, create a new item and add it to the list of found items
                  if is_allowed:
                      item = DatabloggerScraperItem()
                      item['url_from'] = response.url
                      item['url_to'] = link.url
                      items.append(item)
              # Write to a file depending on the status code
              with open('response_%d.txt' % response.status, 'a') as output_file:
                  # Write out all found links and put a tab between the from URL and the destination URL
                  for item in items:
                      output_file.write('%s\t%s\n' % (item['url_from'], item['url_to']))
              # Return all the found items to continue parsing
              return items

          Haven’t tested it, but something like this should work for you?

          • dattasai vadapalli

            yes i will try and let you know back
            Thanks

  • dattasai vadapalli

    How to display the time taken for the crawling? I have tried as:
    https://uploads.disquscdn.com/images/21b4504bec1b6708a56d11d9a9577ecfcf55d0782774de50a282cdb0630512b1.png
    and the output is https://uploads.disquscdn.com/images/5bb05d406085b42131c2c545d1c0abfb286e09249694491cb25db2cc6e4b9575.png

    I am not able to get the time taken for this simple crawling.
    Please help me to display the time taken for crawling, in hours or minutes.
    Thanks in advance

    • Kevin Jacobs

      Hi,

      You can try to log the time in the constructor of the class:


      import timeit

      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.init_time = timeit.default_timer()

      def parse(self, response):
          # ... your parsing logic ...
          print(timeit.default_timer() - self.init_time)