How to scrape a website using Python + Scrapy in 5 simple steps

In this Python Scrapy tutorial, you will learn how to write a simple webscraper in Python using the Scrapy framework. The Data Blogger website will be used as an example in this article.

Scrapy: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

By the way, if you are interested in scraping Tweets, you should definitely read this article.

Products from Amazon.com

    Introduction

    In this tutorial we will build the webscraper using only Scrapy + Python 3 (or Python 2) and no more! The tutorial has both Python 2 and Python 3 support. The possibilities are endless. Beware that some webscrapes are not legal! For example, although it is possible, it is not allowed to use Scrapy or any other webscraper to scrape LinkedIn (https://www.linkedin.com/). However, LinkedIn lost in one case in 2017.

    Content + Link extractor

    The purpose of Scrapy is to extract content and links from a website. This is done by recursively following all the links on the given website.

    Step 1: Installing Scrapy

    According to the website of Scrapy, we just have to execute the following command to install Scrapy:

    pip install scrapy

    Step 2: Setting up the project

    Now we will create the folder structure for your project. For the Data Blogger scraper, the following command is used. You can change datablogger_scraper to the name of your project.

    scrapy startproject datablogger_scraper

    Step 3: Creating an Object

    The next thing to do, is to create a spider that will crawl the website(s) of interest. The spider needs to know what data is crawled. This data can be put into an object. In this tutorial we will crawl internal links of a website. A link is defined as an object having a source URL and a destination URL. The source URL is the URL on which the link can be found. It also has a destination URL to which the link is navigating to when it is clicked. A link is called an internal link if both the source URL and destination URL are on the website itself.

    Scrape Object Implementation

    The object is defined in items.py and for this project, items.py has the following contents:

    import scrapy
    
    class DatabloggerScraperItem(scrapy.Item):
        # The source URL
        url_from = scrapy.Field()
        # The destination URL
        url_to = scrapy.Field()

    Notice that you can define any object you would like to crawl! For example, you can specify an object Game Console (with properties “vendor”, “price” and “release date”) when you are scraping a website about Game Consoles. If you are scraping information about music from multiple websites, you could define an object with properties like “artist”, “release date” and “genre”. On LinkedIn you could scrape a “Person” with properties “education”, “work” and “age”.

    Step 4: Creating the Spider

    Now we have encapsulated the data into an object, we can start creating the spider. First, we will navigate towards the project folder. Then, we will execute the following command to create a spider (which can then be found in the spiders/ directory):

    scrapy genspider datablogger data-blogger.com 

    Spider Implementation

    Now, a spider is created (spiders/datablogger.py). You can customize this file as much as you want. I ended up with the following code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractor import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from datablogger_scraper.items import DatabloggerScraperItem
    
    
    class DatabloggerSpider(CrawlSpider):
        # The name of the spider
        name = "datablogger"
    
        # The domains that are allowed (links to other domains are skipped)
        allowed_domains = ["data-blogger.com"]
    
        # The URLs to start with
        start_urls = ["https://www.data-blogger.com/"]
    
        # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
        rules = [
            Rule(
                LinkExtractor(
                    canonicalize=True,
                    unique=True
                ),
                follow=True,
                callback="parse_items"
            )
        ]
    
        # Method which starts the requests by visiting all URLs specified in start_urls
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, dont_filter=True)
    
        # Method for parsing items
        def parse_items(self, response):
            # The list of items that are found on the particular page
            items = []
            # Only extract canonicalized and unique links (with respect to the current page)
            links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
            # Now go through all the found links
            for link in links:
                # Check whether the domain of the URL of the link is allowed; so whether it is in one of the allowed domains
                is_allowed = False
                for allowed_domain in self.allowed_domains:
                    if allowed_domain in link.url:
                        is_allowed = True
                # If it is allowed, create a new item and add it to the list of found items
                if is_allowed:
                    item = DatabloggerScraperItem()
                    item['url_from'] = response.url
                    item['url_to'] = link.url
                    items.append(item)
            # Return all the found items
            return items

    A few things are worth mentioning. The crawler extends the CrawlSpider object, which has a parse method for scraping a website recursively. In the code, one rule is defined. This rule tells the crawler to follow all links it encounters. The rule also specifies that only unique links are parsed, so none of the links will be parsed twice! Furthermore, the canonicalize property makes sure that links are not parsed twice.

    LinkExtractor

    The LinkExtractor is a module with the purpose of extracting links from web pages.

    Step 5: Executing the Spider

    Go to the root folder of your project. Then execute the following command:

    scrapy crawl datablogger -o links.csv -t csv

    This command then runs over your website and generates a CSV file to store the data into. In my case, I got a CSV file named links.csv with the following content:

    url_from,url_to
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/cern/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,http://www.data-blogger.com/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/data-science/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/software-science/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/mathematics/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/projects/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/category/competition/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/about-me/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/contact/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/hire-me/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/author/admin/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.facebook.com/sharer/sharer.php?t=Monitoring+your+cluster+in+just+a+few+minutes+using+ISA&u=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://twitter.com/intent/tweet?text=Monitoring+your+cluster+in+just+a+few+minutes+using+ISA&url=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://plus.google.com/share?url=https%3A%2F%2Fwww.data-blogger.com%2F2016%2F07%2F18%2Fmonitoring-your-cluster-in-just-a-few-minutes%2F
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/cluster/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/isa/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/monitoring/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/tag/software/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/17/cern-deel-3-trip-naar-zurich/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/19/project-euler-using-scala-problem-1/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/08/13/apache-flink-the-next-distributed-data-processing-revolution/
    https://www.data-blogger.com/2016/07/18/monitoring-your-cluster-in-just-a-few-minutes/,https://www.data-blogger.com/2016/07/24/summing-the-fibonacci-sequence/
    https://www.data-blogger.com/2016/07/17/why-scala/,https://www.data-blogger.com/2016/07/17/why-scala/
    https://www.data-blogger.com/2016/07/17/why-scala/,https://www.data-blogger.com/
    ...

    Conclusion

    It is relatively easy to write your own spider with Scrapy. You can specify the data you want to scrape in an object and you can specify the behaviour of your crawler. If you have any questions, feel free to ask them in the comments section!

    Kevin Jacobs

    Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: kevin8080nl@gmail.com. Want to write for our website? Then check out our write for us page!