In this tutorial, you will learn how to write a simple web scraper in Python using the Scrapy framework. The Data Blogger website will be used as an example in this article.
Scrapy describes itself as "an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way."
Content + Link extractor
The purpose of Scrapy is to extract content and links from a website by recursively following all the links on the given website.
According to http://scrapy.org/, we just have to execute the following command to install Scrapy:
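Assuming Python and pip are already set up, the installation is a single command:

```shell
# Install Scrapy and its dependencies from PyPI
pip install scrapy
```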
If this does not work, please have a look at the detailed installation instructions.
Setting up the project
Now we will create the folder structure for your project. For the Data Blogger scraper, the following command is used. You can change datablogger_scraper to the name of your project.
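The scaffold is generated with Scrapy's startproject command (datablogger_scraper is just the example name used here):

```shell
# Creates a datablogger_scraper/ folder containing scrapy.cfg and a
# Python package with items.py, settings.py, pipelines.py and a spiders/ directory
scrapy startproject datablogger_scraper
```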
Creating an Object
The next thing to do is to create a spider that will crawl the website(s) of interest. The spider needs to know what data to collect, and this data can be put into an object. In this tutorial, we will crawl the internal links of a website. A link is defined as an object with a source URL (the page on which the link is found) and a destination URL (the page to which the link navigates when clicked). A link is called internal if both its source URL and its destination URL are on the website itself. The object is defined in items.py, which for this project has the following contents:
Notice that in your own project, you can define any object you would like to scrape! For example, you can specify a GameConsole object (with properties like "vendor", "price" and "release date") when you are scraping a website about game consoles. If you are scraping information about music from multiple websites, you could define an object with properties like "artist", "release date" and "genre".
Creating the Spider
Now that we have encapsulated the data in an object, we can start creating the spider. First, navigate to the project folder, then execute the following command to create a spider (which can then be found in the spiders/ directory):
Now, a spider is created (spiders/datablogger.py). You can customize this file as much as you want. I ended up with the following code:
A few things are worth mentioning. The crawler extends the CrawlSpider class, which recursively follows links according to its rules. In the code, one rule is defined which tells the crawler to follow every link it encounters. The rule also specifies, via unique=True, that each link is extracted only once, so none of the links will be parsed twice! Furthermore, canonicalize=True normalizes URLs, so that equivalent URLs (for example, the same address with and without a trailing slash) are treated as the same link.
The LinkExtractor is a module with the purpose of extracting links from web pages.
Executing the Spider
Go to the root folder of your project. Then execute the following command:
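The crawl is started with the crawl command; the -o flag writes the scraped items to a feed file, with the format inferred from the extension:

```shell
# Run the "datablogger" spider and export all scraped items to links.csv
scrapy crawl datablogger -o links.csv
```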
This command crawls the website and exports the scraped items to a CSV file. In my case, I got a CSV file named links.csv with the following content:
It is relatively easy to write your own spider with Scrapy. You can specify the data you want to scrape in an object and you can specify the behaviour of your crawler. If you have any questions, feel free to ask them in the comments section!