In this article, we will build a basic news search engine that can find news by keyword. Since this is a complex system, I will first split it into smaller modules. The first module retrieves news pages from the internet. This module is called a scraper (or web scraper) and is written in Python. The second module is the search module, which maintains a file called the index. The index stores, for each keyword, a list of the documents that contain it. For example, several documents contain the term “music”, so the index holds the term “music” together with a list of references to all documents that contain that word. But we will start with our scraper.
For the web scraper, we will use a queue that holds the pages that are about to be scraped. Once a page has been handled, it is put onto a list so that every page is handled only once. You will run into many difficulties during web scraping. One of them is that a URL with a hash is (statically) equivalent to the same URL without the hash, because the part after the # is not sent to the server. For example, http://domain.com/#hashpart is statically the same as http://domain.com/. This hash part can be removed easily:
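The original listing is not reproduced here, so the following is only a minimal sketch of one way to do it, using urldefrag from Python's standard library. The function name remove_hash is a hypothetical name chosen for illustration.

```python
from urllib.parse import urldefrag


def remove_hash(url):
    """Return the URL without its fragment (the part after '#')."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


print(remove_hash("http://domain.com/#hashpart"))  # http://domain.com/
```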
Another issue is that links come in two types: relative and absolute. An absolute link contains the full address including the domain name, while a relative link does not. For example, http://domain.com/page is an absolute URL and /page is a relative URL. If the base URL is http://domain.com/, the relative URL /page resolves to http://domain.com/page, so both links point to the same page. If the base URL is http://test.com/ instead, the relative URL resolves to http://test.com/page. Making relative links absolute is done using the following method:
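As a hedged sketch (the original method is not shown here), the standard library function urljoin already implements this resolution. The wrapper name make_absolute is an assumption made for this example.

```python
from urllib.parse import urljoin


def make_absolute(base_url, link):
    """Resolve a link against the page it was found on.

    Absolute links are returned unchanged; relative links are
    combined with the base URL.
    """
    return urljoin(base_url, link)


print(make_absolute("http://test.com/", "/page"))                   # http://test.com/page
print(make_absolute("http://test.com/", "http://domain.com/page"))  # http://domain.com/page
```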
Suppose we want to scrape http://domain.com/ and http://www.test.com/. Then we don’t want to follow links starting with http://external.com/. The URLs http://domain.com/ and http://www.test.com/ are called base URLs. The following code goes through a given list of URLs and checks whether they are internal, i.e. whether they start with one of the URLs in a list called base_urls:
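A minimal sketch of such a check could look as follows. Only the name base_urls comes from the text; the function names is_internal and filter_internal are hypothetical.

```python
def is_internal(url, base_urls):
    """True if the URL starts with at least one of the base URLs."""
    return any(url.startswith(base) for base in base_urls)


def filter_internal(urls, base_urls):
    """Keep only the URLs that belong to one of the sites being scraped."""
    return [url for url in urls if is_internal(url, base_urls)]


base_urls = ["http://domain.com/", "http://www.test.com/"]
found = ["http://domain.com/page", "http://external.com/", "http://www.test.com/a"]
print(filter_internal(found, base_urls))
# ['http://domain.com/page', 'http://www.test.com/a']
```

Note that a plain prefix test is deliberately simple: it treats http://domain.com/ and https://domain.com/ as different sites.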
The implementation for the queue is straightforward. All of the functionality is combined into one class called Scraper:
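The original class is not reproduced here. The sketch below shows one way the pieces described so far could fit together, using only the standard library. Apart from the class name Scraper and the method name scrape, every name (LinkParser, fetch_html, the fetch parameter, seen) is an assumption made for illustration. It also uses a set rather than a list to remember handled pages, since membership tests on a set are faster.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page as text (assumption: UTF-8, bad bytes replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class Scraper:
    def __init__(self, base_urls, fetch=fetch_html):
        self.base_urls = list(base_urls)
        self.fetch = fetch                  # replaceable, e.g. for offline tests
        self.queue = deque(self.base_urls)  # pages still to be scraped
        self.seen = set(self.base_urls)     # pages already queued or handled

    def is_internal(self, url):
        return any(url.startswith(base) for base in self.base_urls)

    def extract_links(self, page_url, html):
        """Return the internal links on a page as absolute, hash-free URLs."""
        parser = LinkParser()
        parser.feed(html)
        links = []
        for href in parser.links:
            absolute = urldefrag(urljoin(page_url, href)).url
            if self.is_internal(absolute):
                links.append(absolute)
        return links

    def scrape(self, max_pages):
        """Handle at most max_pages pages; return how many were handled."""
        handled = 0
        while self.queue and handled < max_pages:
            url = self.queue.popleft()
            handled += 1
            print("Scraping... %d / %d (%s)" % (handled, max_pages, url))
            try:
                html = self.fetch(url)
            except Exception as error:  # network errors, bad status codes, ...
                print("  skipped: %s" % error)
                continue
            for link in self.extract_links(url, html):
                if link not in self.seen:
                    self.seen.add(link)
                    self.queue.append(link)
        return handled
```

Because the queue is first-in, first-out, pages are visited breadth-first, so the exact order can differ from the output shown in this article.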
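Now it is easy to call the scraper. In the run below, it starts from the base URLs http://www.bbc.com/ and http://phys.org/ and stops after five pages.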
This will give the following output:
Scraping... 1 / 5 (http://www.bbc.com/)
Scraping... 2 / 5 (http://phys.org/)
Scraping... 3 / 5 (http://phys.org/help/)
Scraping... 4 / 5 (http://www.bbc.com/tv/)
Scraping... 5 / 5 (http://phys.org/feeds/)
Now we have the tools to scrape pages. The next step is to make the found text searchable and this will be done in our search module.
By the way, if you are interested in implementing a web scraper in Python and Scrapy, you should definitely read this article. It explains step-by-step how you should implement it.
Search module (a.k.a. Indexer)
Now we will create our search module. In fact, this is an indexer (more precisely, it builds an inverted index). An index can be found in almost every book: at the end of the book is a long list of terms, each referring to the pages where the term occurs. The indexer does the same. It makes a list of words and refers to the documents in which each word occurs. Only a few things have to happen. We have to extract tokens from a text. To simplify things, tokens here are simply words separated by spaces. The text is first tokenized (in other words, the tokens are extracted), and for every token an entry pointing to the corresponding URL is added to the index. This results in the following code:
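Since the original code is not shown here, the following is a minimal sketch of an inverted index under the simplification described above (tokens are separated by whitespace). The names tokenize, SearchIndex, add and find are assumptions, and this version keeps the index in memory rather than in a file.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into tokens; a token is any run of non-space characters."""
    return text.split()


class SearchIndex:
    """Inverted index: maps each token to the set of URLs containing it."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, url, text):
        """Tokenize the text and register the URL under every token."""
        for token in tokenize(text):
            self.index[token].add(url)

    def find(self, term):
        """Return a sorted list of all URLs whose text contains the term."""
        # .get avoids creating empty entries for unknown terms.
        return sorted(self.index.get(term, set()))


index = SearchIndex()
index.add("http://domain.com/a", "live music tonight")
index.add("http://domain.com/b", "music and sport")
print(index.find("music"))  # ['http://domain.com/a', 'http://domain.com/b']
print(index.find("sport"))  # ['http://domain.com/b']
```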
Combining the modules
In order to make the two modules work together, a slight modification must be made to our Scraper module. Change the scrape method to the following:
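The modified method is not reproduced here. The essential step is that each downloaded page is handed to the indexer, and the sketch below shows only that step. It assumes the indexer exposes some callable that takes a URL and its text. TextExtractor, html_to_text, index_page and add_to_index are hypothetical names, and stripping the HTML tags first is an assumption about how the text is prepared.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def index_page(url, html, add_to_index):
    """Call this inside scrape(), right after a page has been downloaded."""
    add_to_index(url, html_to_text(html))


# Usage with a stand-in for the indexer: a plain list that records the calls.
calls = []
index_page("http://domain.com/", "<h1>Sport</h1><p>Football news</p>",
           lambda url, text: calls.append((url, text)))
print(calls)  # [('http://domain.com/', 'Sport Football news')]
```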
And now everything is working together! To find sport-related URLs, let the scraper handle 20 pages and then search the index for the sport keyword.
The result was the following:
Scraping... 1 / 20 (http://www.bbc.com/)
Scraping... 2 / 20 (http://phys.org/)
Scraping... 3 / 20 (http://phys.org/help/)
Scraping... 4 / 20 (http://www.bbc.com/tv/)
Scraping... 5 / 20 (http://phys.org/feeds/)
Scraping... 6 / 20 (http://www.bbc.com/news)
Scraping... 7 / 20 (http://www.bbc.com/cbbc)
Scraping... 8 / 20 (http://phys.org/weblog/)
Scraping... 9 / 20 (http://phys.org/search/)
Scraping... 10 / 20 (http://www.bbc.com/sport)
Scraping... 11 / 20 (http://www.bbc.com/urdu/)
Scraping... 12 / 20 (http://www.bbc.com/food/)
Scraping... 13 / 20 (http://www.bbc.com/autos)
Scraping... 14 / 20 (http://www.bbc.com/arts/)
Scraping... 15 / 20 (http://www.bbc.com/earth)
Scraping... 16 / 20 (http://www.bbc.com/news/)
Scraping... 17 / 20 (http://phys.org/archive/)
Scraping... 18 / 20 (http://www.bbc.com/cbbc/)
Scraping... 19 / 20 (http://www.bbc.com/local/)
Scraping... 20 / 20 (http://www.bbc.com/hausa/)
Found sport articles: http://www.bbc.com/sport
So, now our basic news search engine indeed has found the correct URL!
What are the next steps? Consider the tokenizer. If you really want to do a good job, dive into a text mining book and learn about tokenization. A simple improvement (though this is open to discussion) is to make all tokens lowercase. Then both “Cat” and “cat” are found when searching for “Cat”. The scraper also has some problems. A website could contain so-called spider traps, where the crawler gets trapped by following an infinite number of links. Luckily, there are many open source crawlers that have implemented features to avoid this.
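The lowercase idea can be sketched in a few lines. The function name tokenize is an assumption. Stripping punctuation with a regular expression goes one step beyond what is suggested above and is included only to show how “cat,” and “cat” end up as the same token.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


print(tokenize("Cat, cat and CAT!"))  # ['cat', 'cat', 'and', 'cat']
```

For searches to match, the search term has to pass through the same tokenizer as the indexed text.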
If you like mathematics and are interested in word vectors, you can read more about them here.
The full implementation of the news search engine can be found on GitHub.
Try to implement a tokenizer that makes all tokens lowercase. Also try to implement a better “find” method that can also find multiple words (hint: use the tokenizer again!).
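If you want to compare notes afterwards, here is one possible solution sketch. It assumes that a multi-word query should return only documents containing all of its words, which is a design choice rather than the only reasonable one. The names tokenize, build_index and find_all are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(documents):
    """Build an inverted index from a dict that maps URL to text."""
    index = defaultdict(set)
    for url, text in documents.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def find_all(index, query):
    """Return the URLs that contain every word of the query."""
    tokens = tokenize(query)  # the hint: reuse the tokenizer for the query
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())  # keep URLs present for each word
    return result


documents = {
    "http://domain.com/1": "Football news today",
    "http://domain.com/2": "Music news",
    "http://domain.com/3": "football transfer News",
}
index = build_index(documents)
print(sorted(find_all(index, "Football news")))
# ['http://domain.com/1', 'http://domain.com/3']
```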