Building a news search engine

News papers.

In this article, we will build a basic news search engine that is capable of finding news by keywords. Since this is a complex system, I will first split the system up into smaller modules. The first module is the module that retrieves all news from the internet. This module is called a scraper (or web scraper) and is written in Python. It maintains a file called the index. This is a file that contains a list of documents per keywords. For example, several documents contain the term “music”, so the index contains the term “music” and a list with references to all documents that contain the word “music”. But we will first start with our scraper.

Web scraper

For the web scraper, we will use a queue which contains pages which are about to be scraped. Once these pages are handled, these are put onto a list such that we can ensure that pages are only handled once. There are lots of difficulties which you will encounter during web scraping. One of the issues is that URLs with a hash are (statically) equivalent to URLs without a hash. For example, http://domain.com/#hashpart is statically the same as http://domain.com/. This hashpart can be removed easily:

def url_remove_hash(url):
        """
        Remove the hash part from an URL:

        http://domain.com/#hashpart --> http://domain.com/

        :param url: URL to remove the hash part from.
        :return: URL without hash part.
        """
        if '#' not in url:
            return url
        hash_index = url.index('#')
        return url[:hash_index]

Another issue is that there are two types of links: links are either relative or absolute. Relative means that the link does not start with the full domain name. For example, http://domain.com/page is an absolute URL and /page is a relative URL. If the base URL was http://domain.com/ then the absolute URL and the relative URL are the same. If the base URL would be http://test.com/, then the absolute URL using this base URL would be http://test.com/page. Making relative links absolute is done using the following method:

def compute_absolute_urls(base_url, urls):
        """
        Compute absolute URLs.

        Given base URL 'http://domain.com/', the path /123 will become 'http://domain.com/123'.

        :param base_url: The prefix of all relative URLs.
        :param urls:     A list containing absolute and relative URls.
        :return:         A list containing absolute URLs.
        """
        new_urls = set()
        if len(base_url) > 0 and base_url[-1] is not '/':
            base_url += '/'
        for url in urls:
            if url is not None and len(url) > 0:
                if '://' in url:
                    new_urls.add(url)
                else:
                    while len(url) > 0 and url[0] is '/':
                        url = url[1:]
                    new_urls.add(base_url + url)
        return list(new_urls)

Suppose we want to scrape http://domain.com/ and http://www.test.com/. Then we don’t want links starting with http://external.com/. http://domain.com/ and http://www.test.com/ are called base URLs. The following code goes through a given list of URLs and checks whether they are internal, i.e. they are starting with URLs which are given in a list called base_urls:

def remove_external_urls(base_urls, urls):
        """
        Remove URLs which are not internal.

        For example, if base_urls = ['http://domain.com', 'http://domain2.com'],
        then 'http://domain.com/page1' is internal (because there is a base URL 'http://domain.com'.
        'http://domain3.com/pageX' is external (because there is no base URL).

        :param base_urls: List of base URLs.
        :param urls:      List of URLs to filter.
        :return: Filtered list of URLs.
        """
        internal_urls = set()
        for url in urls:
            for base_url in base_urls:
                if base_url in url and url.index(base_url) is 0:
                    internal_urls.add(url)
        return internal_urls

The implementation for the queue is straightforward. All of the functionality is combined into one class called Scraper:

class Scraper:

    def __init__(self, base_urls, max_urls):
        """
        Initialize the scraper.

        :param base_urls: A list of base URLs from which we will scrape documents.
        :param max_urls:  The maximum number of documents to parse.
        :return:
        """

        # The maximum number of urls and the number of parsed urls
        self.max_urls = max_urls
        self.num_urls_parsed = 0

        # A list containing all urls which are already handled
        self.done = set()

        # The list of base urls and a queue which will be used during the parsing phase
        self.base_urls = base_urls
        self.queue = []
        self.add_to_queue(self.base_urls)

    def scrape(self):
        """
        The scrape function will scrape the links which are intern with respect to the base URLs.
        """
        while len(self.queue) > 0 and self.num_urls_parsed < self.max_urls:
            url = self.queue.pop(0)
            html = self.fetch_html(url)
            urls = self.fetch_urls(html)
            urls = self.compute_absolute_urls(url, urls)
            urls = set(self.url_remove_hash(url) for url in urls)
            urls = self.remove_external_urls(self.base_urls, urls)
            self.add_to_queue(urls)
            self.sort_queue()
            self.num_urls_parsed += 1
            print("Scraping... %s / %s (%s)" % (self.num_urls_parsed, self.max_urls, url))
        pass

    @staticmethod
    def url_remove_hash(url):
        """
        Remove the hash part from an URL:

        http://domain.com/#hashpart --> http://domain.com/

        :param url: URL to remove the hash part from.
        :return: URL without hash part.
        """
        if '#' not in url:
            return url
        hash_index = url.index('#')
        return url[:hash_index]

    @staticmethod
    def remove_external_urls(base_urls, urls):
        """
        Remove URLs which are not internal.

        For example, if base_urls = ['http://domain.com', 'http://domain2.com'],
        then 'http://domain.com/page1' is internal (because there is a base URL 'http://domain.com'.
        'http://domain3.com/pageX' is external (because there is no base URL).

        :param base_urls: List of base URLs.
        :param urls:      List of URLs to filter.
        :return: Filtered list of URLs.
        """
        internal_urls = set()
        for url in urls:
            for base_url in base_urls:
                if base_url in url and url.index(base_url) is 0:
                    internal_urls.add(url)
        return internal_urls

    @staticmethod
    def compute_absolute_urls(base_url, urls):
        """
        Compute absolute URLs.

        Given base URL 'http://domain.com/', the path /123 will become 'http://domain.com/123'.

        :param base_url: The prefix of all relative URLs.
        :param urls:     A list containing absolute and relative URls.
        :return:         A list containing absolute URLs.
        """
        new_urls = set()
        if len(base_url) > 0 and base_url[-1] is not '/':
            base_url += '/'
        for url in urls:
            if url is not None and len(url) > 0:
                if '://' in url:
                    new_urls.add(url)
                else:
                    while len(url) > 0 and url[0] is '/':
                        url = url[1:]
                    new_urls.add(base_url + url)
        return list(new_urls)

    @staticmethod
    def fetch_html(url):
        """
        Fetch HTML from a given URL.

        :param url: URL to fetch HTML for.
        :return: HTML.
        """
        import urllib.request
        try:
            html = urllib.request.urlopen(url).read()
            html = html.decode('utf-8')
        except urllib.error.HTTPError:
            html = ''
        return html

    @staticmethod
    def fetch_urls(html):
        """
        Fetch URLs from HTML.

        :param html: HTML to fetch URLs from.
        :return: List of URLs.
        """
        from bs4 import BeautifulSoup
        urls = []
        soup = BeautifulSoup(html, 'lxml')
        for link in soup.findAll('a'):
            urls.append(link.get('href'))
        return urls

    @staticmethod
    def fetch_text(html):
        """
        Fetch all text found inside HTML.

        :param html: HTML to find text in.
        :return: Found text.
        """
        import html2text
        return html2text.html2text(html)

    def sort_queue(self):
        """
        Sort the queue such that shorter URLs are handled first.
        """
        self.queue = sorted(self.queue, key=lambda url: len(url))

    def add_to_queue(self, urls):
        """
        Add URLs to the queue.

        :param urls: URLs to add.
        """
        for url in urls:
            if url not in self.done:
                self.queue.append(url)
                self.done.add(url)

Now it is easy to call the scraper:

# Scrape 5 pages from two news websites
scraper = Scraper(['http://www.bbc.com/', 'http://phys.org/'], 5)
scraper.scrape()

This will give the following output:

Scraping... 1 / 5 (http://www.bbc.com/)
Scraping... 2 / 5 (http://phys.org/)
Scraping... 3 / 5 (http://phys.org/help/)
Scraping... 4 / 5 (http://www.bbc.com/tv/)
Scraping... 5 / 5 (http://phys.org/feeds/)

Now we have the tools to scrape pages. The next step is to make the found text searchable and this will be done in our search module.

 

Search module (a.k.a. Indexer)

Now we will create our search module. In fact, this is an indexer (actually, an inverse index is build). An index can be found in almost every book. At the end of a book is a large list of terms referring to pages where the word occurs. The indexer does the same. It makes a list of words an refers to documents where words occur in. And actually, only a few things have to occur. We have to extract tokens from a text. To simplify things, we refer to tokens here as words which are separated by spaced. The text is first tokenized (in other words, the tokens are extracted) and for every token, an entry is added to the index to the corresponding URL. This results in the following code:

class Indexer:

    def __init__(self):
        self.data = {}

    def index(self, url, text):
        tokens = self.get_tokens(text)
        for token in tokens:
            if token not in self.data:
                self.data[token] = []
            if url not in self.data[token]:
                self.data[token].append(url)

    def find(self, token):
        if token in self.data:
            return self.data[token]
        return []

    @staticmethod
    def get_tokens(text):
        return text.split()

Combining the modules

In order to make the two modules work together, a slight modification must be made to our Scraper module. Change the scrape method to the following:

def scrape(self, indexer):
        """
        The scrape function will scrape the links which are intern with respect to the base URLs.
        """
        while len(self.queue) > 0 and self.num_urls_parsed < self.max_urls:
            url = self.queue.pop(0)
            html = self.fetch_html(url)
            urls = self.fetch_urls(html)
            urls = self.compute_absolute_urls(url, urls)
            urls = set(self.url_remove_hash(url) for url in urls)
            urls = self.remove_external_urls(self.base_urls, urls)
            text = self.fetch_text(html)
            indexer.index(url, text)
            self.add_to_queue(urls)
            self.sort_queue()
            self.num_urls_parsed += 1
            print("Scraping... %s / %s (%s)" % (self.num_urls_parsed, self.max_urls, url))

And now everything is working together! In order to find sport related URLs, you just have to use the following piece of code:

# Setup the indexer
indexer = Indexer()

# Scrape some pages from two news websites
scraper = Scraper(['http://www.bbc.com/', 'http://phys.org/'], 20)
scraper.scrape(indexer)

# Display all found sport articles
print('Found sport documents:')
for url in indexer.find('sport'):
    print(url)

The result was the following:

Scraping... 1 / 20 (http://www.bbc.com/)
Scraping... 2 / 20 (http://phys.org/)
Scraping... 3 / 20 (http://phys.org/help/)
Scraping... 4 / 20 (http://www.bbc.com/tv/)
Scraping... 5 / 20 (http://phys.org/feeds/)
Scraping... 6 / 20 (http://www.bbc.com/news)
Scraping... 7 / 20 (http://www.bbc.com/cbbc)
Scraping... 8 / 20 (http://phys.org/weblog/)
Scraping... 9 / 20 (http://phys.org/search/)
Scraping... 10 / 20 (http://www.bbc.com/sport)
Scraping... 11 / 20 (http://www.bbc.com/urdu/)
Scraping... 12 / 20 (http://www.bbc.com/food/)
Scraping... 13 / 20 (http://www.bbc.com/autos)
Scraping... 14 / 20 (http://www.bbc.com/arts/)
Scraping... 15 / 20 (http://www.bbc.com/earth)
Scraping... 16 / 20 (http://www.bbc.com/news/)
Scraping... 17 / 20 (http://phys.org/archive/)
Scraping... 18 / 20 (http://www.bbc.com/cbbc/)
Scraping... 19 / 20 (http://www.bbc.com/local/)
Scraping... 20 / 20 (http://www.bbc.com/hausa/)

Found sport articles:
http://www.bbc.com/sport

So, now our basic news search engine indeed has found the correct URL!

Improvements

What are the next steps? Consider the tokenizer. If you really want to do a good job, dive into a text mining book and learn about tokenization. A simple improvement (but this is discussion material), is to make all tokens lowercase. Then “Cat” and “cat” are both found when searching on “Cat”. Also the scraper has some problems. A website could implement so called spider traps, where the crawler could be trapped by following an infinite number of links. Luckily, there are many open source crawlers which have implemented features to avoid this.

Implementation

The full implementation of the news search engine can be found on GitHub.

Exercises

Try to implement an tokenizer that makes all tokens lowercase. Also try to implement a better “find” method that can also find multiple words (hint: use the tokenizer again!).

Kevin Jacobs

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: mail@kevinjacobs.nl.