The Mathematics behind Search Engines
In this short post, I will give an overview of some technology that is used by well-known search engines.
Introduction
Chances are high that you found this webpage by using a search engine. Think about that (and imagine the following if you ended up here by other means). A couple of seconds ago, you thought of some keywords, you entered them in a search engine and you clicked a link. The search engine provided you this webpage in just a few milliseconds! What is the mechanism that enabled such a high speed?
Information retrieval
For the described search task, the search engine uses techniques developed in a field called Information Retrieval. Now imagine you are typing in the words “Leonardo da Vinci” in your favorite search engine. Many search engines display information about the painter (like his birth and death) but also information about related concepts. One core concept you find in almost any search engine nowadays, is a so-called Knowledge Graph. In this graph, entities like “Leonardo da Vinci”, his famous “Mona Lisa” and other artists like “Vincent van Gogh” are interconnected. Entities are the smallest possible things that can not be divided in smaller parts. Each of these entities has attributes (like the date of birth and a name) but these attributes are different for different entities. For example, painters have a date of birth. But buildings like “The White House” probably does not have a date of birth. Furthermore, these entities also have types. “Vincent van Gogh” and “Leonardo da Vinci” both have a type “Person”, but “The White House” does not belong to that type. In summary, a Knowledge Graph consists of nodes (entities) and arcs (relations between entities) and the entities have attributes and types.
Knowledge graph extraction is done in a subfield called Natural Language Processing. In this field, there are many topics. Bayes’ rule is a famous formula and is a naive approach for many tasks. One such task is spam detection. In this article, you will learn how to build your own spam detector using Python (and Bayes’ rule). A different topic is the topic of spelling correction. In this article you will learn more on spelling correction.
It is even possible to use different representation schemes in which less or more features are used for entities. One very simplistic model only contains entities and relations among the entities. Here, attributes and types are discarded.
In the next weeks, I will post more information about Knowledge Graphs and how these graphs can be utilized for practical applications. If you have any questions or if you would like more information about some topic related to Knowledge Graphs, please let me know!