The Linear Algebra Behind Search Engines - The Challenges of Web Information Retrieval

Amy Langville
New information retrieval methods, in addition to those described above, are often used to search the Web. The Web's hyperlink structure, a structure not present in a standard information retrieval repository of documents, provides additional information that can be exploited in the retrieval process. Google was one of the first commercial search engines to recognize the importance of the Web's hyperlink structure. The underlying model for the successful Google engine is a Markov chain. (See Langville & Meyer, 2005 for an introduction to web retrieval methods.)

There are other information retrieval challenges in addition to those briefly mentioned already, such as polysemy, synonymy, query language, and speed. The Web presents its own unique challenges -- it can be viewed as one huge database with these unique properties:

  • large amounts of volatile data (rapid updates, broken links, file disappearances),
  • an exponentially growing number of pages,
  • heterogeneous data (multiple formats, languages, alphabets),
  • lack of structure,
  • redundant data,
  • a lack of an editorial review publication process, which leads to numerous information errors, falsehoods, and invalid statements.

Furthermore, some advertisers try to increase traffic to their pages by taking elaborate measures to fool automated search engines. For example, an advertiser might label sports, beer, and swimsuits as keywords in its subject tag, thinking these topics are likely to arouse Web surfers' interests, while the true content of the page is classical clocks, a much less alluring topic. In fact, there are even companies whose sole purpose and means of profit is the manipulation of search engines. There are also search engines whose owners sell companies the right to be listed at the top of the retrieved documents list for particular queries (Marchiori, 1997).

There is an additional Web information retrieval challenge related to precision. Although the amount of accessible information continues to grow, user ability to look at documents has not. Users rarely look beyond the first 10 or 20 documents retrieved, and this user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing.

Broken links also pose a special problem. Often the most appealing retrieved document turns out to be a broken link, which quickly causes user frustration. Most search engines use a "web crawling" technique to gather and store information about individual Web pages and documents, thus, in effect, removing broken links from their database of retrievable documents. 

Quiz: Test your reading comprehension

Quiz: Test your Web search engine skills