The Information Age has flooded readers with information. However, these gains in the amount of information available are useless without parallel gains in techniques for effectively storing and searching through such information. Consider the following facts:
Amy Langville is an Assistant Professor of Mathematics at the College of Charleston. She developed this module as a postdoctoral researcher at North Carolina State University.
This sample deluge of facts shows how easy it is to feel overwhelmed by the amount of available information, an amount that is growing at an exponential rate.
The pressing need for organizing and sorting through this mushrooming amount of information presents an enormous challenge, one whose solution is of paramount importance. The entire field of information retrieval is devoted to meeting this challenge. Traditional information retrieval studies the science of search in traditional, static, well-controlled document collections. Web information retrieval deals with the world's largest document collection, the World Wide Web, a nontraditional, dynamic, uncontrolled collection.
In this module, you will learn about the field of information retrieval and the basic components of search engines. We will focus on one particular search engine model, the Vector Space Model, which incorporates basic concepts from linear algebra. We will also discuss a more advanced vector space model, Latent Semantic Indexing (LSI), which uses even more linear algebra to improve search results.
There are many search engine techniques. As of June 2000, there were at least 3,500 different search engines on the Web (Bradley, 2002). Due to the proprietary nature of the field of information retrieval, it is difficult to say which techniques are most prevalent in industrial search engines such as Yahoo, Excite, or Overture. Search engine creators loathe sharing their innovative ideas with other vendors, as this could jeopardize millions of dollars in revenue. However, without a doubt, some search engines are more successful than others, an example being Google.
I outline here just a few of the most basic information retrieval techniques, in order to provide a context for the techniques we will study in more detail. Specifics of some of these techniques can easily become very complicated and are, in general, hard to come by, since many vendors refuse to share in this competitive environment.
The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses exact matching: a document is judged "relevant" to a user "query" (information request) only if it contains the words of the query exactly. More refined descendants of this model are still used by most libraries.
The adjective "Boolean" refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two logical statements x and y means that both x AND y must be satisfied, while the Boolean OR of these same two statements means that at least one of these statements must be satisfied. Any number of logical statements can be combined using the three Boolean operators.
The Boolean information retrieval model considers which keywords are present or absent in a document or title. Thus, a document is judged as relevant or irrelevant -- there is no concept of a "partial match" between documents and queries. The inability to identify partial matches can lead to poor performance (Baeza-Yates & Ribeiro-Neto, 1999). Other more advanced set-theoretic techniques, such as the so-called "fuzzy sets", try to remedy this black-and-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine causes the engine to return all documents that have both words in the title. As a result, an apparently relevant document entitled "Automobile Maintenance" will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
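To make the exact-match idea concrete, here is a minimal sketch (in Python) of a Boolean AND title search over a tiny collection. The titles, the query, and the function name are hypothetical, and a real engine would use an inverted index rather than scanning every title.

```python
# Minimal sketch of Boolean (exact-match) retrieval over document titles.
# The titles and the query are hypothetical illustrations.

titles = {
    1: "Car Maintenance Basics",
    2: "Automobile Maintenance",
    3: "Car Stereo Installation",
}

def boolean_and_search(query_terms, titles):
    """Return the ids of titles containing every query term (Boolean AND)."""
    results = []
    for doc_id, title in titles.items():
        words = set(title.lower().split())
        if all(term.lower() in words for term in query_terms):
            results.append(doc_id)
    return results

print(boolean_and_search(["car", "maintenance"], titles))
# [1] -- document 2, "Automobile Maintenance", is missed because "car" does not
# literally appear in its title, illustrating the synonymy problem.
```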
The car maintenance query example illustrates the main drawbacks of Boolean search engines -- they fall prey to two of the most common information retrieval problems, synonymy and polysemy.
Many Boolean search engines also require the user to be familiar with Boolean operators and the specialized syntax of the engine. For example, to find information about the phrase iron curtain, many engines require quotation marks around the phrase, which tell the search engine that the entire phrase should be searched as if it were just one keyword. A user who forgets this syntax requirement would be surprised to find retrieved documents about interior decorating and mining for iron ore.
Nevertheless, variants of the Boolean model do form the basis for many search engines. There are several reasons for their prevalence:
Baeza-Yates & Ribeiro-Neto (1999), Frakes & Baeza-Yates (1992), and Korfhage (1997) all contain chapters with excellent introductions to the Boolean model and its extensions.
Another information retrieval technique uses the vector space model (Salton, 1971), developed by Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems of the Boolean model. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis techniques to discern key features and connections in the document collection.
Some advanced vector space models address the common text analysis problems of synonymy and polysemy. Models such as Latent Semantic Indexing (LSI) (Dumais, 1991) can access the hidden semantic structure in a document collection. For example, an LSI engine processing the query car will return documents whose keywords are related semantically (in meaning), e.g., automobile. This ability to reveal hidden semantic meanings makes vector space models such as LSI very powerful information retrieval tools.
Two additional advantages of the vector space model are relevance scoring and relevance feedback. The model allows documents to match a query partially by assigning each document a number between 0 and 1, which can be interpreted as the likelihood of relevance to the query. The group of retrieved documents can then be sorted by degree of relevancy, a luxury not possible with the Boolean model. Thus, vector space models return documents in an ordered list, with the first document judged to be most relevant to the user's query. Some vector space search engines report the relevance score as a relevancy percentage. For example, a 97% next to a document means that document is judged as 97% relevant to the user's query.
Example. The Federal Communications Commission's search engine, which is powered by Inktomi, was known at one time to use a vector space model. Enter a query such as taxes and notice the relevancy score reported on the right side.
Relevance feedback is an information retrieval tuning technique that is a natural addition to the vector space model. It allows the user to select a subset of the retrieved documents that are useful and then resubmit with this additional relevance feedback information to obtain a revised set of generally more useful retrieved documents.
The drawbacks of vector space models are their computational expense and poor scalability. At query time, distance measures (also known as similarity measures ) must be computed between each document and the query, and advanced models such as LSI require an expensive singular value decomposition of a large matrix that numerically represents the entire document collection. As the collection grows, the expense of this matrix decomposition becomes prohibitive.
The informative little book by Berry & Browne (1999) provides an excellent explanation of vector space models, especially LSI, and contains several examples and sample code. I provide a much more condensed explanation on the Example Search Engine Method page.
Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance -- the ratio of the probability that the document is relevant to the probability that the document is not relevant to the query. This model operates recursively and requires that the underlying algorithm guess at initial parameters, then iteratively try to improve this initial guess to obtain a final ranking of relevancy probabilities. Improvements to the basic probabilistic model of information retrieval are made with Bayesian inference techniques (Baeza-Yates & Ribeiro-Neto, 1999).
Unfortunately, probabilistic models can be very hard to build and program. Their complexity grows quickly, deterring many researchers and limiting the scalability of the engine based on the model. Probabilistic models also require several unrealistic simplifying assumptions, such as independence between terms as well as between documents. For instance, in this document the most likely word to follow information is the word retrieval, so it is not reasonable to assume that these terms are independent.
On the other hand, the probabilistic framework can naturally accommodate a priori preferences, and thus these models do offer promise of tailoring search results to the preferences of individual users. For example, a user's query history can be incorporated into the probabilistic model's initial guess, which generates better query results than a democratic guess.
Meta-search engines are based on the principle that while one search engine is good, two (or more) are better. One search engine may be great at a certain task, while a second search engine is better at another task. Thus, meta-search engines such as Copernic and SurfWax were created to simultaneously exploit the best features of many individual search engines. Meta-search engines send the user's query to several search engines at once and return the results from all of the engines in one long unified list. Some meta-search engines also include subject-specific search engines, which can be helpful when searching within one particular discipline. For example, Monster is an employment search engine.
Furthermore, some advertisers try to increase traffic to their pages by taking elaborate measures to fool automated search engines. For example, an advertiser might label sports, beer, and swimsuits as keywords in its subject tag, thinking these topics are likely to arouse Web surfers' interests, while the true content of the page is classical clocks, a much less alluring topic. In fact, there are even companies whose sole purpose and means of profit is the manipulation of search engines. There are also search engines whose owners sell companies the right to be listed at the top of the retrieved documents list for particular queries (Marchiori, 1997).
There is an additional Web information retrieval challenge related to precision. Although the amount of accessible information continues to grow, users' willingness to examine retrieved documents has not. Users rarely look beyond the first 10 or 20 documents retrieved, and this user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing.
Broken links also pose a special problem. Often the most appealing retrieved document turns out to be a broken link, which quickly causes user frustration. Most search engines use a "web crawling" technique to gather and store information about individual Web pages and documents, thus, in effect, removing broken links from their database of retrievable documents.
Quiz: Test your reading comprehension
Quiz: Test your Web search engine skills
Of the basic models of information retrieval, we focus in this project on the Vector Space Model (VSM) because it has the strongest connection to linear algebra. This model and its more advanced version, Latent Semantic Indexing (LSI), are beautiful examples of linear algebra in practice.
Let's begin with a trivial example. Consider the database of articles for a hypothetical periodical called Food! magazine. The articles are distinguishable by only three keywords: desserts, breads, and vegetables. One article (document 1) talks only about desserts. Another article (document 2) discusses only breads. A third article (document 3) is 30% about vegetables and 70% about breads. All divisions of article space among the three keywords are possible. Vectors and matrices are used to describe how a vector space search engine processes a query in the Food! database. Note: There are numerous matrix element weighting schemes for VSM methods. Here we are using the proportion of the document dealing with each corresponding term, but various other weights, such as the frequency of a term's usage in a document, could be used (Jones & Furnas, 1987).
A is a 3-by-7 matrix. If there were, say, 20 keywords and 1000 documents, then A would be a 20-by-1000 matrix. A is called a term-by-document matrix because the rows correspond to terms (keywords) in the database and the columns correspond to documents. Since there are only three terms in our Food! example, we can plot each column (document) vector in the matrix A in three-dimensional space. A static plot of the seven document vectors is shown in Figure 1 -- to see an animated version, click on the figure. When you are ready to return here, close the animation window.
Let's analyze this plot to see if it gives us any insight into the information retrieval problem of searching the Food! database to find the document closest to a given query. Suppose a user of this Food! search engine wants to find all articles related to a query of vegetables. Then our query vector is q = (0, 0, 1)T, where the components correspond to desserts, breads, and vegetables, respectively. Now the problem is mathematical: Find which documents represented by the vectors (d1, d2, d3, d4, d5, d6, d7) are "closest" to this query vector q. You might try to inspect the 3-D plot of the document vectors and the query vector visually (Figure 2) to see which document vectors are "closest." This visual inspection can be difficult, since the display of a 3-D plot must be done on a 2-D screen -- but, as with Figure 1, you can click on the figure to see an animated version. Can you guess which document vectors are closest to the query vector? Why did you make that choice?
The information contained in a multi-dimensional vector can be summarized in two one-dimensional measures, length and angle with respect to a fixed direction. The length of a vector is the distance from the tail to the head of the vector. The angle between two vectors is the measure (in degrees or radians) of the angle between those two vectors in the plane that they determine. Keeping the length and angle (relative to the query vector) of the Food! vectors in mind, try to answer this question:
Visually, what is it about document 4's vector that makes it most similar to the query vector?
Clearly, it is not the length of the vector -- document 1's vector has the same length as the query vector, but it does not appear to be physically "closest" to the query vector. Rather, the angle of the vector with the query vector tells us that document 4's vector is closest to the query vector. For each document in the Food! database, we can use one number, the angle between the document vector and the query vector, to capture the physical "distance" of that document from the query. The document vector whose direction is closest to the query vector's direction (i.e., for which the angle is smallest) is the best choice, yielding the document most closely related to the query.
We can compute the cosine of the angle θ between the nonzero vectors x and y by cos θ = xTy / (||x||2 ||y||2), where xTy is the dot product of x and y and ||·||2 denotes the Euclidean length of a vector.
If the cosine of an angle is 1, then the angle between the document vector and the query vector measures 0°, meaning the document vector and the query vector point in the same direction. A cosine measure of 0 means the document is unrelated to the query, i.e., the vectors are perpendicular. Thus, a cosine measure close to 1 means that the document is closely related to the query.
Note: Since all entries in the vectors are nonnegative, the dot products are nonnegative, which means there is no need to take absolute value of the cosine measure.
We now consider a second query, this time desiring articles with an equal mix of information about desserts and breads, so the new query vector is q2 = (0.5, 0.5, 0)T. The cosines of the angles between the document vectors and this query vector are shown in Table 2.
We see that document 6 is now most relevant to the query -- as expected, since d6 is identical to query vector 2 in the divisions of topics discussed. Document 5 is the second most relevant document to query 2, followed by document 7.
In practice, a search engine system must designate a cutoff value. Only those documents whose cosine measures with the query vector exceed this cutoff value are reported in the list of retrieved documents viewed by the user. In Figure 3 we show (in blue) the cone of all vectors whose angle with the query vector is 45°. The cosine of this angle is 0.7071, and thus, if we want to return only documents whose vectors lie inside (or on) the cone, the cutoff value is 0.7071. (As before, click on the figure to see an animated version.)
Now let's review how the VSM processes a query q. First, the cosine of the angle between each di and q must be computed. Those cosine measures exceeding some predetermined cutoff value are ranked, and the ranked list is presented to the user. For practical collections, much larger than this trivial 3-term, 7-document collection, computing these similarity measures is the main computational expense.
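To make these steps concrete, here is a small sketch (in Python) of the VSM computation on a Food!-like collection. Only documents 1 through 3 are specified in the text above, so the remaining four columns of the matrix are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Rows: desserts, breads, vegetables. Columns: documents d1..d7.
# Only d1, d2, d3 are described in the text; the other columns are hypothetical.
A = np.array([
    [1.0, 0.0, 0.0, 0.1, 0.4, 0.5, 0.3],   # desserts
    [0.0, 1.0, 0.7, 0.2, 0.4, 0.5, 0.3],   # breads
    [0.0, 0.0, 0.3, 0.7, 0.2, 0.0, 0.4],   # vegetables
])

q = np.array([0.0, 0.0, 1.0])               # query: vegetables

def cosine_scores(A, q):
    """Cosine of the angle between q and each column of A."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    return (A.T @ q) / norms

scores = cosine_scores(A, q)
cutoff = 0.7071                              # cos(45 degrees)
ranked = sorted(
    ((score, i + 1) for i, score in enumerate(scores) if score >= cutoff),
    reverse=True,
)
for score, doc in ranked:
    print(f"d{doc}: {score:.3f}")
# With these illustrative entries, only document 4 survives the 45-degree cutoff.
```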
Exercise: Create Your Own 3-D Physical Search Engine
The concept of angles in 3-D can be extended to general n-dimensional (n-D) spaces. In n-D spaces, each vector has n components. Again, we want to summarize the information in an n-D vector into one number, so that we can quickly compare the information contained in any set of vectors. Even though n-D vectors are impossible to visualize for n > 3, analogous concepts of length and angle exist in n-D space just as they do in 3-D space. This mathematical extension enables us to talk about a real search engine, such as Inktomi.
There are about 300,000 words in the English language (Berry, Drmac & Jessup, 1999). Thus, the query and document vectors are potentially 300,000-D vectors. For example, a search engine query on beach volleyball would create the 300,000-D query vector q = (0, …, 0, 1, 0, …, 0, 1, 0, …, 0)T,
where the first "1" corresponds to the English word beach and the second "1" corresponds to the word volleyball.
Each component in the 300,000-D vector corresponds to a word in the English language. Thus, the first position in the vector corresponds to the first word in the English dictionary, aardvark, and the last position corresponds to zymurgy, the last word in the dictionary. If the fourth position contains a 1, then a search should be run for the fourth word in the English language. Similarly, a 0 in the fourth position means the fourth word is irrelevant to the search. In general, if the ith position in the vector contains a 1, then the ith word in the English language should be searched for. On the other hand, if the ith position contains a 0, the ith word in the English language is irrelevant to the search. This is the procedure for translating a group of query words into an equivalent mathematical representation using the query vector.
The same is done for the documents on the Web, translating them into mathematical representations. Each document on the Web has a list of keywords associated with it. These keywords are used to create a document vector for that document. For example, a document entitled "Will Shaq's Free Throws Stop the Lakers" might designate basketball, playoffs, and NBA as its keywords. This document's vector contains three 1's in the positions corresponding to its three keywords and 0's elsewhere.
Search engines often apply weighting schemes to keywords. The scheme just described is unweighted, since each keyword gets the value 1 in the document vector. However, in a weighted scheme, the NBA playoff document might give more weight to playoff than basketball, and thus the entry corresponding to playoff might be 5 and the entry for basketball might be 2.
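The translation from words to vectors described above can be sketched as follows. The miniature vocabulary, the keyword lists, and the weights are hypothetical stand-ins for the 300,000-word dictionary and a real weighting scheme.

```python
import numpy as np

# Hypothetical miniature vocabulary standing in for the full English dictionary.
vocabulary = ["aardvark", "basketball", "beach", "nba", "playoff", "volleyball", "zymurgy"]
index = {word: i for i, word in enumerate(vocabulary)}

def query_vector(words):
    """Unweighted query vector: 1 in each position whose word appears in the query."""
    q = np.zeros(len(vocabulary))
    for w in words:
        if w in index:
            q[index[w]] = 1.0
    return q

def document_vector(weighted_keywords):
    """Weighted document vector: each keyword gets its assigned weight."""
    d = np.zeros(len(vocabulary))
    for w, weight in weighted_keywords.items():
        d[index[w]] = weight
    return d

q = query_vector(["beach", "volleyball"])
d = document_vector({"basketball": 2, "playoff": 5, "nba": 1})
print(q)   # [0. 0. 1. 0. 0. 1. 0.]
print(d)   # [0. 2. 0. 1. 5. 0. 0.]
```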
You have already encountered a keyword weighting scheme in the Food! collection example, where the weights were percentages of the article devoted to a particular keyword. In practice, numerous clever weighting and frequency schemes are implemented in an effort to make the search more accurate (Berry & O'Brien, 1998, Dumais, 1991, Jones, 1972). In addition, various other 1-D measures besides the cosine angle measure for capturing the distance between two vectors have been proposed (Jones & Furnas, 1987).
This n-D VSM must proceed through the same steps as the 3-D VSM example (page 5) in order to process a query. First, the angle measure between q and each document vector di in the collection must be computed. With three billion documents on the Web, this means three billion angle measures must be calculated for each query. Then those cosine measures exceeding the cutoff value are ranked, and this ranked list is presented to the user. And all this must be done in a fraction of a second, which leads us to the next page.
The vector space model provides the framework for most information retrieval algorithms used today. However, this most basic vector space model alone is not efficient enough. Many modifications and heuristics have been invented to speed up the basic model, giving rise to a popular model called the Latent Semantic Indexing (LSI) model (Berry, Drmac & Jessup, 1999).
The major mathematical entity in the LSI model is also a matrix. We have already discussed how the vector space model described in preceding pages can be viewed as a matrix. Now let's adjoin the 3 billion document vectors associated with the Web (each of which is a 300,000-D vector) to one another to form a 300,000-by-3,000,000,000 matrix, which is called the term-by-document matrix. Call this matrix A: its (i,j)-element contains a 1 if the word for position i is a keyword of document j and a 0 otherwise. Column 1 of A is document 1's document vector, column 2 is document 2's vector, and so forth.
This huge term-by-document matrix may contain redundant information. Berry, Drmac & Jessup (1999) provide the following example:
There may be a document on the Web with keywords applied linear algebra and another document on the Web with keywords computer graphics. There might also be a document with keywords linear algebra applied [to] computer graphics. Clearly, the vectors for the first two documents sum to the vector for the third document.
In linear algebra, this is called dependence among columns of the matrix. This dependence among columns may be frequent, since there are so many documents on the Web. Fortunately, this dependence in the term-by-document matrix can be used to reduce the work required in information retrieval.
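This kind of dependence is easy to check numerically. The sketch below builds the three document vectors from the quotation over a hypothetical five-word vocabulary and confirms that the three columns span only a two-dimensional space.

```python
import numpy as np

# Hypothetical vocabulary: applied, linear, algebra, computer, graphics.
d1 = np.array([1, 1, 1, 0, 0])   # "applied linear algebra"
d2 = np.array([0, 0, 0, 1, 1])   # "computer graphics"
d3 = d1 + d2                     # "linear algebra applied to computer graphics"

A = np.column_stack([d1, d2, d3])
print(np.linalg.matrix_rank(A))  # 2 -- the third column depends on the first two
```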
Here is where the LSI model departs from the vector space model. In most general terms, the LSI method manipulates the matrix to eradicate dependencies and thus consider only the independent, smaller part of this large term-by-document matrix. In particular, the mathematical tool used to achieve the reduction is the truncated singular value decomposition (SVD) of the matrix.
Note: We introduce the SVD briefly in what follows; Berry, Drmac & Jessup (1999) and Meyer (2000) contain detailed explanations of the SVD of a matrix.
Basically, a truncated SVD of the term-by-document matrix reduces the m-by-n matrix A to an approximation Ak that is stored compactly with a smaller number of vectors. The subscript k represents how much of the original matrix A we wish to preserve. A large value of k means that Ak is very close to A. However, the value of this good approximation must be weighed against the desire for data compression. In fact, the truncated representation Ak requires the storage of exactly k scalars, k m-dimensional vectors and k n-dimensional vectors. Clearly, as k increases, so do the storage requirements.
Let's pause now for a formal definition of the SVD, followed by a numerical example. Before defining the SVD, we also review some other matrix algebra terms. The following definitions and theorems are taken from Meyer (2000).
Definition The rank of an m-by-n matrix A is the number of linearly independent rows or columns of A.
Definition An orthogonal matrix is a real n-by-n matrix P whose columns (or rows) constitute an orthonormal basis for Rn. That is, the set of column (or row) vectors {p1, p2, …, pn} has the property that piTpj = 1 if i = j and piTpj = 0 otherwise.
Theorem For each m-by-n matrix A of rank r, there are orthogonal matrices Um-by-m and Vn-by-n and a diagonal matrix Dr-by-r = diag(σ1, σ2, …, σr) such that A = USVT, where S is the m-by-n matrix whose upper-left r-by-r block is D and whose remaining entries are zero, and σ1 ≥ σ2 ≥ … ≥ σr > 0.
Definition The σi's in the preceding theorem are called the nonzero singular values of A. When r < p = min(m,n), A is said to have p - r additional zero singular values. The factorization in the theorem is called the singular value decomposition (SVD) of A, and the columns of U (denoted ui) and the columns of V (denoted vi) are called the left-hand and right-hand singular vectors of A, respectively. Written another way, the SVD expresses A as a sum of rank-1 matrices:
A = σ1u1v1T + σ2u2v2T + … + σrurvrT.
Definition The truncated SVD of A is Ak = σ1u1v1T + σ2u2v2T + … + σkukvkT, the sum of the first k terms of this expansion, for some k ≤ r.
Theorem (Meyer, 2000, p. 417) If σ1 ≥ σ2 ≥ … ≥ σr are the nonzero singular values of Am-by-n, then for each k < r, the distance (in the matrix 2-norm) from A to the closest rank-k matrix is ||A − Ak||2 = σk+1.
Thus, σk+1 is a measure of how well Ak approximates A. In fact, for Am-by-n of rank r, Ak is the best rank-k matrix approximation to A (Eckart & Young, 1936). The goal in using Ak to approximate A is to choose k large enough to give a good approximation, yet small enough (k << r) that Ak requires far less storage than A.
Figure 6 shows the storage savings obtained with the truncated SVD in general.
A numerical example may make these concepts clearer. Suppose the original term-by-document matrix A is 100-by-300. After a truncated SVD, suppose A20 approximates A well, i.e., we take k = 20. To calculate the storage savings, we count the number of elements that must be stored in both representations -- the full 100-by-300 matrix A (30,000 elements) versus the truncated SVD representation, which requires storage of a 20-by-1 vector, a 100-by-20 matrix and a 20-by-300 matrix (20 × 1 + 100 × 20 + 20 × 300 = 8,020 elements). The number of elements in the full matrix A representation is almost 4 times greater than the number of elements required by the truncated SVD.
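Both the storage count and the approximation error can be verified with a few lines of code. The matrix below is random and purely illustrative; the point is the element count and the Eckart & Young error bound, not the quality of the approximation for this particular matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 300))          # illustrative 100-by-300 matrix
k = 20

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD A_k

full_storage = A.size                         # 30,000 elements
truncated_storage = k + 100 * k + k * 300     # 8,020 elements
print(full_storage, truncated_storage)

# Eckart & Young: the 2-norm error of the best rank-k approximation is sigma_{k+1}.
print(np.linalg.norm(A - Ak, 2), s[k])        # the two values agree
```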
Larger data collections often yield much greater storage savings. The Medlars collection uses a 5831 × 1033 term-by-document matrix, which can be very well approximated by the truncated SVD with k = 70, requiring roughly 13 times less storage than the full matrix.
We close this section with a small example, taken from Berry & Browne (1999), showing that the truncated SVD can well approximate a matrix. Suppose A is the 9 × 7 matrix
As k increases, the truncated SVD better captures the essence of A, yet the storage savings decrease. For example, here is A6:
Notice that A6 does a much better job of capturing the relationship between the vectors of A than A3 does. Now compare the ranking of document vectors using A with the ranking obtained using A6. Also compare the ranking from A3 with that from A6.
With the truncated SVD, the cosine measures can be computed directly from the stored factors. Let Uk and Vk contain the first k left-hand and right-hand singular vectors, let Dk = diag(σ1, …, σk), and write si = DkVkTei for the ith column of Ak expressed in the reduced basis, so that column i of Ak is Uksi. Then cos θi = siT(UkTq) / (||si||2 ||q||2), since ||Uksi||2 = ||si||2 for the matrix Uk with orthonormal columns.
It's helpful to review the above equation, emphasizing the advantages of the truncated SVD. First, Uk, si and ||si||2 only need to be computed once and can be reused for each query q. At runtime, only UkTq and ||q||2 must be computed. The vectors in Uk and Vk are computed once, and these can be determined without computing the entire SVD of A, since there is an algorithm for computing the first k singular values and vectors iteratively (Golub & Van Loan, 1996).
In practice, k can be much smaller than r and yet Ak still approximates A well. One reason for this concerns the noise in A due to polysemous and synonymous words in the English language. Truncating the SVD reduces this noise. Another reason why k << r works well in information retrieval applications is that numerical accuracy is of little concern. For example, three digits of precision in the cosine measure may be sufficient. That is, a cosine measure of .231 is just as meaningful as the more numerically precise result of .23157824. The truncated SVD enables storage benefits and still maintains adequate accuracy in the angle measure calculations.
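Putting the pieces together, here is a minimal sketch of LSI query processing under the assumptions above. It reuses the hypothetical 3-term, 7-document matrix from the earlier VSM sketch, and the value k = 2 is chosen arbitrarily.

```python
import numpy as np

def lsi_factors(A, k):
    """Precompute, once per collection: Uk, the reduced document
    representations si = Dk Vk^T ei (the columns of S), and their norms."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    S = np.diag(s[:k]) @ Vt[:k, :]       # column i of S is s_i
    return Uk, S, np.linalg.norm(S, axis=0)

def lsi_scores(Uk, S, s_norms, q):
    """Per query: compute Uk^T q and ||q||, then combine with the stored factors."""
    return (S.T @ (Uk.T @ q)) / (s_norms * np.linalg.norm(q))

# Hypothetical 3-term, 7-document matrix and the vegetables query from before.
A = np.array([
    [1.0, 0.0, 0.0, 0.1, 0.4, 0.5, 0.3],
    [0.0, 1.0, 0.7, 0.2, 0.4, 0.5, 0.3],
    [0.0, 0.0, 0.3, 0.7, 0.2, 0.0, 0.4],
])
q = np.array([0.0, 0.0, 1.0])

Uk, S, s_norms = lsi_factors(A, k=2)
print(np.round(lsi_scores(Uk, S, s_norms, q), 3))   # cosine scores for d1..d7
```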
Exercise: Experiment with an LSI Search Engine built in Matlab
Many search engines are beginning to incorporate structured text retrieval, meaning that users can search for text in italics, capitals, or boldface print. Google currently incorporates some structured retrieval. Users can choose to search for keywords located in the title, body, or hyperlinks of documents. Google's advanced features include searching for HTML documents, PostScript documents, or Excel documents, among other choices. Hotbot allows users to search for images, video, or MP3-formatted data, in addition to ordinary text. Its advanced features also include date information. For example, users can choose to retrieve only recent documents, those updated in the last two weeks, two months, or two years.
Another recent trend in search engine research is to use syntactic clues and query structure to improve retrieval results. These methods incorporate the location of the keywords in the query, the use of prepositions, verbs, and adjectives, hyperlink structure on document pages, and citation links to enhance retrieval. Other research efforts include accommodations for errors and typos in the query, using the concepts of pattern matching and edit distance.
Most keyword search engines, such as the library search engine for N. C. State University, allow for wildcards. For example, you can enter computer * to retrieve books about computer-aided analysis, the computer age, computer circuits, and many other such topics.
One of the most exciting applications of information retrieval is image retrieval. True image retrieval uses mathematical models to capture how similar images are to an original image without using text hints, such as HTML image tags. Soon users will be able to retrieve images in addition to text documents. One application of image retrieval is locating possible suspects for a crime by searching through a criminal photo ID database electronically to find those criminals whose photo profiles most completely match the sketch rendered by the crime scene artist. This can be further extended to sound retrieval. Searching through audio files is as challenging as image retrieval, but may be possible in the near future.
To learn more, explore the online resources and print references listed on the next page.