Arden Dertat » Search Engines, Information Retrieval and Machine Learning

Implementing Search Engines
Welcome to my ‘how to implement a search engine’ series. I describe how to implement an actual search engine with working code in Python. Here you can find pointers to the original detailed articles, and I will continue to share more in this area.

1. Create Index
Building the index of our search engine.

2. Query Index
Answering search queries on the index that we built.

3. Ranking
Ranking the search results.

How to Implement a Search Engine Part 3: Ranking tf-idf

Overview
We have come to the third part of our implementing a search engine project: ranking. The first part was about creating the index, and the second part was about querying the index. We basically have a search engine that can answer search queries on a given corpus, but the results are not ranked. Now we will add ranking to obtain an ordered list of results, which is one of the most challenging and interesting parts. The first ranking scheme we will implement is tf-idf. In the following articles we'll analyze a variant of tf-idf, and we will also implement Google's PageRank. Then we will explore Machine Learning techniques such as Support Vector Machines (SVM) and so forth.

Tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf).  The terms with higher weight scores are considered to be more important. It’s one of the most popular weighting schemes in Information Retrieval.

Term Frequency – tf
Let’s first define how term frequency is calculated for a term t in document d. It is basically the number of occurrences of the term in the document.

tf_{t,d} = N_{t,d}

We can see that as a term appears more often in the document, it becomes more important, which is logical. However, there is a drawback: by using term frequencies we lose positional information. The ordering of terms doesn't matter; only the number of occurrences does. This is known as the bag of words model, and it is widely used in document classification. In the bag of words model, the document is represented as an unordered collection of words. However, this doesn't turn out to be a big loss. Of course we lose the semantic difference between “Bob likes Alice” and “Alice likes Bob”, but we still get the general idea.

We can use a vector to represent the document in the bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document, with the value being its term frequency. For the sake of an example, consider the document “computer study computer science”. The vector representation of this document will be of size 3, with values [2, 1, 1] corresponding to computer, study, and science respectively. We can indeed represent every document in the corpus as a k-dimensional vector, where k is the number of unique terms in that document. Each dimension corresponds to a separate term in the document. Now every document lies in a common vector space. The dimensionality of the vector space is the total number of unique terms in the corpus. We will further analyze this model in the following sections. The representation of documents as vectors in a common vector space is known as the vector space model and it's very fundamental to information retrieval. It was introduced by Gerard Salton, a pioneer of information retrieval. Google's core ranking team is led by Amit Singhal, who was a PhD student of Salton at Cornell University.

If we use pure occurrence counts as term frequencies, longer documents will be favored. Consider two documents with exactly the same content, but one twice as long because it is concatenated with itself. The tf weights of each word in the longer document will be twice those of the shorter one, although the two documents essentially have the same content. To remedy this effect, we length-normalize term frequencies. So, the term frequency of a term t in document D now becomes:

tf_{t,d} = \dfrac{N_{t,d}}{||D||}

||D|| is known as the Euclidean norm and is calculated by taking the square of each value in the document vector, summing them up, and taking the square root of the sum. After normalizing the document vector, the entries are the final term frequencies of the corresponding terms. The document vector is then a unit vector, having a length of 1 in the vector space.
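As a minimal sketch (the function and variable names here are my own, not from the original source code), the normalized term frequencies of a single document could be computed like this:

import math
from collections import Counter

def normalized_tf(terms):
    # terms: list of terms in one document (after tokenization/stemming)
    # returns a dict mapping term -> tf_{t,d} = N_{t,d} / ||D||
    counts = Counter(terms)                                  # raw counts N_{t,d}
    norm = math.sqrt(sum(c * c for c in counts.values()))    # Euclidean norm ||D||
    return {term: count / norm for term, count in counts.items()}

# example: the document "computer study computer science"
print(normalized_tf(["computer", "study", "computer", "science"]))
# {'computer': 0.816..., 'study': 0.408..., 'science': 0.408...}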

Inverse Document Frequency – idf
We can't use term frequencies alone to calculate the weight of a term in the document, because tf considers all terms equally important. However, some terms occur more rarely and are more discriminative than others. Suppose we search for articles about computer vision. Here the term vision tells us more about the intent of the query than the term computer does. We don't simply want articles that are about computers, we want them to be about vision. If we purely use tf values, then the term computer will dominate because it's a more common term than vision, and the articles containing computer will be ranked higher. To mitigate this effect, we use inverse document frequency. Let's first see what document frequency is. The document frequency of a term t is the number of documents containing the term:

df_t = N_t

Note that the occurrence counts of the term within the individual documents are not important. We are only interested in whether the term is present in a document or not, without taking the counts into consideration; it's a binary 0/1 count. If we were to consider the number of occurrences across the documents, it would be called collection frequency, but document frequency proves to be more accurate. Also note that term frequency is a document-wise statistic while document frequency is collection-wise. Term frequency is the occurrence count of a term in one particular document only, while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let's look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of that term. Let's say we have N documents in the corpus; then the inverse document frequency of term t is:

idf_t = \dfrac{N}{df_t} = \dfrac{N}{N_t}

This is a very useful statistic, but it also requires a slight modification. Consider a corpus with 1000 documents. One term appears in 10 documents and another appears in 100, so the document frequencies are 10 and 100 respectively. The inverse document frequencies are then 100 and 10: idf is 100 for the term with df 10 (1000/10), and 10 for the term with df 100 (1000/100). So the term that appears in 10 times more documents is considered to be 10 times less important. We do expect the more frequent term to be considered less important, but the factor of 10 seems too harsh. Therefore, we take the logarithm of the inverse document frequencies. If the base of the log is 2, then a term that appears in 10 times fewer documents is considered roughly 3 times more important (log2(10) ≈ 3.3). So, the idf of a term t becomes:

idf_t = log\dfrac{N}{df_t}

This is better, and since log is a monotonically increasing function we can safely use it. Notice that idf never becomes negative, because the denominator (the df of a term) is always less than or equal to the size of the corpus (N). When a term appears in all documents, its df = N and its idf becomes log(1) = 0, which is fine: a term that appears in every document doesn't help us distinguish between them. It's basically a stopword, such as “the”, “a”, “an” etc. Also notice the resemblance between idf and the definition of self-information in information theory. In our case p(x) is df/N, the probability of seeing a term in a randomly chosen document, and idf is -log p(x). The important point is that the rarer an event, the more information its occurrence carries, which means less frequent terms give us more information.
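A small sketch of the idf computation (the function name is illustrative, not from the original code); it only needs the corpus size and the document frequency of each term, which is simply the length of the term's postings list:

import math

def inverse_document_frequencies(index, num_docs):
    # index maps each term to its postings list (one entry per document
    # containing the term), so df_t = len(index[term])
    # returns a dict mapping term -> idf_t = log(N / df_t), log base 2
    return {term: math.log(float(num_docs) / len(postings), 2)
            for term, postings in index.items()}

# a term appearing in all num_docs documents gets idf = log(1) = 0,
# so stopword-like terms contribute nothing to the final weights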

Tf-idf scoring
We have defined both tf and idf, and now we can combine these to produce the ultimate score of a term t in document d. We will again represent the document as a vector, with each entry being the tf-idf weight of the corresponding term in the document. The tf-idf weight of a term t in document d is simply the multiplication of its tf by its idf:

tf\mbox{-}idf_{t,d} = tf_{t,d} \cdot idf_t

Let's say we have a corpus containing K unique terms, and a document containing k unique terms. Using the vector space model, our document becomes a k-dimensional vector in a K-dimensional vector space. Generally k will be much less than K, because not all terms in the corpus appear in a single document. The values in the vector corresponding to the k terms that appear in the document will be their respective tf-idf weights, computed by the formula above. The entries corresponding to the K-k terms that don't appear in the current document will be 0, because their tf weight in the current document is 0, since they don't occur. Note that their idf scores won't be 0, because idf is a collection-wise statistic, which depends on all the documents in the corpus. But tf is a document-wise statistic, which only depends on the current document. So, if a term doesn't appear in the current document, it gets a tf score of 0. Multiplying tf and idf, the tf-idf weights of the missing K-k terms become 0. So, in the end we have a sparse vector with most of the entries being 0. To sum everything up, we represent documents as vectors in the vector space. A document vector has an entry for every term, with the value being its tf-idf score in the document.
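Putting the two previous sketches together, here is one way the sparse tf-idf vector of a document could be built (again only a sketch with my own names; only the k terms occurring in the document get an entry, everything else is implicitly 0):

def tfidf_vector(doc_terms, idf):
    # doc_terms: list of terms in the document
    # idf: dict term -> idf_t for the whole corpus
    # returns the sparse tf-idf vector as a dict term -> weight
    tf = normalized_tf(doc_terms)        # from the earlier sketch
    return {term: tf_value * idf.get(term, 0.0)
            for term, tf_value in tf.items()}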

We will also represent the query as a vector in the same K-dimensional vector space. It will have far fewer nonzero entries though, since queries are generally much shorter than documents. Now let's see how to find the documents relevant to a query. Since both the query and the documents are represented as vectors in a common vector space, we can take advantage of this. We will compute the similarity score between the query vector and all the document vectors, and select the ones with the top similarity values as the relevant documents for the query. Before computing the similarity scores between vectors, we will perform one final operation as we did before: normalization. We will normalize both the query vector and all the document vectors, obtaining unit vectors.

Now that we have everything we need, we can finally compute the similarity scores between the query and document vectors, and rank the documents. The similarity score between two vectors in a vector space is based on the angle between them. If two documents are similar they will be close to each other in the vector space, with a small angle in between. So given the vector representation of the documents, how do we compute the angle between them? We can do it very easily if the vectors are already normalized, which is true in our case, and this technique is called cosine similarity. We take the dot product of the vectors, and the result is the cosine of the angle between them. Remember that when the angle is smaller its cosine value is larger, so when two vectors are similar their cosine similarity value will be larger. This gives us a great similarity metric, with higher values meaning more similar and lower values meaning less similar. Therefore, if we compute the cosine similarity between the query vector and all the document vectors, sort them in descending order, and select the documents with top similarity, we will obtain an ordered list of relevant documents for this query. Voila! We now have a systematic methodology to get an ordered list of results to a query: ranking.
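As a hedged sketch using the sparse dict representation from the snippets above, cosine similarity between two unit vectors reduces to a dot product over their shared terms, and ranking is just sorting by that score:

import math

def normalize(vector):
    # scale a sparse vector (dict term -> weight) to unit length
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm > 0 else vector

def cosine_similarity(query_vec, doc_vec):
    # dot product of two unit vectors = cosine of the angle between them
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

def rank(query_vec, doc_vectors):
    # doc_vectors: dict docID -> sparse tf-idf vector
    # returns document IDs sorted by decreasing similarity to the query
    query_vec = normalize(query_vec)
    scores = {doc_id: cosine_similarity(query_vec, normalize(vec))
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)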

Source Code
Here is the source code. You also need to download the workspace from the create index post to obtain the necessary files. First run the create index program and then the query index program. You can type your queries at the command prompt and the program will display the top 10 documents that match the query, in order of decreasing tf-idf scores. Enjoy!

How to Implement a Search Engine Part 2: Query Index

Overview
This is the second part of our implementing a search engine project. The first part was about creating the inverted index. Now, we will use the index to answer actual search queries.

Query Types
Let’s first remember the query types. Our search engine is going to answer 3 types of queries that we generally use while searching.
1) One Word Queries (OWQ): OWQ consist of a single word, such as computer, or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): FTQ contain a sequence of words separated by spaces, like an actual sentence; for example computer science, or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): PQ also contain a sequence of words just like FTQ, but they are typed within double quotes. The meaning is that we want to see all query terms in the matching documents, exactly in the order specified. For example “Turing Award”, or “information retrieval and web search”.

Implementation
The create index program of the previous part creates the inverted index and saves it to disk. Our query index program will first read the index file from disk and reconstruct the index in memory, in the same format as in create index. As described in the previous post, each line in the index file corresponds to a term and its postings list, in the format: term|docID1:pos1,pos2;docID2:pos3,pos4,pos5;… We construct the index in memory by reading the index file line by line. After reading a line, we split it on the character “|”. The first part is the term, and the second part is its postings list. We further split the postings list part as follows. First we split it on “;”, which gives us the per-document entries of the term. Then we split each of those first on ‘:’ and then on ‘,’ to get the document ID and the list of positions where the term occurs in that document. We perform these operations on all lines of the index file, and obtain the same inverted index as in the create index program: a dictionary where each term is a key and the value is its postings list.
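A minimal sketch of that reconstruction step (the file name and function name are illustrative; the original source code may organize it differently):

def readIndex(path="index.txt"):
    # rebuild the in-memory inverted index from the saved index file
    # each line has the form: term|docID1:pos1,pos2;docID2:pos3,...
    # returns a dict mapping term -> [ [docID, [positions]], ... ]
    index = {}
    with open(path) as f:
        for line in f:
            term, postingsPart = line.rstrip("\n").split("|", 1)
            postings = []
            for docEntry in postingsPart.split(";"):
                if not docEntry:
                    continue
                docID, positions = docEntry.split(":", 1)
                postings.append([int(docID),
                                 [int(p) for p in positions.split(",")]])
            index[term] = postings
    return index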

After constructing the index from the file that create index outputted, we are ready to answer search queries. Each query type has its own implementation. OWQ and FTQ are easier to implement, but PQ are a little more difficult. Let's go through their implementations one by one. Here are the documents that we will use as examples:
Doc1: Brown University computer science department, computer department
Doc2: department of computer science Brown University – science department computer
Doc3: computer science at Brown & science computer

The inverted index of this collection of documents is as follows (to keep the demonstration simple, stemming is skipped when turning words into dictionary terms; but lowercasing, filtering out the stop words – the words “at” and “of” in this case – and eliminating non-alphanumeric characters are still performed):
{‘brown’: [ [1, [0]], [2, [3]], [3, [2]] ], ‘university’: [ [1, [1]], [2, [4]] ], ‘computer’: [ [1, [2, 5]], [2, [1, 7]], [3, [0, 4]] ], ‘science’: [ [1, [3]], [2, [2, 5]], [3, [1, 3]] ], ‘department’: [ [1, [4, 6]], [2, [0, 6]] ] }

The transformations performed on words of the collection, such as stemming, lowercasing, removing stopwords, and eliminating non-alphanumeric characters will be performed on the query as well. So, querying for computer or Computer is basically the same.

One Word Queries
The input in OWQ is a single term, and the output is the list of documents containing that term. If the term doesn't appear in the collection (hence there is no entry for it in our index), then the result is an empty list. Let's say we search for Brown. The output should be [1, 2, 3] because the term brown appears in documents 1, 2, and 3. If we query for university, then the result is [1, 2] because that term appears in documents 1 and 2. What we do is get the postings list of the query term and retrieve the document IDs, which are the first elements of the document lists in the postings list. So, the Python code for OWQ would be:

term = getQueryFromUser()
try:
    docs = [posting[0] for posting in index[term]]
except KeyError:
    # term is not in the index, so no document matches
    docs = []

Free Text Queries
The input in FTQ is a sequence of words, and the output is the list of documents that contain any of the query terms. So, we will get the list of documents for each query term and take the union of them. It's like evaluating an OWQ for every query term and taking the union of the results. So, for the query Brown University, the output would be [1, 2, 3]. Note that even though university doesn't appear in the 3rd document, it still matches the query because the term brown appears in that document. The result for the query computer science department is [1, 2, 3] because each document contains at least one of the query terms. Again, a document need not contain all of the query terms to be a match for the query; containing one or more of the query terms is sufficient. So, the code for FTQ would be:

# now the query is a list of terms
terms = getQueryFromUser()
docs = set()
for term in terms:
    try:
        termDocs = [posting[0] for posting in index[term]]
        docs |= set(termDocs)
    except KeyError:
        # term is not in the index, it matches no documents
        pass
docs = list(docs)

Phrase Queries
The input in PQ is again a sequence of words, and the matching documents are the ones that contain all query terms in the specified order. We will now use the positional information about the terms, which we didn't need for OWQ and FTQ. The implementation is as follows. First we need the documents that all query terms appear in. We again get the list of documents for each query term as we did in FTQ, but now we take the intersection of those lists instead of the union, because we want the documents that all query terms appear in, rather than the documents that any query term appears in. Then, we should check whether the terms are in the correct order or not. This is the tricky part. For each document that contains all query terms, we do the following. Get the positions of the query terms in the current document, and put each in a separate list; so if there are n query terms, there will be n lists, where each list contains the positions of the corresponding query term in the current document. Leave the position list of the 1st query term as it is, subtract 1 from each element of the 2nd position list, subtract 2 from each element of the 3rd position list, …, and subtract n-1 from each element of the nth position list. Then intersect all the position lists. If the result of the intersection is non-empty, then all query terms appear in the current document in the correct order, meaning this is a matching document. Perform these operations on all documents that every query term appears in. The matching documents are the ones that have a non-empty intersection. Why does this algorithm work? Because for each query term, we check whether it appears right after the previous query terms in the document, and we do this in an elegant way. Additionally, there is an optimization when performing the position list intersections. We can start intersecting from the smaller lists, because the size of the final intersection is always less than or equal to the size of the smallest list. Therefore, we can short-circuit whenever the intermediate intersection becomes empty, obtaining an efficient algorithm.

An example will make everything clear. Let's say we search for “computer science department”. We first get the document list of every query term, as we did in FTQ: computer: [1, 2, 3], science: [1, 2, 3], and department: [1, 2]. Then we intersect these lists to get the documents that contain all query terms, which is [1, 2]. Next, we should check whether the order is correct or not. First, we get the postings of the query terms for document 1, which are computer: [1, [2, 5]], science: [1, [3]], and department: [1, [4, 6]]. Then, we extract the positions of each query term and put them in separate lists, resulting in [ [2, 5], [3], [4, 6] ]. Each list corresponds to the positional information of one query term. We don't touch the first list, but subtract i-1 from the elements of the ith list, resulting in [ [2, 5], [2], [2, 4] ]. Finally, we take the intersection of the lists, which is [2]. Since the intersection is not empty, we conclude that document 1 is a matching document. Next, we check document 2. We get the positions of the query terms and put them in separate lists as before: [ [1, 7], [2, 5], [0, 6] ]. Perform the subtractions: [ [1, 7], [1, 4], [-2, 4] ]. And take the intersection: []. The result of the intersection is empty, meaning the query terms don't appear in the correct order, so this is not a matching document. There are no more documents that contain all query terms. So, the result of the phrase query is document 1 only: [1]. Here is the high level overview of the implementation (you can find all the details in the source code):

def phraseQuery(index):
    # get query terms
    terms = getQueryFromUser()
    for term in terms:
        if term not in index:
            # if a query term is not in the index,
            # there can't be any document containing it,
            # so there is no match for the query
            return []

    postings = getPostings(terms)
    docs = getDocsFromPostings(postings)
    docs = intersectLists(docs)
    if docs == []:
        # no document contains all query terms
        return []

    # get the postings of the query terms restricted to docs
    postings = getPostingsOfDocs(docs)

    result = []
    # check whether the terms are in the correct order:
    # perform the necessary subtractions
    postings = performSubtractions(postings)
    # then intersect the position lists document by document
    for doc, posting in zip(docs, postings):
        if intersectPositions(posting) != []:
            # current document is a match
            result.append(doc)

    return result
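To make the position-checking step concrete, here is a small self-contained sketch of the subtraction-and-intersection trick for a single document (the helper name is mine, not from the original source):

def phraseMatch(positionLists):
    # positionLists[i] holds the positions of the i-th query term in one document;
    # subtract i from the i-th list and intersect: any surviving position means
    # the query terms appear consecutively, in the specified order
    shifted = [set(p - i for p in positions)
               for i, positions in enumerate(positionLists)]
    return len(set.intersection(*shifted)) > 0

# document 1 of the example, query "computer science department":
print(phraseMatch([[2, 5], [3], [4, 6]]))     # True  (positions 2, 3, 4)
# document 2:
print(phraseMatch([[1, 7], [2, 5], [0, 6]]))  # False (no consecutive run)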

Source Code
Here is the source code. Note that you first have to create the index in order to query it. Or you can directly download and use the workspace (size: 10MB). You can type your queries at the command prompt and the program will display the document IDs that match the query. The results are not ranked; we will discuss how to rank search results in the next post.

How to Implement a Search Engine Part 1: Create Index

Overview
Ok, let’s start! We will implement a search engine that answers queries on Wikipedia articles. There will be two main parts of the project. First creating the index by going through the documents, and second answering the search queries using the index we created. Then we will also add ranking, classification, compression, and duplicate detection mechanisms.

What is our corpus
Our corpus (document collection) is Wikipedia articles. To simplify our work and focus on the core algorithms, we won’t write the crawler to get the articles from Wikipedia. We will use a prebuilt file which contains approximately 50,000 articles from various fields ranging from computer science to arts. The structure of the file is as follows:

<page>
<id> pageID (an integer) </id>
<title> title of the page </title>
<text>
Contents of the article
Can span multiple lines
May contain non-alphanumeric characters and links
</text>
</page>
<page>
<id> another pageID (every pageID is distinct) </id>
<title> title of another page </title>
<text>
Body of the article
</text>
</page>

As we can see, the collection has an XML-like structure. Every article lies between <page> and </page> tags, and the pageID, title, and text are separated by the corresponding tags. We will write our own routine to parse this structure using regular expressions, assuming that these special tags won't appear in the body of the articles. However, the articles may contain non-alphanumeric characters and capital letters that we will pay special attention to while building the index. For example, we don't want to index apple, Apple, and APPLE differently. They are basically all apple. Or, considering the grammatical structure of the language, we don't want to index research, researches, and researching separately. They are all about research. When we search for one of these terms, we would expect results containing any of those variations. Additionally, we wouldn't like to index words such as ‘the’, ‘a’, ‘an’, because they appear in almost every document and don't give us much information about the document or the query. These very common words are called stop words.
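As a hedged sketch of such a parsing routine (the regular expressions and function name are my own, relying on the assumption above that the tags never appear inside article bodies):

import re

# one pattern per field; re.DOTALL lets <text> span multiple lines
pageRE = re.compile(r"<page>(.*?)</page>", re.DOTALL)
idRE = re.compile(r"<id>(.*?)</id>", re.DOTALL)
titleRE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
textRE = re.compile(r"<text>(.*?)</text>", re.DOTALL)

def extractPages(collection):
    # yield (pageID, title, text) tuples from the collection string
    for page in pageRE.findall(collection):
        pageID = int(idRE.search(page).group(1).strip())
        title = titleRE.search(page).group(1).strip()
        text = textRE.search(page).group(1).strip()
        yield pageID, title, text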

Inverted Index
Chapters 1 and 2 of the Introduction to Information Retrieval cover the basics of the inverted index very well. To summarize, an inverted index is a data structure that we build while parsing the documents that we are going to answer the search queries on. Given a query, we use the index to return the list of documents relevant for this query. The inverted index contains mappings from terms (words) to the documents that those terms appear in. Each vocabulary term is a key in the index whose value is its postings list. A term’s postings list is the list of documents that the term appears in. To illustrate with an example, if we have the following documents:
Document 1: Information Retrieval and Web Search
Document 2: Search Engine Ranking
Document 3: Web Search Course

Then the postings list of the term ‘web’ would be the list [1, 3], meaning the term ‘web’ appears in documents with IDs 1 and 3. Similarly the postings list of the term ‘search’ would be [1, 2, 3], and for the term ‘course’ the postings list would be [3]. We may want to keep additional information in the index, such as the number of occurrences of the term in the whole collection (its collection frequency), the number of different documents that the term appears in (its document frequency), or the positions of the term's occurrences within a document. The amount of information we keep in our index will grow as we add more functionality to our search engine.

Query Types
So, what types of queries will our search engine answer? We will answer the types of queries that we use while searching every day. Namely:
1) One Word Queries: Queries that consist of a single word, such as movie, or hotel.
2) Free Text Queries: Queries that contain a sequence of words separated by spaces, such as godfather movie, or hotels in San Francisco.
3) Phrase Queries: These are more advanced queries that again consist of a sequence of words separated by spaces, but they are typed inside double quotes and we want the matching documents to contain the query terms exactly in the specified order, such as “godfather movie”.

Parsing the Collection
While parsing the document collection we will decide which words will be the terms in the index. As mentioned above, we don’t want every word to be a term. So, while parsing the Wikipedia articles we will perform the following operations on each page in this order:
1) Concatenate the title and the text of the page.
2) Lowercase all words.
3) Get all tokens, where a token is a string of alphanumeric characters terminated by a non-alphanumeric character. The alphanumeric characters are defined to be [a-z0-9]. So, the tokens for the word ‘apple+orange’ would be ‘apple’ and ‘orange’.
4) Filter out all the tokens that are in the stop words list, such as ‘a’, ‘an’, ‘the’.
5) Stem each token using the Porter Stemmer to finally obtain the stream of terms. The Porter Stemmer removes common endings from words; for example the stemmed versions of the words fish, fishes, fishing, fisher, and fished are all fish. A short sketch of this pipeline follows the list.
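A hedged sketch of the parsing pipeline (names are illustrative; here stopwords is assumed to be a set loaded from the stop words file and stem a function wrapping the Porter Stemmer):

import re

tokenRE = re.compile(r"[a-z0-9]+")

def getTerms(title, text, stopwords, stem):
    # turn one page into its stream of index terms, following steps 1-5 above
    content = (title + " " + text).lower()               # 1) concatenate, 2) lowercase
    tokens = tokenRE.findall(content)                    # 3) alphanumeric tokens
    tokens = [t for t in tokens if t not in stopwords]   # 4) filter out stop words
    return [stem(t) for t in tokens]                     # 5) stem to get final terms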

Building the Inverted Index
The inverted index is the main data structure of our search engine. We will use a Hashtable (python’s dictionary) to store the inverted index in memory. The reason is we will perform lots of lookups (one for every term in the document), and we will also add lots of keys (every term is a key), so we want these operations to be very efficient. Since Hashtables have average O(1) lookup time and amortized O(1) insertion time, they are very suitable for our needs.

As we discussed earlier, every term will be a key in our dictionary, whose value is its postings list (the list of documents that the term appears in). However, we would like to keep one additional piece of information in the postings list: the positions of the term's occurrences within the document. The reason is that to answer phrase queries we need positional information, because we want to check whether the query terms appear in the specified order. Without knowing the positions of the terms in the document, we can only check whether the query terms simply appear in a document. To verify the order, we need to know their positions. So for every occurrence of a term in a document, we will keep the occurrence position. For example, if we have the documents:
Document 1: web retrieval web search information
Document 2: search engine web ranking
Document 3: web search course information search

Now the postings list for the term ‘web’ is [ [1, [0, 2]], [2, [2]], [3, [1]] ], meaning the term ‘web’ appears in document 1 at positions 0 and 2 (we start counting positions from 0), in document 2 at position 2, and in document 3 at position 1. The postings list of a term is a list of lists, where each inner list corresponds to a specific document. So, there is a list for every document that the term appears in. Each of these lists contains the document ID as the first element and the list of occurrences of the term in that document as the second element. As another example, the postings list of the term ‘search’ is [ [1, [3]], [2, [0]], [3, [1, 4]] ], because ‘search’ appears in document 1 at position 3, document 2 at position 0, and document 3 at positions 1 and 4. If a term doesn't appear in a document, its postings list simply doesn't have an entry for that document. So, the postings list of the term ‘information’ is [ [1, [4]], [3, [3]] ].

We build the index as follows. First, we extract a page from the collection with our parsing routine. The page content is between <page> and </page> tags. Then we perform the operations listed in the section Parsing the Collection. As a result we have the list of terms in the document. Then we build the inverted index for the page in the format described above. But notice that since we are building an index for just the current page, the postings list of a term won’t be a list of lists. It will simply be a list where the first element is the document ID, and the second element is the list of positions the term appears in that document. Then we merge the index of the current page with the main inverted index, which is the index for the whole corpus. The merging is simple. For every term in the current page, we append its postings list to the postings list of that term in the main index (which is a list of lists as described above).

So, if we use the above example. First, we extract Document 1:
web retrieval web search information
Then we build the index for this document (note that for demonstration the terms are not stemmed; in the actual program a term would be stemmed by the Porter Stemmer before being added to the index, so the word ‘retrieval’ would be added to the index as the term ‘retriev’):
{ ‘web’: [1, [0, 2]], ‘retrieval’: [1, [1]], ‘search’: [1, [3]], ‘information’: [1, [4]] }
Then we merge this dictionary with our main dictionary (which is currently empty because this is the first document in the collection). Our main dictionary becomes:
{ ‘web’: [ [1, [0, 2]] ], ‘retrieval’: [ [1, [1]] ], ‘search’: [ [1, [3]] ], ‘information’: [ [1, [4]] ] }
Note that the postings entry of a term in the current page index is added inside a list in the main index, because the main index postings lists are lists of lists, with one list for every document the term appears in.

Then we extract the second document:
search engine web ranking
and build the index for that document:
{ ‘search’: [2, [0]], ‘engine’: [2, [1]], ‘web’: [2, [2]], ‘ranking’: [2, [3]] }
And we merge the current page index with the main index, which simply means appending the current postings of a term to the postings list of the corresponding term in the main index. After merging, our main index becomes:
{ ‘web’: [ [1, [0, 2]], [2, [2]] ], ‘retrieval’: [ [1, [1]] ], ‘search’: [ [1, [3]], [2, [0]] ], ‘information’: [ [1, [4]] ], ‘engine’: [ [2, [1]] ], ‘ranking’: [ [2, [3]] ] }
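A minimal sketch of that merge step (the function name is mine; the original code may organize it differently):

def mergePageIndex(mainIndex, pageIndex):
    # pageIndex maps term -> [docID, [positions]] for a single page;
    # mainIndex maps term -> list of such entries, one per document
    for term, posting in pageIndex.items():
        mainIndex.setdefault(term, []).append(posting)

mainIndex = {}
mergePageIndex(mainIndex, {'web': [1, [0, 2]], 'retrieval': [1, [1]],
                           'search': [1, [3]], 'information': [1, [4]]})
mergePageIndex(mainIndex, {'search': [2, [0]], 'engine': [2, [1]],
                           'web': [2, [2]], 'ranking': [2, [3]]})
# mainIndex['web'] is now [ [1, [0, 2]], [2, [2]] ], as in the example above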

We continue like this and build our main inverted index for the whole collection. Our query answering program will use this inverted index. So, we should save this index to a file, because our query index program will be separate. First the create index program will run and write the index to a file. Then the query index program will execute by reading the index from the file and answering the search queries using that index. So, we should decide on a format for saving the index to the file.

The index is stored as text in the following format:
term|docID1:pos1,pos2;docID2:pos3,pos4,pos5;…
Every line of the file contains a separate term. The line starts with a term, and then the character ‘|’ is used to separate the term from its postings list. The postings list of a term has the following form. First document ID containing the term, followed by a colon, followed by the positions of the term in that document with commas in between, semicolon, second document ID containing the term, colon, comma delimited positions, and it goes on like this. Using the above example, the term – postings list pair ‘web’: [ [1, [0, 2]], [2, [2]] ] would be saved as:
web|1:0,2;2:2
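Here is a small sketch of writing the index in that format (names are illustrative; the query index program parses these lines back into the dictionary):

def writeIndex(index, path="index.txt"):
    # save the index as lines of the form term|docID1:pos1,pos2;docID2:pos3,...
    with open(path, "w") as f:
        for term, postings in index.items():
            entries = ";".join("%d:%s" % (docID, ",".join(str(p) for p in positions))
                               for docID, positions in postings)
            f.write("%s|%s\n" % (term, entries))

# {'web': [ [1, [0, 2]], [2, [2]] ]} is written as the line:  web|1:0,2;2:2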

Here is another example of index creation. All the parsing operations are performed (including stemming) on the words to obtain the terms.

Experiments
We have two collections: a small test collection of size 40MB which contains around 5,000 documents, and a larger full collection of size 300MB containing around 40,000 documents. These collections are of course not representative of the web, but they are good enough for experimentation purposes. Let's look at some statistics:

                                   Test Collection    Full Collection
Time without Stemming (min:sec)    0:23               3:14
Time with Stemming (min:sec)       1:56               15:16
Number of Documents                5,155              41,141
Size of Collection (MB)            38                 301
Index Size (MB)                    25                 201
Number of Terms                    145,005            616,723

Stemming is a very useful and important operation, but it increases the execution time by roughly a factor of 5. The reason is that stemming is performed on every term and it involves lots of checks. All statistics except the first row are reported with stemming performed.

The distribution of terms in a collection generally follows Zipf's Law, which states that the frequency of a word is inversely proportional to its rank. So, the most frequent word will appear about 2 times as often as the second most frequent word, 5 times as often as the fifth most frequent word, and so on. Let's verify this by plotting the term frequencies in the full collection:

The graph is a log-log plot. The x-axis is the rank of the term (the first one is the most frequent term, the second one is the second most frequent, etc.), and the y-axis is the collection frequency of that term (the number of occurrences in the full collection). The term distribution approximately follows Zipf's Law. Even though the model is not a perfect fit, it's good enough to be considered a useful model.
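For reference, a plot like this can be produced with a few lines (a sketch assuming collectionFreq maps each term to its collection frequency; matplotlib is not part of the original workspace):

import matplotlib.pyplot as plt

def plotZipf(collectionFreq):
    # log-log plot of term rank vs. collection frequency
    freqs = sorted(collectionFreq.values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    plt.loglog(ranks, freqs)
    plt.xlabel("term rank")
    plt.ylabel("collection frequency")
    plt.show()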

Source Code
I wrote the code in python because it’s both easily understood and widely used. Here is the source code. If you want to test it yourself, here is the workspace (size: 10MB). It contains the compressed test collection, stopwords file, Porter Stemmer, and the source code.

Next Steps..
So, this was our create index program. We will next write the query index program that answers search queries using the index that we just built. Then we will add ranking to our search engine: we will implement the tf-idf (term frequency – inverse document frequency) ranking scheme and Google's PageRank. Then we will add classifiers to our search engine, such as Support Vector Machine (SVM) classifiers. We will also compress the index stored on disk.
