How to Implement a Search Engine Part 2: Query Index

Overview
This is the second part of our implementing a search engine project. The first part was about creating the inverted index. Now, we will use the index to answer actual search queries.

Query Types
Let’s first remember the query types. Our search engine is going to answer 3 types of queries that we generally use while searching.
1) One Word Queries (OWQ): OWQ consist of only a single word. Such as computer, or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): FTQ contain sequence of words separated by space like an actual sentence. Such as computer science, or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): PQ also contain sequence of words just like FTQ, but they are typed within double quotes. The meaning is, we want to see all query terms in the matching documents, and exactly in the order specified. Such as “Turing Award”, or “information retrieval and web search”.

Implementation
Create index program of the previous part creates the inverted index and saves it to disk. Our query index program will first read the index file from disk and construct the index back in memory, in the same format as in create index. As described in the previous post, each line in the index file corresponds to a term and its postings list. The format was: term|docID1:pos1,pos2;docID2:pos3,pos4,pos5;… We construct the index in memory by reading the index file line by line. After reading a line, we split it on character “|”. The first part is the term, and the second part is its postings list. We further split the postings list part as follows. First we split it on “;”. This gives us the document lists in which the term appears in. Then we split each of those first on ‘:’, and then on ‘,’ to get the document ID, and the list of positions that the term occurs in that document. We perform these operations on all lines of the index file, and construct the same inverted index in the create index program. Which is a dictionary where each term is a key and the value is its postings list.

After constructing the index from the file that create index outputted, we are ready to answer search queries. Each query type has its own implementation. OWQ, and FTQ are easier to implement, but PQ are a little more difficult. Let’s go through their implementations one by one. Here are the documents that we will use as examples:
Doc1: Brown University computer science department, computer department
Doc2: department of computer science Brown University – science department computer
Doc3: computer science at Brown & science computer

The inverted index of this collection of documents as follows (to obtain dictionary terms from words, stemming is omitted for easier demonstration; but the lowercasing the word, filtering out the stop words – the words “at” and “of” in this case – and eliminating non-alphanumeric characters are not omitted):
{‘brown’: [ [1, [0]], [2, [3]], [3, [2]] ], ‘university’: [ [1, [1]], [2, [4]] ], ‘computer’: [ [1, [2, 5]], [2, [1, 7], [3, [0, 4]] ], ‘science’: [ [1, [3]], [2, [2, 5]], [3, [1, 3]] ], ‘department’: [ [1, [4, 6], [2, [0, 6]] ] }

The transformations performed on words of the collection, such as stemming, lowercasing, removing stopwords, and eliminating non-alphanumeric characters will be performed on the query as well. So, querying for computer or Computer is basically the same.

One Word Queries
The input in OWQ is a single term, and the output is the list of documents containing that term. If the term doesn’t appear in the collection (hence there is no entry for that term in our index), then the result is an empty list. Let’s say we search for Brown. The output should be [1, 2, 3] because the term brown appears in documents 1, 2, and 3. If we query for university, then the result is [1, 2] because that term appears in documents 1 and 2. What we are doing is, we get the postings list of the query term and retrieve the document IDs, which is the first element in the document lists of the postings list. So, the python code for OWQ would be:

try:
    term=getQueryFromUser()
    docs=[posting[0] for posting in index[term]]
except:
    #term is not in index
    docs=[]

Free Text Queries
The input in FTQ is a sequence of words, and the output is the list of documents that contain any of the query terms. So, we will get the list of documents for each query term, and take the union of them. It’s like evaluating a OWQ for every query term, and taking the union of the results. So, for the query Brown University, the output would be [1, 2, 3]. Note that even though university doesn’t appear in the 3rd document, it still matches the query because the term brown appears in that document. The result for the query computer science department is [1, 2, 3] because each document contains at least one of the query terms. Again, a document need not contain all of the query terms to be a match for the query. Containing 1 or more of the query terms is sufficient. So, the code for FTQ would be:

#now the query is a list of terms
terms=getQueryFromUser()
docs=set()
for term in terms:
    try:
        termDocs=[posting[0] for posting in index[term]]
        docs|=set(termDocs)
    except:
        #term is not in index
        pass
docs=list(docs)

Phrase Queries
The input in PQ is again a sequence of words, and the matching documents are the ones that contain all query terms in the specified order. We will now use the positional information about the terms which we didn’t need to use in OWQ and FTQ. The implementation is as follows. First we need the documents that all query terms appear in. We will again get the list of documents for each query term as we did in FTQ, but now we will take the intersection of them instead of union. Because we want the documents that all query terms appear in, instead of the documents that any query term appears. Then, we should check whether they are in correct order or not. This is the tricky part. For each document that contains all query terms, we should do the following operations. Get positions of the query terms in the current document, and put each to a separate list. So, if there are n query terms, there will be n lists where each list contains the positions of the corresponding query term in the current document. Leave the position list of the 1st query term as it is, subtract 1 from each element of the 2nd position list, subtract 2 from each element of the 3rd position list, …, subtract n-1 from each element of nth position list. Then intersect all the position lists. If the result of the intersection is non-empty, then all query terms appear in the current document in correct order, meaning this is a matching document. Perform these operations on all documents that every query term appears in. The matching documents are the ones that have non-empty intersection. Why does this algorithm work? Because, for each query term, we check whether it appears right after the previous query terms in the document. And we do this in an elegant way. Additionally, there is an optimization while performing position list intersections. We can start intersecting from smaller lists, because the size of the final intersection is always less than or equal to the size of the smallest list. Therefore, we can short-circuit whenever the intermediate intersection result becomes an empty list, obtaining an efficient algorithm.

An example will make everything clear. Let’s say we search for “computer science department”. We first get the document list of all query terms, as we did in FTQ: computer: [1, 2, 3], science: [1, 2, 3], and department: [1, 2]. Then we intersect all these lists to get the documents that contain all query terms, which is [1, 2]. Next, we should check whether the order is correct or not. First, we get the postings list of the query terms for document 1. Which is computer: [1, [2, 5]], science: [1, [3]], and department: [1, [4, 6]. Then, we extract the positions of each query term, and put them in separate lists, resulting in [ [2, 5], [3], [4, 6] ]. Each list corresponds to the positional information of a query term. We don’t touch the first list, but subtract i-1 from the elements in the ith list, resulting in [ [2, 5], [2], [2, 4] ]. Finally, we take the intersection of the lists, which is [2]. Since the intersection is not empty, we conclude that document 1 is a matching document. Next, we check document 2. Get the positions of query terms and put them to separate lists as before: [ [1, 7], [2, 5], [0, 6] ]. Perform the subtractions: [ [1, 7], [1, 4], [-2, 4] ]. And take the intersection: []. The result of the intersection is empty, meaning the query terms don’t appear in the correct order, so this is not a matching document. There is no more document that contains all query terms. So, the result of the phrase query is document 1 only: [1]. Here is the high level overview of the implementation (you can find all the details in the source code):

#get query terms
terms=getQueryFromUser()
for term in terms:
    if term not in index:
        #if a query term is not in the index,
        #there can't be any document containing it
        #so there is no match for the query
        return []
 
postings=getPostings(terms)
docs=getDocsFromPostings(postings)
docs=intersectLists(docs)
if docs==[]:
    #no document contains all query terms
    return []
 
#get the posting list of terms in docs
postings=getPostingsOfDocs(docs)
 
result=[]
#check whether terms are in correct order
#perform the necessary subtractions
postings=performSubtractions(postings)
#intersect the locations
for posting in postings:
    if intersectPositions(posting)!=[]:
        #current document is a match
        result.append(currentDocument)
 
return result

Source Code
Here is the source code. Note that you first have to create index in order to query it. Or you can directly download and use the workspace (size: 10MB). You can write your queries on command prompt and the program will display the document IDs that match the query. The results are not ranked, we will discuss how to rank search results in the next post.

VN:F [1.9.11_1134]
Rating: 10.0/10 (3 votes cast)
How to Implement a Search Engine Part 2: Query Index, 10.0 out of 10 based on 3 ratings

This entry was posted in Information Retrieval, Search Engines, Web Search. Bookmark the permalink.

5 Responses to How to Implement a Search Engine Part 2: Query Index

  1. Canan says:

    Saved as a favorite, I really like your blog! :)

  2. agundez says:

    your weblog is so uncomplicated to study, i like this write-up, so maintain submitting much more :)

  3. Remarkable post, thank you, I’ll book mark you now!

  4. hero says:

    Hi Arden, why don’t you compare the next blocks one by one instead of subtracting numbers from the list and then check whether the list contains a common number?
    I mean, for this list: [ [2, 5], [3], [4,6] ] you could have 2 in mind and check if the next block contains 2+1=3 and the next block contains 2+2=4.

    • Arden says:

      You can do like that as well, it’s essentially the same thing. However, it’ll be less efficient. Because now we have to check for all elements in first list, whether the corresponding numbers are present in all other lists, one by one. We can use binary search to perform that check, since the lists are already sorted, but then it becomes more complicated and the code is much longer. In the subtraction approach, we’re basically doing the same thing but in a more efficient way. Because we’re taking advantage of python’s built in set data structure to perform intersections (the details are in the source code, I didn’t include it in the above code snippet to keep it simple).

      Let’s say we have N lists and there are K elements per list on the average. The complexity of the subtraction approach is: O(NK)
      O(NK) for subtractions and also O(NK) for set intersections (they can be done in linear time using hashtables).
      The complexity of the alternative approach would be: O(KNlogK)
      For all elements in the first list (K), check whether the correct number appears in all other lists (NlogK. logK to binary search a list and we do this for all N lists). It may not seem as a big difference but note that generally K>>N. Because there will be just a couple of terms in the query (N), but those terms will appear many times in the documents (K).

      In my toy example the difference isn’t noticed, but imagine those 3 lists to be longer (let’s say they contain 50 elements). And suppose that the first match occurs at 40th element in the first list. Then, for 39 previous elements, we have to check whether +1 is present in the second list, and +2 is present in the 3rd list, repetitively. We can instead perform the subtractions just once, and check for a common element by simply intersecting the lists.

      Here is the intersectLists function for clarification:

      def intersectLists(lists):
              if len(lists)==0:
                  return []
              #start intersecting from the smaller list
              lists.sort(key=len)
              return list(reduce(lambda x,y: set(x) & set(y),lists))

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">