What if we build an index for the phrases themselves instead of each single word… will it help?
About your second question: changing how you represent the corpus will only require modifications to the createIndex program. You’ll need to parse the XML and extract the document content accordingly, and then create the inverted index. As long as you create the index in the same format, queryIndex stays the same.
Sorry for the late response, by the way.
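For illustration, here is a minimal sketch of what that parsing step could look like, assuming the corpus is a single XML file with doc elements carrying an id attribute and a text child. The tag names are hypothetical, so adapt them to your actual schema:

import xml.etree.ElementTree as ET

def parseCorpus(path):
    # hypothetical corpus layout: <corpus><doc id="1"><text>...</text></doc>...</corpus>
    tree = ET.parse(path)
    for doc in tree.getroot().iter('doc'):
        # yield (document id, document content) pairs for createIndex to tokenize
        yield doc.get('id'), doc.findtext('text', default='')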
Let’s say we have N lists with K elements per list on average. The complexity of the subtraction approach is O(NK): O(NK) for the subtractions and also O(NK) for the set intersections (which can be done in linear time using hashtables).
The complexity of the alternative approach would be O(KN log K): for each of the K elements in the first list, check whether the correct number appears in every other list, which costs N log K (log K to binary search one list, repeated for all N lists). It may not seem like a big difference, but note that generally K >> N, because there will be just a couple of terms in the query (N), while those terms will appear many times in the documents (K).
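For concreteness, here is a minimal sketch of this alternative approach, assuming each list holds the sorted positions of one query term within a document (the function names are mine, not from the original code):

from bisect import bisect_left

def containsSorted(sortedList, value):
    # O(log K) membership test in a sorted list via binary search
    i = bisect_left(sortedList, value)
    return i < len(sortedList) and sortedList[i] == value

def phraseMatchBinarySearch(positionLists):
    # for every position p in the first term's list (K of them), verify that
    # p+1, p+2, ... appear in the following lists: O(K * N log K) overall
    first, rest = positionLists[0], positionLists[1:]
    return [p for p in first
            if all(containsSorted(lst, p + i) for i, lst in enumerate(rest, start=1))]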
In my toy example the difference isn’t noticeable, but imagine those 3 lists were longer, say 50 elements each, and suppose the first match occurs at the 40th element of the first list. Then, for the 39 preceding elements, we repeatedly have to check whether +1 is present in the second list and +2 is present in the third list. We can instead perform the subtractions just once and check for a common element by simply intersecting the lists.
Here is the intersectLists function for clarification:
from functools import reduce  # reduce is a builtin in Python 2 but lives in functools in Python 3

def intersectLists(lists):
    if len(lists) == 0:
        return []
    # start intersecting from the smallest list
    lists.sort(key=len)
    return list(reduce(lambda x, y: set(x) & set(y), lists))
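Building on that, here is a small sketch of the subtraction approach itself, with made-up position lists (the shift convention matches the +1/+2 checks described above):

def phraseMatchBySubtraction(positionLists):
    # shift the i-th list down by i, once per list (O(NK)), so that a
    # phrase occurrence becomes a common element of all shifted lists
    shifted = [[p - i for p in lst] for i, lst in enumerate(positionLists)]
    return intersectLists(shifted)

# positions of three consecutive query terms in one document;
# a match at 5 means term1@5, term2@6, term3@7
print(phraseMatchBySubtraction([[2, 5, 11], [6, 14], [7, 20]]))  # [5]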