What if we build an index for the phrases themselves instead of each single word… will it help?
About your second question: changing how you represent the corpus will only require modifications to the createIndex program. You’ll need to parse the XML and extract the document content accordingly, and then create the inverted index. As long as you create the index in the same format, queryIndex stays the same.
Sorry for the late response, by the way.
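For illustration, here is a minimal sketch of what that parsing step could look like, assuming the corpus is a single XML file with doc elements carrying an id attribute and a text child. The tag names are hypothetical, so adapt them to your actual schema:

import xml.etree.ElementTree as ET

def parseCorpus(path):
    # hypothetical corpus layout: <corpus><doc id="1"><text>...</text></doc>...</corpus>
    tree = ET.parse(path)
    for doc in tree.getroot().iter('doc'):
        # yield (document id, document content) pairs for createIndex to tokenize
        yield doc.get('id'), doc.findtext('text', default='')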
Let’s say we have N lists with K elements per list on average. The complexity of the subtraction approach is O(NK): O(NK) for the subtractions and also O(NK) for the set intersections (which can be done in linear time using hashtables).
The complexity of the alternative approach would be O(KN log K): for each of the K elements in the first list, check whether the correct number appears in every other list, which costs N log K (log K to binary search one list, repeated for all N lists). It may not seem like a big difference, but note that generally K >> N, because there will be just a couple of terms in the query (N), while those terms will appear many times in the documents (K).
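For concreteness, here is a minimal sketch of this alternative approach, assuming each list holds the sorted positions of one query term within a document (the function names are mine, not from the original code):

from bisect import bisect_left

def containsSorted(sortedList, value):
    # O(log K) membership test in a sorted list via binary search
    i = bisect_left(sortedList, value)
    return i < len(sortedList) and sortedList[i] == value

def phraseMatchBinarySearch(positionLists):
    # for every position p in the first term's list (K of them), verify that
    # p+1, p+2, ... appear in the following lists: O(K * N log K) overall
    first, rest = positionLists[0], positionLists[1:]
    return [p for p in first
            if all(containsSorted(lst, p + i) for i, lst in enumerate(rest, start=1))]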
In my toy example the difference isn’t noticeable, but imagine those 3 lists were longer, say 50 elements each, and suppose the first match occurs at the 40th element of the first list. Then, for the 39 preceding elements, we repeatedly have to check whether +1 is present in the second list and +2 is present in the third list. We can instead perform the subtractions just once and check for a common element by simply intersecting the lists.
Here is the intersectLists function for clarification:
from functools import reduce  # reduce is a builtin in Python 2 but lives in functools in Python 3

def intersectLists(lists):
    if len(lists) == 0:
        return []
    # start intersecting from the smallest list
    lists.sort(key=len)
    return list(reduce(lambda x, y: set(x) & set(y), lists))
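Building on that, here is a small sketch of the subtraction approach itself, with made-up position lists (the shift convention matches the +1/+2 checks described above):

def phraseMatchBySubtraction(positionLists):
    # shift the i-th list down by i, once per list (O(NK)), so that a
    # phrase occurrence becomes a common element of all shifted lists
    shifted = [[p - i for p in lst] for i, lst in enumerate(positionLists)]
    return intersectLists(shifted)

# positions of three consecutive query terms in one document;
# a match at 5 means term1@5, term2@6, term3@7
print(phraseMatchBySubtraction([[2, 5, 11], [6, 14], [7, 20]]))  # [5]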