Arden Dertat | 3/3 | Web Search and Information RetrievalArden Dertat | Web Search and Information Retrieval

Programming Interview Questions 1: Array Pair Sum

Posted on September 17, 2011 by Arden

Once again it’s the college recruiting season of the year and tech companies started the interview process for full time and internship positions. I had many interviews last year these days for a summer internship. Eventually I was an intern at Microsoft Bing, and will be joining there full time next summer. I won’t have any interviews this year, but since most of my friends are actively preparing for them nowadays, I thought it would be useful to share some good quality interview questions and provide my solutions. I come across this particular question pretty often recently: Given an integer array, output all pairs that sum up to a specific value k.

Continue reading →

Rating: 10.0/10 (3 votes cast)

Posted in Programming Interview | 6 Comments

How to Implement a Search Engine Part 3: Ranking tf-idf

Posted on July 17, 2011 by Arden

Overview
We have come to the third part of our implementing a search engine project, ranking. The first part was about creating the index, and the second part was querying the index. We basically have a search engine that can answer search queries on a given corpus, but the results are not ranked. Now, we will include ranking to obtain an ordered list of results, which is one of the most challenging and interesting parts. The first ranking scheme we will implement is . In the following articles, we’ll analyze , which is a variant of tf-idf. We will also implement Google’s . Then we will explore Machine Learning techniques such as , (SVM), , and so forth.

Tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important. It’s one of the most popular weighting schemes in Information Retrieval.

Continue reading →

Rating: 9.5/10 (4 votes cast)

Posted in Information Retrieval, Search Engines, Web Search | Leave a comment

How to Implement a Search Engine Part 2: Query Index

Posted on May 31, 2011 by Arden

Overview
This is the second part of our implementing a search engine project. The first part was about creating the inverted index. Now, we will use the index to answer actual search queries.

Query Types
Let’s first remember the query types. Our search engine is going to answer 3 types of queries that we generally use while searching.
1) One Word Queries (OWQ): OWQ consist of only a single word. Such as computer, or university. The matching documents are the ones containing the single query term.
2) Free Text Queries (FTQ): FTQ contain sequence of words separated by space like an actual sentence. Such as computer science, or Brown University. The matching documents are the ones that contain any of the query terms.
3) Phrase Queries (PQ): PQ also contain sequence of words just like FTQ, but they are typed within double quotes. The meaning is, we want to see all query terms in the matching documents, and exactly in the order specified. Such as “Turing Award”, or “information retrieval and web search”.

Continue reading →

Rating: 10.0/10 (3 votes cast)

Posted in Information Retrieval, Search Engines, Web Search | 5 Comments

My Favorite Interview Question

Posted on May 30, 2011 by Arden

I am working at Microsoft Bing as an intern this summer. To get an internship I had lots of interviews with various tech companies this year, and this was my favorite question: In an integer array with N elements (N is large), find the minimum k elements (k << N).

Continue reading →

Rating: 10.0/10 (4 votes cast)

Posted in Programming Interview | 9 Comments

Python Optimization

Posted on May 30, 2011 by Arden

Fast executing efficient code is always desirable, especially for programs that will operate on web-scale data. Small performance gains will lead to big time improvements due to massive size of the input. Here I want to share python optimization lessons I learned while coding the create index program of the previous post. I assume that you are familiar with python basics such as list comprehensions, map, filter, etc.

Map vs List Comprehension
If you’re going to use lambda functions, use list comprehension instead of map.
If you’re going to use built-in functions, use map instead of list comprehension.

Continue reading →

Rating: 10.0/10 (1 vote cast)

Posted in Optimization, Python | Leave a comment

How to Implement a Search Engine Part 1: Create Index

Posted on May 30, 2011 by Arden

Overview
Ok, let’s start! We will implement a search engine that answers queries on Wikipedia articles. There will be two main parts of the project. First creating the index by going through the documents, and second answering the search queries using the index we created. Then we will also add ranking, classification, compression, and duplicate detection mechanisms.

What is our corpus
Our corpus (document collection) is Wikipedia articles. To simplify our work and focus on the core algorithms, we won’t write the crawler to get the articles from Wikipedia. We will use a prebuilt file which contains approximately 50,000 articles from various fields ranging from computer science to arts. The structure of the file is as follows:

Continue reading →

Rating: 10.0/10 (3 votes cast)

Posted in Information Retrieval, Search Engines, Web Search | 3 Comments

About me..

Posted on March 29, 2011 by Arden

Hello world! I’m Arden. I’m a first year Computer Science Master student at . My specialty is Information Retrieval, particularly Web Search. I’m from beautiful city Istanbul, Turkey. I did my undergraduate at Bogazici University Computer Engineering Department (though the curriculum was more Computer Science than Engineering, we don’t have CS at Turkey). This is my first year in US, and I really enjoy the life here. I will do an internship at Microsoft this summer, in the Bing team at Seattle.

Continue reading →

Rating: 10.0/10 (5 votes cast)

Posted in About | 4 Comments

Recent Posts