Implementation of search engine ranking in Python (Part 2)

previous section we built the index, but we still can't run queries on it. About this I and tell in this article.

querying the index

So, there are two types of queries that we want to process: standard queries, where at least one of the words in the query appears in the document and the query phrase, where all the query words encountered in the document in the same order.

However, before we start, I would recommend to handle the request the same way as we processed the documents when you built the index by converting all words, making all letters lowercase and removing punctuation. I won't go into it because it's trivial, but it must be done before executing the query.

note: in all code examples below, each function will use a variable named ‘invertedIndex’, which is generated in the previous part of the article. For a full understanding of what is happening below you can see the final result on GitHub.

We are going to implement standard queries in the first place. Easy way to get them is to break the query into words (tokens as described above), to obtain the list for each word, the documents in which they occur, and then combine all these lists. Here is how we will fulfill the request for a single word:

the

def one_word_query(self, word):
pattern = re.compile('[\W_]+')
word = pattern.sub(' ',word)
if word in self.invertedIndex.keys():
return [filename for filename in self.invertedIndex[word].keys()]
else:
return []

This code is pretty simple. Everything we do here is the processed input using a regular expression and return a list of all keys in the hash table for the word in the index (which is a list of file names that contain the word).

Now a standard request is a very simple extension. We just aggregate and combine the lists as shown here:

the

def free_text_query(self, string):
pattern = re.compile('[\W_]+')
string = pattern.sub(' ',string)
result = []
for word in string.split():
result += self.one_word_query(word)
return list(set(result))

If you want to make a query that ensures that each word in the query is found in the final list, then you have to use the intersection instead of merging with multiple results of individual queries containing the word. It's pretty trivial to do, and I'll leave it as an exercise for the reader.

The last type of query is a query phrase that a little more difficult as we need to ensure the correct order of words in documents. Here is the code for the query (I'll explain later):

the

def phrase_query(self, string):
pattern = re.compile('[\W_]+')
string = pattern.sub(' ',string)
listOfLists, result = [],[]
for word in string.split():
listOfLists.append(self.one_word_query(word))
setted = set(listOfLists[0]).intersection(*listOfLists)
for filename in setted:
temp = []
for word in string.split():
temp.append(self.invertedIndex[word][filename][:])
for i in range(len(temp)):
for ind in range(len(temp[i])):
temp[i][ind] -= i
if set(temp[0]).intersection(*temp):
result.append(filename)
return self.rankResults(result, string)

And so, once again, we first processed the text in the input query. Then we run one word from the query for each word, add each of these lists results in our General list. We then create a set with the name of 'setted' that accepts the intersection of the first list with all other lists (in essence, taking the intersection of all lists), and leaves us with an intermediate result: the set of all documents containing all the query words.

Now we need to check the intermediate result. So, for each list in the intermediate result, we first make a list of lists of positions of each word in the input query. Then (attention!) we use two nested for loop to iterate through a list of lists. For each item in each list we will take a room i, which increases by 1 when we go through a list of lists. Now, remember that in Python lists keep order, so this list of lists contains the lists of positions of each word in the source query in the order of words in the initial query. Then, if these words are in the correct order, and we subtract the integer i from each position in each list items, and i increases by 1 every time we go through the following list of positions, if the phrase is in the intermediate result, the intersection of all of these modified lists of lists must have length at least one.
Let me give an example:

For example, a phrase in the query is “the cake is a lie”. Now suppose that for a specific file positions of each word are:
the

cake: [19, 35, 12]
this: [179, 36, 197]
false: [221, 37, 912]

Now, our list of lists is:
the

[[19, 35, 12], [179, 36, 197], [221, 37, 912]]

Now, we subtract the i from each element in each list, where i is zero for the first list, 1 for the second list, 2 for the third list, etc.
the

[[19, 35, 12], [178, 35, 196], [219, 35, 910]]

Now we take the intersection of all lists in which the left value of the element number 35. Then it would be possible to determine that the phrase “the cake is a lie” is in the file. And this is true: if we look at the initial list, we see that the sequence "35, 36, 37" gives us our phrase.

There are many more parameters that you could implement yourself (see the advanced search Google for inspiration). You can try to implement some of them on your search engine.

Our last step is to implement a query parser that will allow us to combine different types of queries to get one result set. For example, you can enter something like 'cake “is a lie”' in Google, which combines standard queries (the query) and phrase queries (“is a lie”). It is also very easy to do: just use delimiters (e.g., quotation marks) to indicate a certain type of queries, get rid of each of the smaller queries separately, and then intersect all of these result sets to obtain the final list of documents.

next, the final part, I will tell you about the ranking of results and conduct closing.

Article based on information from habrahabr.ru

Поиск по этому блогу

computer express

Implementation of search engine ranking in Python (Part 2)

querying the index

Комментарии

Отправить комментарий

Популярные сообщения из этого блога

Import iblock from 1C-Bitrix to MODx Revolution

Briefly on how to make your Qt geoservice plugin

Database replication PostgreSQL-based SymmetricDS