Simple solutions¶
The real key technique used by search engines
The tool needed¶
Spider¶
It's a tool to help you fetch web pages from the Internet.
How can we get the pages that contain "Computer Science"?¶
Scan each page for the string "Computer Science"¶
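A minimal sketch of this brute-force scan, assuming the spider has already stored each fetched page as plain text in a dictionary. The page names and contents are made up for illustration, and `pages_containing` is a hypothetical helper.

```python
def pages_containing(pages, phrase):
    """Return the ids of all pages whose text contains `phrase` (case-sensitive)."""
    # Every page is scanned in full, so the cost grows with the total text size.
    return [page_id for page_id, text in pages.items() if phrase in text]


# Hypothetical pages, standing in for what a spider would fetch.
pages = {
    "a.html": "Computer Science is the study of computation.",
    "b.html": "Cooking recipes for beginners.",
    "c.html": "Admissions: Department of Computer Science.",
}

matches = pages_containing(pages, "Computer Science")
print(matches)
```

Every query rescans every page, which is why this simple solution does not scale to the whole web.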
Term-Document Incidence Matrix¶
How can we get the passage which contains both "silver" and "truck"?
- However, this approach leaves a large number of 0s in the table, which wastes both space and time.
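As an illustration, here is a small sketch of a term-document incidence matrix and an AND query over it. The three documents are invented for the example, and splitting on whitespace is an assumed tokenization.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "shipment of gold damaged in a fire",
    2: "delivery of silver arrived in a silver truck",
    3: "shipment of gold arrived in a truck",
}

doc_ids = sorted(docs)
terms = sorted({word for text in docs.values() for word in text.split()})

# matrix[term][j] is 1 if the term occurs in doc_ids[j], otherwise 0.
matrix = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in terms
}


def and_query(term_a, term_b):
    """Return ids of documents containing both terms (bitwise AND of two rows)."""
    empty_row = [0] * len(doc_ids)
    row_a = matrix.get(term_a, empty_row)
    row_b = matrix.get(term_b, empty_row)
    return [d for d, a, b in zip(doc_ids, row_a, row_b) if a & b]


both = and_query("silver", "truck")
print(both)
```

Each row stores one cell per document, even for documents that do not contain the term. With a large vocabulary and many documents, most of those cells are 0.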
Compact Version - Inverted Index¶
What is an index?¶
An index is a mechanism for locating a given term in a text.
Simply put, an index is a pointer that indicates where a value is located.
What is an inverted file?¶
An inverted file contains, for each term, a list of pointers to all occurrences of that term in the text.
Term Dictionary¶
The second column is the Term Dictionary, and the third column is the Posting List.
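x is the total frequency, that is, how many times the term appears in the text.
an1;an2 means that the term appears in document an1, where it occurs an2 times.
- When searching, we can start with the keyword that has the lowest frequency, which reduces the total number of lookups.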
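The layout and the lookup order can be sketched as follows. Each dictionary entry holds the total frequency x and a posting list of (an1, an2) pairs. The index contents are hand-written for illustration, and `and_query` is a hypothetical helper, not part of the original notes.

```python
# term -> (total frequency x, posting list of (document an1, count an2))
index = {
    "gold":   (2, [(1, 1), (3, 1)]),
    "of":     (3, [(1, 1), (2, 1), (3, 1)]),
    "silver": (2, [(2, 2)]),
    "truck":  (2, [(2, 1), (3, 1)]),
}


def and_query(index, terms):
    """Return ids of documents that contain every term in `terms`."""
    if not terms or any(term not in index for term in terms):
        return []
    # Process the least frequent term first so the candidate set starts small.
    ordered = sorted(terms, key=lambda term: index[term][0])
    candidates = {doc_id for doc_id, _ in index[ordered[0]][1]}
    for term in ordered[1:]:
        candidates &= {doc_id for doc_id, _ in index[term][1]}
        if not candidates:
            break  # no document can match any more
    return sorted(candidates)


result = and_query(index, ["of", "silver"])
print(result)
```

Here "silver" has the smaller total frequency, so the search begins with its single posting and only then checks "of".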
Index generator¶
The inverted file index is also referred to as the index generator.
Pseudocode¶
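One possible way to write an index generator in Python is shown here as a sketch, not as the course's reference pseudocode. It assumes documents arrive as (id, text) pairs in increasing id order, and that terms are lowercased words split on whitespace. The function name `generate_index` and the sample documents are hypothetical.

```python
def generate_index(documents):
    """Build a term -> posting list mapping; each posting is [doc_id, count]."""
    dictionary = {}
    for doc_id, text in documents:            # read a document
        for term in text.lower().split():     # read a term from the document
            # Insert the term into the dictionary if it is not there yet,
            # then fetch its posting list.
            postings = dictionary.setdefault(term, [])
            if postings and postings[-1][0] == doc_id:
                postings[-1][1] += 1           # term seen again in this document
            else:
                postings.append([doc_id, 1])   # first occurrence in this document
    return dictionary


# Hypothetical input documents.
documents = [(1, "silver truck"), (2, "gold truck"), (3, "silver silver")]
generated = generate_index(documents)
print(generated["silver"])
```

A real generator would also write the finished index to disk and would usually normalize terms further. Both steps are omitted here.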
Problem¶
1¶
D
Precision does not need to be computed at the end.