Table of Contents
- Abstract
- Efficient scoring and ranking
- Index Elimination
- Champion lists
- Static quality scores and ordering
- Cluster pruning
- Components of an information retrieval system
- Tiered indexes
- Query term proximity
- A Search engine
- Designing parsing
- Scoring function
Abstract
- Speed up the scoring algorithm by using heuristic methods, trading slightly less exact query-matching scores for lower time complexity.
- Start by considering the more general problem of computing scores in a search engine.
- Outline a complete search engine, including the indexes and structures that support not only cosine scoring but also more general ranking factors, and show how these pieces fit together.
- Examine how the vector space model for free text queries interacts with common query operators.
Efficient scoring and ranking
In our previous scoring and ranking methods, we had to compute the cosine score between the query and every document, and then extract the top K documents with the highest scores, which is very expensive.
query: q = jealous gossip
- The unit vector v(q) has only two nonzero components.
- In the absence of any weighting for query terms, these nonzero components are equal.
In fact, if there is an algorithm that only approximates the top K documents but whose cost is much smaller (it does not need to compute the scores of all documents), we usually prefer the latter:
General method: find, in advance, a candidate set A of documents (far smaller than the full collection) that is likely to contain most of the top-scoring documents, and compute the top K documents within A.
The formula: score(q, d) = v(q) · v(d), the dot product of the unit vectors of the query and the document.
In Manning's book this is called inexact top-K document retrieval, and it follows the two-step scheme below (a code sketch of the second step appears after the list).
- Find a set A of documents that are contenders, where K < |A| << N. A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.
- Return the K top-scoring documents in A.
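A minimal sketch of the second step in Python, assuming the query and documents are sparse unit vectors stored as {term: weight} dicts (the function names and data layout are illustrative, not the book's code):

```python
import heapq

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def inexact_top_k(query_vec, candidate_ids, doc_vecs, k):
    """Exact cosine scoring restricted to the contender set A.

    candidate_ids is the precomputed set A with K < |A| << N; doc_vecs maps
    a doc id to its unit vector v(d). A min-heap keeps the K best documents
    seen so far, so the whole of A never has to be sorted.
    """
    heap = []  # (score, doc_id); the root is the worst of the current top K
    for doc_id in candidate_ids:
        score = dot(query_vec, doc_vecs[doc_id])  # cosine score v(q) . v(d)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best-scoring documents first
```

The heuristics in the following sections differ only in how they construct the set A.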
Index Elimination
- Only traverse the postings of a query term when its IDF exceeds a preset threshold.
Terms with low IDF are usually stop words, and their postings lists are very long.
Skipping them greatly reduces the cost, and since they contribute little to the ranking, it is safe not to consider them.
- If, after this pruning, fewer than K candidate documents remain, a tiered index is a good way to solve the problem.
- Tiered index:
- the postings list is split into tiers.
For example, postings with TF above 20 go into the first tier, and postings with TF between 10 and 20 go into the second tier. To find the top K documents, search only the first tier; if it does not yield K documents, descend to the second tier to add more.
The tiered index is therefore a way to handle the case where fewer than K documents would otherwise be returned.
- Only consider documents that contain several (ideally all) of the query terms; a sketch combining both heuristics follows.
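A minimal sketch of index elimination as a candidate generator, assuming the inverted index maps each term to the set of doc ids in its postings list; idf_threshold and min_terms are illustrative tuning parameters, not values from the book:

```python
def index_elimination(query_terms, index, idf, idf_threshold, min_terms):
    """Build the contender set A using both elimination heuristics.

    index maps term -> set of doc ids; idf maps term -> IDF. Only query
    terms with high enough IDF are traversed, and a document becomes a
    contender only if it contains at least min_terms of those terms.
    """
    selective = [t for t in query_terms if idf.get(t, 0.0) >= idf_threshold]
    counts = {}  # doc_id -> number of selective query terms it contains
    for term in selective:
        for doc_id in index.get(term, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return {d for d, c in counts.items() if c >= min_terms}
```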
Champion lists
- Champion list:
for each term t, the R documents with the highest TF values in t's postings list are precomputed; they form the champion list of t.
Given a query q,
we only need to take the union of the champion lists of the terms in q. This union serves as the candidate set A of the general method above, and the cosine score is then computed only for the documents in A, as sketched below.
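A minimal sketch, assuming a TF-weighted index of the form term -> {doc_id: tf} (the names are illustrative); the resulting set A can be fed into the inexact_top_k sketch above:

```python
def build_champion_lists(index_tf, r):
    """Precompute, for every term, the r doc ids with the highest TF."""
    return {
        term: sorted(postings, key=postings.get, reverse=True)[:r]
        for term, postings in index_tf.items()
    }

def champion_candidates(query_terms, champions):
    """Candidate set A = union of the champion lists of the query terms."""
    a = set()
    for term in query_terms:
        a.update(champions.get(term, ()))
    return a
```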
Static quality scores and ordering
Each document has a static, query-independent quality score g(d), and the postings in the inverted index are arranged in descending order of g(d).
For example, if a postings list stores documents in the order 3, 2, 1, then g(3) ≥ g(2) ≥ g(1).
The final score is score(q, d) = g(d) + v(q) · v(d).
Note:
PageRank is a static quality score; it is computed from web link analysis.
- One further idea
For each term t, two postings lists are maintained:
- the high list (the M documents with the highest TF values for t)
- the low list (all remaining documents), with both lists sorted by g(d).
To retrieve the K documents with the highest scores:
first compute scores over the high lists; if they already yield the K top-scoring documents, the process finishes.
Otherwise, the remaining documents are taken from the low lists, as sketched below.
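A minimal sketch of this high/low traversal, assuming each list holds doc ids and score_fn computes the combined score g(d) + v(q) · v(d) (all names are illustrative):

```python
def top_k_two_tier(query_terms, high, low, score_fn, k):
    """Score the high lists first; fall back to the low lists if needed.

    high and low map term -> list of doc ids (each list kept in descending
    order of g(d)); score_fn(doc_id) returns g(d) + v(q) . v(d).
    """
    def ranked(tier):
        docs = set()
        for term in query_terms:
            docs.update(tier.get(term, ()))
        return sorted(docs, key=score_fn, reverse=True)

    results = ranked(high)[:k]
    if len(results) < k:  # the high tier alone did not supply K documents
        seen = set(results)
        extra = [d for d in ranked(low) if d not in seen]
        results += extra[:k - len(results)]
    return results
```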
Cluster pruning
Basic concepts:
- leader:
pick √N documents at random from the collection; call these leaders.
- follower:
For each document that is not a leader, we compute its nearest leader.
Query Method:
- Given the query q, first compute its cosine similarity with each leader to find the nearest leader.
- The candidate set A consists of that leader together with its followers, as sketched below.
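A minimal sketch of both phases, reusing the dot helper from the first sketch; vecs maps each doc id to its unit vector (the names are illustrative):

```python
import math
import random

def cluster_prune_preprocess(doc_ids, vecs):
    """Pick sqrt(N) random leaders and attach every other document to its
    nearest leader by cosine similarity (dot product of unit vectors)."""
    leaders = random.sample(doc_ids, int(math.sqrt(len(doc_ids))))
    followers = {leader: [] for leader in leaders}
    for d in doc_ids:
        if d not in followers:  # d is not itself a leader
            nearest = max(leaders, key=lambda l: dot(vecs[d], vecs[l]))
            followers[nearest].append(d)
    return leaders, followers

def cluster_prune_candidates(query_vec, leaders, followers, vecs):
    """Candidate set A = the nearest leader plus its followers."""
    nearest = max(leaders, key=lambda l: dot(query_vec, vecs[l]))
    return [nearest] + followers[nearest]
```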
Components of an information retrieval system
Tiered indexes
This was mentioned in the "Static quality scores and ordering" section.
With a two-tier index, first compute scores over the high tier; if that already yields the top-scoring documents, the process finishes.
Otherwise, the rest are taken from the low tier.
Query term proximity
We want the query terms to appear close to each other in a document: the closer together they are, the more relevant the document is likely to be to the query, and rewarding this proximity is part of good search engine quality.
- Basic concepts
Smallest window:
the smallest window in a document d that contains all the query terms, measured in the number of words in the window.
Soft conjunctive semantics:
the document does not need to contain all of the query terms; it only needs to contain most of them.
- Consider a query
Consider a query with two or more query terms, t1, t2, ..., tk. Let ω be the width of the smallest window in a document d that contains all the query terms, measured in the number of words in the window.
Intuitively, the smaller that ω is, the better that d matches the query. In cases where the document does not contain all of the query terms, we can set ω to be some enormous number.
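A minimal sketch of computing ω for a tokenized document (a standard one-pass scan, not code from the book):

```python
def smallest_window(doc_tokens, query_terms):
    """Width, in words, of the smallest window of doc_tokens that contains
    every query term; a huge sentinel if some term never occurs."""
    needed = set(query_terms)
    last_pos = {}            # most recent position of each query term
    best = float("inf")      # the "enormous number" from the text
    for i, token in enumerate(doc_tokens):
        if token in needed:
            last_pos[token] = i
            if len(last_pos) == len(needed):
                # the smallest window ending here starts at the oldest of
                # the most recent occurrences of the query terms
                best = min(best, i - min(last_pos.values()) + 1)
    return best
```

For example, smallest_window("the quick fox jumped over the lazy dog".split(), ["fox", "dog"]) returns 6.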
Note:
It may also be necessary to fold term proximity into the weights of the scoring function.
A Search engine
The indexer generates the various indexes, such as the parametric index, zone index, k-gram index, and tiered index.
The vector space model is different from the Boolean retrieval model:
the Boolean model only considers whether a term occurs in a document; it has no weights and does not consider how many times the term appears.
Designing parsing
Sometimes, this execution can entail multiple queries against the underlying indexes; for example, the query parser may issue a stream of queries: first the entire user query as a phrase query; if that retrieves fewer than K documents, phrase queries over pairs of successive query terms; and if there are still fewer than K, a vector space query over the individual query terms.
Each of these steps (if invoked) may yield a list of scored documents, which demands an aggregate scoring function to combine them; one possible aggregate is sketched below.
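One possible aggregate is a weighted linear combination of the per-step scores; the book leaves the choice of function to the setting, so the weighted sum below is an illustrative assumption:

```python
def aggregate_scores(scored_lists, weights):
    """Combine the scored lists produced by the query stream into a single
    ranking via a weighted sum (one possible aggregate, not the book's)."""
    combined = {}
    for scores, weight in zip(scored_lists, weights):
        for doc_id, s in scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```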
Scoring function
The answer depends on the setting, and the scoring function may need revisiting when the operational context changes, for example:
- the introduction of new document formats or document management systems, or a merger with another company.
It is increasingly common to use machine-learned scoring to address this, which the book discusses further in Chapter 15.
These are my own notes and summary, drawing on Manning's original book and the CSDN blogger:
- iteye_17686: (link)
Please forgive the inevitable mistakes in my English.
Hope we can all keep improving together~~