信息检索导论第七章笔记(英文)

   声明:本站部分内容来自互联网,如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。

8

文章目录

    • Abstract
    • Efficient scoring and ranking
      • Index Elimination
      • Champion lists
      • Static quality scores and ordering
      • Cluster pruning
    • Components of an information retrieval system
      • Tiered indexes
      • Query term proximity
    • A Search engine
    • Designing parsing
    • Scoring function

Abstract

  • Speed up scoring algorithm by using heuristics methods

Compromise in lower time complexity and better query matching scores.

  • Start consider more general problem of computing scores in a search engine

  • Outline a complete search engine, including indexes and structures to support not only cosine scoring, but also more general ranking factors. We will fit them together.

  • How the vector space model for free text queries interacts with common query operators.

Efficient scoring and ranking

In our previous scoring and sorting methods, we need to calculate the cosine scores of the query and each documents, and then we need to take out the top k documents with the highest score, which is very complicated.

query: q = jealous gossip

  • The unit vector v(q) has only two nonzero components.
  • In the absence of any weighting for query terms, these nonzero components are equal.

In fact, if there is an algorithm that can approximate the first k documents, but the complexity is very small (it does not need to calculate the scores of all documents), we usually use the latter algorithm:
General method: find document subset a (far smaller than the initial document set) in advance, including most candidate documents, and calculate the top k documents with the highest score in that.

The Formula:

So the fast cosine score algorithm:

In the Manning’s book, it is called the Inexact top K document retrieval, having the folling two-step scheme.

  1. Find a setAof documents that are contenders, where K < |A| << N. A does not necessarily contain the K top-scoring documents for the query,but is likely to have many documents with scores near those of the top K.
  2. Return the K top-scoring documents in A.

Index Elimination

  1. Only consider the posting when the IDF of the term exceeds the threshold.

Because the term with low IDF is usually stop words and the posting is very long.
It will greatly reduce the complexity if these are not calculated, so it is unnecessary to consider them.

  • In this case, if the number of documents exceeding the threshold does not exceed K, it would be a good idea to solve the problem with hierarchical index.
  • Hierarchical index:
  • the inverted record table is layered.
    For example, if the TF is more than 20, it is in the first level, and if the TF is more than 10 but lower than 20, it is in the second level. When you need to find the first k documents, you only need to search at the first level. If it doesn’t get enough K articles, we can go to the second level to add.

Therefore, hierarchical index is a way to solve the problem that fewer than k documents may be returned;

  1. Only documents with multiple query terms are considered.

Champion lists

  • Champion list

for the word term t, the R documents with the highest TF value of posting are extracted in advance, which is called the champion list.

Given a query Q
only need to find the union of the champion lists of each term in Q. this union is the document subset a in the above general method, and calculates the cosine similarity in a for each documents.

Static quality scores and ordering

Each document has a static score g(d) unrelated to the query, and the posting in the inverted index is arranged in descending order according to g(d).

Obviously, g(3) > g(2) > g(1)

The final score is score(Q, d) = g(d) + V (q) V (d).

Note:
PageRank is a static quality score, which is based on web link analysis;

  • One further idea

For term T, two tables are maintained:

  1. the high table (M documents with the highest TF value)
  2. the low table (the rest), which are sorted by g(d).

The method to get the highest score of K documents:
first calculate the score of the high table, and if it has been able to take out the document with the highest score in the high table, the process finishes.
otherwise, the rest will be taken from the low table.

Cluster pruning

basic concept:

  • leader:

Pick √N documents at random from the collection. Call theseleaders

  • follower:

For each document that is not a leader, we compute its nearest leader.

Query Method:

  1. Given the query Q, first calculate the cosine similarity score with each leader to find the nearest leader.
  2. Document subset a is the follower corresponding to leader + leader

Components of an information retrieval system

Tiered indexes

I mentioned in “Static quality scores and ordering” section.

If there is a two-level tiered indexes, first calculate the score of the high table, and if it has been able to take out the document with the highest score in the high table, the process finishes.
otherwise, the rest will be taken from the low table.

Query term proximity

We hope the query words to be close to each other in the document, so that the document and the query can be more relevant, which is called a good search engine performance.

  • basic concepts

Smallest window:
in a document d that contains all the query terms, measured in the number of words in the window.

Soft conjunctive:
the document does not need to contain all the query terms, only need to include most of the query terms.

  • Consider a query

Consider a query with two or more query terms,t1,t2,…,tk. Let ω be the width of the smallest window in a documentdthat contains all the query terms, measured in the number of words in the window.

Intuitively,the smaller that ω is, the better that d matches the query. In cases where the document does not contain all of the query terms, we can set ω to be some enormous number.

Note:
Probably, it needs to add term similarity to the weights.

A Search engine

Indexer is used to generate various indexes, such as parametric index, domain index, k-gram index and hierarchical index.

The vector space model is different from the Boolean retrieval model.

The Boolean model only considers whether the term exists in the document, but does not consider how many times it appears without weights.

Designing parsing

Sometimes, this executioncan entail multiple queries against the underlying indexes; for example, the query parser may issue a stream of queries:

Each of these steps (if invoked) may yield a list of scored documents, this demands an aggregate scoring function.

Scoring function

The answer depends on the setting.

Such as

  1. introduction of new document formats or document management systems, or a merger with another company.
  2. incresingly commonly to use machine-learned scoring to address this, which will be discussed further in Section 15.
事件概率(%)时间边界
Speech0.25657
Gunfire0.11153
Car2.69401
Crying0.00156
Fire75.878980.0-10.0
Fireworks0.77396
Glass0.10306
Screaming0.04901
Explosion36.809330.0-2.72, 8.34-10.0

这里作为自己的笔记和总结,借鉴了manning原书还有csdn博主:

  • iteye_17686:(传送门)

英语难免有些错误大家见谅。
大家共勉~~