More on ranking, a=chris

Chris Pollett [2013-04-24]
Filename: en-US/pages/ranking.thtml
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 046a713..771ac99 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -629,20 +629,151 @@ a query comes into Yioop it goes through the following stages before an actual
 look up is performed against an index.
 </p>
 <ol>
-<li>Control words are calculated.</li>
-<li>An attempt is made to guess the semantics of the query.</li>
+<li>Control words are calculated. Control words are terms such as
+m: or i: terms, which select the mix or index to use.
+They also include commands such as raw:, which says what level of
+grouping to use, and no: commands, which turn off a standard
+processing technique. For example, no:guess affects whether the
+next processing step is done; no:network is another such command.
+For the remainder, we will assume the query does not contain
+control words.</li>
+<li>An attempt is made to guess the semantics of the query. This
+matches keywords in the query and rewrites them to other query terms.
+For example, a query term which is in the form of a domain name will
+be rewritten to the meta word form site:domain, so the query will
+return only pages from that domain. As another example, a search
+on just "D" will be rewritten to "letter D". Currently, this
+processing is in a nascent stage. (A sketch of these pre-lookup
+steps appears after this list.)</li>
 <li>Stemming or character n-gramming is done on the query and acronyms
-and abbreviations are rewritten.</li>
+and abbreviations are rewritten. This is the same kind of processing
+that we applied when extracting terms from summaries.</li>
 </ol>
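+<p>The following sketch, in Python rather than Yioop's actual code,
+illustrates roughly how these pre-lookup steps fit together. The
+control-word list, domain pattern, and rewrite rules shown are
+assumptions for illustration only.</p>
+<pre>
+# A minimal sketch (not Yioop's actual code) of the pre-lookup query
+# rewriting steps above.  The control words, patterns, and rewrite
+# rules here are illustrative assumptions.
+import re
+
+CONTROL_PREFIXES = ("m:", "i:", "raw:", "no:")
+
+def split_control_words(query):
+    """Separate control-word terms from ordinary search terms."""
+    controls, terms = [], []
+    for term in query.split():
+        if term.startswith(CONTROL_PREFIXES):
+            controls.append(term)
+        else:
+            terms.append(term)
+    return controls, terms
+
+def guess_semantics(terms):
+    """Rewrite terms whose intent we can guess, e.g. bare domains."""
+    rewritten = []
+    for term in terms:
+        if re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+)+", term):
+            rewritten.append("site:" + term)        # domain name becomes a meta word
+        elif len(term) == 1 and term.isalpha():
+            rewritten.extend(["letter", term])       # a bare "D" becomes "letter D"
+        else:
+            rewritten.append(term)
+    return rewritten
+
+controls, terms = split_control_words("no:guess yioop.com ranking")
+if "no:guess" not in controls:                       # control word can turn off guessing
+    terms = guess_semantics(terms)
+</pre>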
+<p>After going through the above steps, Yioop builds an iterator
+object from the resulting terms to iterate over summaries and link
+entries that contain all of the terms. In the single queue server
+setting, one iterator is built for each term and these iterators
+are added to an intersect iterator that returns only documents
+in which all the terms appear. This iterator is then fed into
+a grouping iterator, which groups links and summaries that refer
+to the same document. Recall that after downloading pages on the
+fetcher, we calculated a hash of each downloaded page with its tags
+removed. Documents with the same hash are also grouped together by
+the group iterator. The value `n=200` posting list entries that
+Yioop scans on a query, referred to in the introduction, is actually
+the number of results the group iterator requests before grouping.
+This number can be controlled from the Yioop admin pages under
+Page Options &gt; Search Time &gt; Minimum Results to Group. The
+number 200 was chosen because, on a single machine, it was found to
+give decent results without queries taking too long.
+</p>
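+<p>As a rough illustration of this pipeline, here is a toy Python
+version. The class names and in-memory posting lists are assumptions;
+Yioop's real iterators work over posting lists stored in index
+shards.</p>
+<pre>
+# Toy sketch of the iterator pipeline described above.  All class
+# names are hypothetical and posting lists are faked in memory.
+class WordIterator:
+    def __init__(self, postings):        # postings: sorted list of doc ids
+        self.postings = postings
+
+class IntersectIterator:
+    """Yields only documents containing every query term."""
+    def __init__(self, word_iterators):
+        self.word_iterators = word_iterators
+
+    def __iter__(self):
+        common = set(self.word_iterators[0].postings)
+        for it in self.word_iterators[1:]:
+            common = common.intersection(it.postings)
+        return iter(sorted(common))
+
+class GroupIterator:
+    """Requests up to min_results_to_group items (n = 200 by default)
+    and groups links/summaries by the url (or page hash) they refer to."""
+    def __init__(self, source, doc_info, min_results_to_group=200):
+        self.source = source
+        self.doc_info = doc_info          # doc id mapped to (url, page hash)
+        self.n = min_results_to_group
+
+    def groups(self):
+        grouped = {}
+        for count, doc in enumerate(self.source):
+            if count >= self.n:           # only the first n results are grouped
+                break
+            url, page_hash = self.doc_info[doc]
+            grouped.setdefault(page_hash, []).append((url, doc))
+        return grouped
+</pre>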
+<p>In the multiple queue server setting, the query comes in to
+the name server and a network iterator is built which poses the
+query to each queue server. If `n=200`, the name server
+multiplies this value by the setting Page Options &gt;
+Search Time &gt; Server Alpha, which we'll denote `alpha`. This
+defaults to 1.6, so the total is 320. It then divides this total by
+the number of queue servers. So if there were 4 queue servers, one
+would have 80. It then requests the first 80 results for the query
+from each queue server. The queue servers don't do grouping, but
+just send the results of their intersect iterators to the name
+server, which does the grouping.</p>
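+<p>As a quick check of the arithmetic with the defaults just
+mentioned:</p>
+<pre>
+# Worked example of the name server's per-queue-server request size,
+# using the defaults mentioned above.
+min_results_to_group = 200    # n, from Page Options
+server_alpha = 1.6            # alpha, defaults to 1.6
+num_queue_servers = 4
+
+per_server = min_results_to_group * server_alpha / num_queue_servers
+print(per_server)             # 200 * 1.6 = 320; 320 / 4 = 80.0 per queue server
+</pre>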
+<p>In both the networked and non-networked case, after the grouping
+phase, Doc Rank, Relevance, and Proximity scores for each of the
+grouped results have been determined. We then combine these three
+scores into a single score using the reciprocal rank fusion technique
+described in the introduction. Results are then sorted in descending
+order of score and output. What is left to describe is how the scores
+are calculated in the various iterators mentioned above.</p>
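+<p>For concreteness, here is a sketch of reciprocal rank fusion in
+Python. The constant `k=60` and the equal weighting of the three
+rankings are the textbook defaults, used here only to show the shape
+of the computation, not necessarily the exact values Yioop uses.</p>
+<pre>
+# Minimal sketch of reciprocal rank fusion over the three score types.
+# The constant k and equal weighting are common defaults in the RRF
+# literature, not necessarily what Yioop uses.
+def reciprocal_rank_fusion(rankings, k=60):
+    """rankings: list of dicts mapping an item to its rank (1 = best)
+    under one scoring method (Doc Rank, Relevance, or Proximity)."""
+    fused = {}
+    for ranking in rankings:
+        for item, rank in ranking.items():
+            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
+    return sorted(fused, key=fused.get, reverse=True)
+
+doc_rank  = {"a": 1, "b": 2, "c": 3}
+relevance = {"b": 1, "a": 2, "c": 3}
+proximity = {"a": 1, "c": 2, "b": 3}
+print(reciprocal_rank_fusion([doc_rank, relevance, proximity]))  # ['a', 'b', 'c']
+</pre>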
+<p>To fix an example to describe this process, suppose we have a
+group `G'` of items `i_j'`, either pages or links, that all refer to
+the same url. A page in this group means that at some point we
+downloaded the url and extracted a summary. It is possible for there
+to be multiple pages in a group because we might re-crawl a page. If
+we have another group `G''` of items `i_k''` of this kind such that
+the hash of its most recent page matches that of `G'`, then the two
+groups are merged. While we are grouping, we compute a temporary
+overall score for a group. The temporary score is used to determine
+which page's (or link's, if no pages are present) summary in a group
+should be used as the source of the url, title, and snippets. Let `G`
+be the group one gets by performing this process after all groups
+with the same hash as `G'` have been merged. We now describe how the
+individual items in `G` have their scores computed, and finally, how
+these scores are combined.
+</p>
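+<p>A simplified sketch of this grouping and hash-based merging is
+given below. The item fields and the way the temporary score selects
+a representative summary are illustrative assumptions.</p>
+<pre>
+# Sketch of the grouping/merging step.  The item fields and the
+# temporary score used to pick a representative are simplified
+# stand-ins for what Yioop actually tracks.
+def group_items(items):
+    """items: dicts with keys url, kind ('page' or 'link'), hash,
+    temp_score, summary; assumed to be in crawl order."""
+    by_url = {}
+    for item in items:
+        by_url.setdefault(item["url"], []).append(item)
+
+    # Merge url-groups whose most recent page has the same hash,
+    # e.g. identical content served from two different urls.
+    by_hash = {}
+    for url, group in by_url.items():
+        pages = [i for i in group if i["kind"] == "page"]
+        key = pages[-1]["hash"] if pages else url
+        by_hash.setdefault(key, []).extend(group)
+
+    # Within each merged group, the page (or link, if no page was
+    # downloaded) with the highest temporary score supplies the url,
+    # title, and snippet shown for the group.
+    representatives = []
+    for group in by_hash.values():
+        pages = [i for i in group if i["kind"] == "page"] or group
+        representatives.append(max(pages, key=lambda i: i["temp_score"]))
+    return representatives
+</pre>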
+<p>The Doc Rank of an item is calculated according to the formula mentioned
+in the <a href="#queue-servers">queue servers subsection</a>:</p>
+<p>
+`mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
+        mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+`\qquad  = (mbox(DOC_INDEX) + 1) + 25 times
+        mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+</p>
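+<p>For example, plugging values into this formula, where
+`mbox(AVG_LINKS_PER_PAGE) = 24` is implied by the 25 in the second
+line above and the other values are made up for illustration:</p>
+<pre>
+# Worked instance of the RANK formula with made-up values.
+AVG_LINKS_PER_PAGE = 24           # so (AVG_LINKS_PER_PAGE + 1) = 25
+NUM_DOCS_PER_GENERATION = 50000   # made-up value for this example
+generation = 2
+doc_index = 9
+
+rank = (doc_index + 1) + (AVG_LINKS_PER_PAGE + 1) * NUM_DOCS_PER_GENERATION * generation
+print(rank)                       # 10 + 25 * 50000 * 2 = 2500010
+</pre>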
+<p>To compute the relevance of an item, we use a variant of
+BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]. Suppose a query `q` is a set
+of terms `t`. View an item `d` as a bag of terms, let `f_(t,d)` denote
+the frequency of the term `t` in `d`, let `N` denote the total number
+of items in the index, let `N_t` denote the number of items
+containing `t` in the whole index (not just the group), let `l_d` denote
+the length of `d`, where length is the number of terms `d` contains
+counting repeats, and let `l_(avg)` denote the average length of an
+item in the index. The basic BM25 formula is:</p>
+<p>
+`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where<br />
+`IDF(t) = log(frac(N)(N_t))`, and<br />
+`TF_(BM25)(t,d) =
+frac(f_(t,d) cdot (k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
+</p>
+<p>`IDF(t)`, the inverse document frequency of `t`, in the above can be
+thought of as a measure of how much signal is provided by knowing that
+the term `t` appears in the document. For example, its value is zero if
+`t` is in every document, whereas the rarer the term is, the larger the
+value of `IDF(t)`. `TF_(BM25)` represents a normalized term frequency
+for `t`. Here `k_1 = 1.2` and `b=0.75` are tuned parameters which are
+set to values commonly used in the literature. The term frequency is
+normalized to prevent bias toward longer documents. Also, if one spams
+a document by filling it with the term `t`, we have
+`lim_(f_(t,d) -> infty) TF_(BM25)(t,d) = k_1 +1`, which limits
+the ability to push the document score higher.
+</p>
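+<p>A direct transcription of this basic BM25 formula into Python, for
+reference (this is not Yioop's implementation):</p>
+<pre>
+# Sketch of the basic BM25 score defined above (not Yioop's code).
+import math
+
+def bm25_score(query_terms, doc_terms, num_docs, docs_containing, avg_len,
+               k1=1.2, b=0.75):
+    """query_terms: list of terms; doc_terms: the item as a bag of terms;
+    num_docs: N; docs_containing[t]: N_t; avg_len: l_avg."""
+    l_d = len(doc_terms)
+    score = 0.0
+    for t in set(query_terms):
+        f_td = doc_terms.count(t)
+        if f_td == 0 or docs_containing.get(t, 0) == 0:
+            continue
+        idf = math.log(num_docs / docs_containing[t])
+        tf = (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * l_d / avg_len))
+        score += idf * tf
+    return score
+</pre>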
+<p>Yioop computes a variant of BM25F, not BM25. This formula also
+needs values for quantities like `l_(avg)`, `N`, and `N_t`. To keep
+the computation simple, at the loss of some accuracy, when Yioop needs
+these values it uses the statistics of the particular index shard
+containing `d` as a stand-in. BM25F is essentially the same as BM25
+except that it separates a document into components, computes the BM25
+score of the document with respect to each component, and then takes a
+weighted sum of these scores. In the case of Yioop, if the item is a
+page the two components are an ad hoc title and a description. Recall
+that when making our position lists for a term in a document we
+concatenated url keywords, followed by the title, followed by the
+summary, so the first terms in the result tend to come from the title.
+We take the first AD_HOC_TITLE_LEN many terms of a document to be its
+ad hoc title. We calculate a BM25 score for a query term being in the
+ad hoc title of an item, multiply this by 2, compute a BM25 score for
+the term being in the rest of the summary, and add the two results.
+Link items are not separated into two components, but their BM25
+scores can be weighted differently than those of a page (currently,
+though, the link weight is set to 1 by default). These three weights
+(title weight, description weight, and link weight) can be set in
+Page Options &gt; Search Time &gt; Search Rank Factors.
+</p>
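+<p>Continuing the previous sketch, a BM25F-style weighted combination
+for page and link items might look as follows. The value used for
+AD_HOC_TITLE_LEN and the statistics passed in are assumptions.</p>
+<pre>
+# Sketch of a Yioop-style BM25F as a weighted sum of per-component
+# BM25 scores, reusing bm25_score() from the previous sketch.  The
+# AD_HOC_TITLE_LEN value and the stats argument are assumptions; the
+# title factor of 2 and link weight of 1 follow the text above, and
+# the description weight is assumed to be 1.
+AD_HOC_TITLE_LEN = 10
+TITLE_WEIGHT, DESCRIPTION_WEIGHT, LINK_WEIGHT = 2, 1, 1
+
+def bm25f_page(query_terms, summary_terms, stats):
+    title = summary_terms[:AD_HOC_TITLE_LEN]     # ad hoc title component
+    rest = summary_terms[AD_HOC_TITLE_LEN:]      # rest of the summary
+    return (TITLE_WEIGHT * bm25_score(query_terms, title, **stats) +
+            DESCRIPTION_WEIGHT * bm25_score(query_terms, rest, **stats))
+
+def bm25f_link(query_terms, link_terms, stats):
+    return LINK_WEIGHT * bm25_score(query_terms, link_terms, **stats)
+</pre>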
+<p>To compute the proximity score of an item `d` with respect to
+a query `q` with more than one term, we use the notion of a <b>cover</b>.
+A cover is an interval `[u_i, v_i]` of positions within `d` which
+contains all the terms in `q` and such that no smaller interval
+contains all the terms. Given `d` we can calculate a proximity score
+as the sum of the inverses of the sizes of the covers:</p>
+<p class='center'>
+`mbox(score)(d) = sum_i frac(1)(v_i - u_i + 1)`.
+</p>
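+<p>One generic way to enumerate covers and total their inverse
+lengths is sketched below as a minimal-window scan over the item's
+term list. Yioop works from position lists instead, but any correct
+enumeration of the covers gives the same sum.</p>
+<pre>
+# Sketch of enumerating covers and summing their inverse lengths.
+# This is a generic minimal-window scan, shown only to make the
+# definition concrete; it is not Yioop's implementation.
+from collections import Counter
+
+def proximity_score(query_terms, doc_terms):
+    needed = set(query_terms)
+    have = Counter()                 # counts of needed terms in the window
+    score, left = 0.0, 0
+    for right, term in enumerate(doc_terms):
+        if term in needed:
+            have[term] += 1
+        if len(have) == len(needed):           # window holds every term
+            # shrink from the left until the window is minimal
+            while doc_terms[left] not in needed or have[doc_terms[left]] > 1:
+                if doc_terms[left] in needed:
+                    have[doc_terms[left]] -= 1
+                left += 1
+            score += 1.0 / (right - left + 1)  # [left, right] is a cover
+            del have[doc_terms[left]]          # step past it and keep scanning
+            left += 1
+    return score
+
+print(proximity_score(["yioop", "ranking"],
+                      ["yioop", "is", "a", "ranking", "yioop", "ranking"]))
+</pre>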
+<p>For a page item, Yioop calculates separate proximity scores with
+respect to its ad hoc title and the rest of its summary. It then adds
+them using the same weights as were used for the BM25F relevance
+score. Similarly, link item proximities are multiplied by a weight
+factor.
+</p>
+<p>Now that we have described how to compute Doc Rank, Relevance, and
+Proximity for each item in a group, we describe how to get these three
+values for the whole group.
+</p>

-Network Versus non network queries
-Grouping (links and documents) deduplication
-Conjunctive queries
-Scores BM25F, proximity, document rank
-
-News articles, Images, Videos
-
-How related queries work
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id="references">References</h2>
     <dl>
@@ -691,6 +822,7 @@ March, 2008.
 Proceedings of the 10th international conference on World Wide Web.
 pp 114--118. 2001.
 </dd>
+
 <dt id="VLZ2012">[VLZ2012]</dt>
 <dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
 <a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"
@@ -698,6 +830,14 @@ pp 114--118. 2001.
 21st ACM International Conference on Information and Knowledge Management.
 pp. 843-851. 2012.
 </dd>
+
+<dt id="ZCTSR2004">[ZCTSR2004]</dt>
+<dd>Hugo Zaragoza, Nick Craswell, Michael Taylor,
+Suchi Saria, and Stephen Robertson.
+<a
+href="http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf"
+>Microsoft Cambridge at TREC-13: Web and HARD tracks</a>.
+In Proceedings of the 13th Text Retrieval Conference. 2004.</dd>
 </dl>

     <p><a href="#toc">Return to table of contents</a>.</p>