Chris Pollett [2013-04-26]
Still more on ranking documentation, a=chris
Filename
en-US/pages/ranking.thtml
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 771ac99..94a8e0b 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -50,29 +50,30 @@
     `n` documents using these three rankings and the so-called
     <b>reciprocal rank fusion  (RRF)</b>:</p>
 <p class="center">
-`\R\R\F(d) := 200(frac{1}{59 + mbox(Rank)_(DR)(d)} +
-frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
+`mbox(RRF)(d) := 200(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} +
+frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 +
+    mbox(Rank)_(mbox(Prox))(d)})`
 </p><p>
     This formula essentially comes from Cormack et al.
     [<a href="#CCB2009">CCB2009</a>]. They do not
-    use the factor `200` and use `60` rather than `59`. `\R\R\F(d)` is known
+    use the factor `200` and use `60` rather than `59`. `mbox(RRF)(d)` is known
     to do a decent job of combining scores, although there are some
     recent techniques such as LambdaRank [<a href="#VLZ2012">VLZ2012</a>],
     which do significantly better at the
     expense of being harder to compute. To return results,
     Yioop computes the top ten of
-    these `n` documents with respect to `\R\R\F(d)` and returns these
+    these `n` documents with respect to `mbox(RRF)(d)` and returns these
     documents.</p>
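+    <p>As a quick illustration of this formula (a sketch only, not Yioop's
+    actual code), the following Python snippet computes the fused score,
+    assuming the three rank positions of a document are already known:</p>
+<pre>
+def rrf(rank_dr, rank_rel, rank_prox):
+    """Reciprocal rank fusion as above: 200 times the sum of 1/(59 + rank)."""
+    return 200 * sum(1.0 / (59 + r) for r in (rank_dr, rank_rel, rank_prox))
+
+print(rrf(1, 1, 1))        # 10.0, a document ranked first on all three scores
+print(rrf(200, 200, 200))  # about 2.32, a document ranked 200th on all three
+</pre>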
-    <p> To get a feeling for how the `\R\R\F(d)` formula works, consider some
+    <p> To get a feeling for how the `mbox(RRF)(d)` formula works, consider some
     particular example situations:
     If a document ranked 1 with respect to each score, then
-    `\R\R\F(d) = 200(3/(59+1)) = 10`.  If a document
-    ranked n for each score, then `\R\R\F(d) = 200(3/(59+n)) = 600/(59 + n)`.
+    `mbox(RRF)(d) = 200(3/(59+1)) = 10`.  If a document
+    ranked `n` for each score, then `mbox(RRF)(d) = 200(3/(59+n)) = 600/(59 + n)`.
     As `n -> infty` this goes to 0. A value `n = 200` is often used with
     Yioop. For this `n`, `600/(59 + n) approx 2.32`.
     If a document
     ranked 1 on one of the three scores, but ranked `n` on the other two,
-    `\R\R\F(d) = 200/60 + 400/(59 +n) approx 3.33 + 400/(59 + n)`. The last
+    `mbox(RRF)(d) = 200/60 + 400/(59 + n) approx 3.33 + 400/(59 + n)`. The last
     term again goes to 0 as `n` gets larger, giving a maximum
     score of `3.33`. For the `n=200` case, one gets a score of `4.88`.
     So because the three component scores are converted to ranks,
@@ -429,19 +430,24 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     <a href="http://www.sitemaps.org/">sitemaps</a>. Emperically,
     it has been observed that a typical index shard has offsets for around
     24 times as many links summary maps as document summary maps. So
-    roughly, if a newly added summary or link has index <i>DOC_INDEX</i>
-    in the active shard, and the active shard is the GENERATIONth shard,
+    roughly, if a newly added summary or link `d` has index <i>DOC_INDEX(d)</i>
+    in the active shard, and the active shard is the GENERATION(d)-th shard,
+    the newly added object will have
     </p>
+<blockquote>
     <p>
-    `mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
-            mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION))`
-    `\qquad  = (mbox(DOC_INDEX) + 1) + 25 times
-            mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION))`
+    \begin{eqnarray}
+    \mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) +
+            (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\
+            &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+    &=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times
+            \mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)
+    \end{eqnarray}
     </p>
+</blockquote>
     <p>To make this a score out of 10, we can use logarithms:</p>
-    <p>`mbox(DOC_RANK) = 10 - log_(10)(mbox(RANK)).`</p>
-    <p>This gives us a Doc Rank for one link or summary item stored
+    <p class='center'>`mbox(DR)(d) = 10 - log_(10)(mbox(RANK)(d)).`</p>
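+    <p>For instance, if `mbox(NUM_DOCS_PER_GENERATION)` were `50000` (a value
+    chosen here purely for illustration), a summary with
+    `mbox(DOC_INDEX)(d) = 9` in generation `2` would have
+    `mbox(RANK)(d) = 10 + 25 times 50000 times 2 = 2500010`, and so
+    `mbox(DR)(d) approx 10 - 6.4 = 3.6`.</p>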
+    <p>Here `mbox(DR)(d)` is the Doc Rank for one link or summary item stored
     in a Yioop index. However, as we will see, this does not give us the
     complete value of Doc Rank for an item when computed at query time.
     There are also some things to note about this formula:</p>
@@ -457,6 +463,12 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
     to index 10 billion items using Yioop you would probably want
     multiple queue servers, Doc Ranks likely remain positive for larger
     indexes.</li>
+    <li>If we imagine that Yioop indexed the web as a balanced 25-ary tree
+    starting from some seed node, with RANK labeling the nodes of the tree
+    level-wise, then
+    `log_(25)(mbox(RANK)(d)) = (log_(10)(mbox(RANK)(d)))/(log_(10)(25))`
+    would be an estimate of the depth of a node in this tree. So Doc Rank
+    can be viewed as an estimate of how far away from the root a document is,
+    with 10 being the score at the root.</li>
     <li>Doc Rank is computed by different queue servers independently of each
     other for the same index. So it is possible for two summaries to
     have the same Doc Rank in the same index if they are stored on different
@@ -699,14 +711,20 @@ group one gets performing this process after all groups with the same hash
 as `G'` have been merged. We now describe how the individual items in `G`
 have their score computed, and finally, how these scores are combined.
 </p>
-<p>The Doc Rank of an item is calculated according to the formula mentioned
-in the <a href="#queue-servers">queue servers subsection</a>:</p>
+<p>The Doc Rank of an item `d`, `mbox(DR)(d)`, is calculated according to the
+formula mentioned in the <a href="#queue-servers">queue servers subsection</a>:</p>
+<blockquote>
 <p>
-`mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
-        mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION))`
-`\qquad  = (mbox(DOC_INDEX) + 1) + 25 times
-        mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION))`
+    \begin{eqnarray}
+    \mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) +
+            (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\
+            &&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+    &=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times
+            \mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+    \mbox{DR}(d) &=& 10 - \log_{10}(\mbox{RANK}(d))
+    \end{eqnarray}
 </p>
+</blockquote>
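+<p>A minimal Python sketch of these two formulas (for illustration only; the
+constant names mirror the ones above, with the average links per page fixed
+at 24 so that the multiplier is 25):</p>
+<pre>
+import math
+
+AVG_LINKS_PER_PAGE = 24
+
+def rank(doc_index, generation, num_docs_per_generation):
+    """Overall position of an item among everything stored so far."""
+    return ((doc_index + 1) +
+            (AVG_LINKS_PER_PAGE + 1) * num_docs_per_generation * generation)
+
+def doc_rank(doc_index, generation, num_docs_per_generation):
+    """Doc Rank: 10 minus the base 10 logarithm of the item's rank."""
+    return 10 - math.log10(rank(doc_index, generation, num_docs_per_generation))
+</pre>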
 <p>To compute the relevance of an item, we use a variant of
 BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]. Suppose a query `q` is a set
 of terms `t`. View an item `d` as a bag of terms, let `f_(t,d)` denote
@@ -716,22 +734,26 @@ the length of `d`, where length is
 the number of terms including repeats it contains, and
 let `l_(avg)` denote the average length of an item in the index. The basic
 BM25 formula is:</p>
+<blockquote>
 <p>
-`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where<br />
-`IDF(t) = log(frac(N)(N_t))`, and<br />
-`TF_(BM25) =
+`mbox(Score)_(mbox(BM25))(q, d) = sum_(t in q) mbox(IDF)(t)
+cdot mbox(TF)_(mbox(BM25))(t,d)`, where<br />
+`mbox(IDF)(t) = log(frac(N)(N_t))`, and<br />
+`mbox(TF)_(mbox(BM25))(t,d) =
 frac(f_(t,d)\cdot(k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
 </p>
-<p>`IDF(t)`, the inverse document frequency of `t`, in the above can be
+</blockquote>
+<p>`mbox(IDF)(t)`, the inverse document frequency of `t`, in the above can be
 thought of as a measure of how much signal is provided by knowing that the term `t`
 appears in the document. For example, its value is zero if `t` is in every
-document; whereas the more rare the term is the larger than value of `IDF(t)`.
-`TF_(BM25)` represents a normalized term frequency for `t`. Here `k_1 = 1.2`
-and `b=0.75` are tuned parameters which are set to values commonly used
-in the literature. It is normalized to prevent bias toward longer
+document; whereas the rarer the term is, the larger the value of
+`mbox(IDF)(t)`.
+`mbox(TF)_(mbox(BM25))` represents a normalized term frequency for `t`.
+Here `k_1 = 1.2` and `b=0.75` are tuned parameters which are set to values
+commonly used in the literature. It is normalized to prevent bias toward longer
 documents. Also, if one spams a document by filling it with the term `t`,
-we have `lim_(f_(t,d) -> infty) TF_(BM25)(t,d) = k_1 +1`, which limits
-the ability to push the document score larger.
+we have `lim_(f_(t,d) -> infty) mbox(TF)_(mbox(BM25))(t,d) = k_1 +1`, which
+limits the ability to inflate the document score.
 </p>
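+<p>For concreteness, here is a small Python sketch of the plain BM25 formula
+above (not Yioop's implementation; the counts `N`, `N_t`, term frequencies,
+and lengths are assumed to be supplied by the caller):</p>
+<pre>
+import math
+
+K1 = 1.2
+B = 0.75
+
+def idf(n_docs, n_docs_with_term):
+    """Inverse document frequency: log(N / N_t)."""
+    return math.log(n_docs / n_docs_with_term)
+
+def tf_bm25(freq, doc_len, avg_len):
+    """Normalized term frequency; tends to K1 + 1 as freq grows."""
+    return (freq * (K1 + 1)) / (freq + K1 * ((1 - B) + B * doc_len / avg_len))
+
+def bm25(query_terms, doc_freqs, doc_len, avg_len, n_docs, docs_with_term):
+    """Sum IDF(t) * TF_BM25(t, d) over the query terms t present in the index."""
+    return sum(idf(n_docs, docs_with_term[t]) *
+               tf_bm25(doc_freqs.get(t, 0), doc_len, avg_len)
+               for t in query_terms if t in docs_with_term)
+</pre>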
 <p>Yioop computes a variant of BM25F, not BM25. This formula also
 needs to have values for things like `l_(avg)`, `N`, `N_t`. To keep the
@@ -741,39 +763,99 @@ a stand-in. BM25F is essentially the same as BM25 except that it separates
 a document into components, computes the BM25 score of the document with
 respect to each component and then takes a weighted sum of these scores.
 In the case of Yioop, if the item is a page, the two components
-are an ad hoc title and a description. Recall when making our positions
+are an ad hoc title and a description. Recall that when making our position
 lists for a term in a document, we concatenated url keywords,
 followed by title, followed by summary. So the first terms in the resulting
 list will tend to be from the title. We take the first AD_HOC_TITLE_LEN many terms
 from a document to be in the ad hoc title. We calculate an ad hoc title
 BM25 score for a term from a query being in the ad hoc title of an item.
 We multiply this by 2 and then compute a BM25 score of the term being in
-the rest of the summary. We add the two results. For link items we don't
-separate them into two component but can weight the BM25 score different
+the rest of the summary. We add the two results. That is,</p>
+<p class='center'>
+`mbox(Rel)(q, d) = 2 times mbox(Score)_(mbox(BM25-Title))(q, d) +
+    mbox(Score)_(mbox(BM25-Description))(q, d)`</p>
+<p>
+This score would be the relevance for a single summary item `d` with respect
+to `q`. For link items we don't
+separate into title and description, but we can weight the BM25 score differently
 than for a page (currently, though, the link weight is set to 1 by default).
 These three weights (title weight, description weight, and link weight) can
 be set in Page Options &gt; Search Time &gt; Search Rank Factors.
 </p>
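+<p>The weighted combination itself is straightforward. As a sketch (the
+weights below are just the defaults described above, not values read from
+Yioop's configuration):</p>
+<pre>
+TITLE_WEIGHT = 2.0
+DESCRIPTION_WEIGHT = 1.0
+LINK_WEIGHT = 1.0
+
+def page_relevance(title_bm25, description_bm25):
+    """Relevance of a page item: weighted sum of its two component BM25 scores."""
+    return TITLE_WEIGHT * title_bm25 + DESCRIPTION_WEIGHT * description_bm25
+
+def link_relevance(link_bm25):
+    """Relevance of a link item: its BM25 score times the link weight."""
+    return LINK_WEIGHT * link_bm25
+</pre>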
 <p>To compute the proximity score of an item `d` with respect to
-a query `q` with more than one term. we use the notion of a <b>cover</b>.
-A cover is an interval `[u_i, v_i]` of positions within `d` which contain
-all the terms in `q` such that no smaller interval contains all the
-terms. Given `d` we can calculate a proximity score as a sum of
-the inverse of the sizes of the covers:</p>
+a query `q` with more than one term, we use the notion of a <b>span</b>.
+A span is an interval `[u_i, v_i]` of positions within `d` which contains
+all the terms (including repeats) in `q` and such that no smaller interval
+contains all the terms. Given `d`, we can calculate a proximity score as a sum
+of the inverses of the sizes of the spans:
 <p class='center'>
-`mbox(score)(d) = sum(frac(1)(v_i - u_i + 1))`.
+`mbox(pscore)(d) = sum_(i)(frac(1)(v_i - u_i + 1))`.
 </p>
-<p>For a page item, Yioop calculates separate proximity scores with
+<p>This formula comes from Clarke et al. [<a href="#CCT2000">CCT2000</a>],
+except that they use covers, which ignore repeats. It is the starting point of
+our proximity calculation. For a page item, Yioop calculates separate pscores with
 respect to its ad hoc title and the rest of a summary. It then adds
 them with the same weights as were used for the BM25F relevance score.
-Similarly, link item proximities also have a weight factor multiplied against
-them.
+Similarly, link item pscores also have a weight factor multiplied against
+them. Finally, Yioop normalizes the pscore calculated
+with these weights by item length to get:</p>
+<p class='center'>
+`mbox(Prox)(d) = (100 times mbox(weighted-pscore)(d))/l_d`.
 </p>
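+<p>As a rough illustration of this computation, here is a Python sketch that
+computes covers in the sense of [<a href="#CCT2000">CCT2000</a>] (a
+simplification of Yioop's spans in that repeated query terms are ignored) and
+sums the reciprocals of their lengths:</p>
+<pre>
+def pscore(doc_terms, query_terms):
+    """Sum of 1/(v - u + 1) over the minimal intervals [u, v] of positions in
+    doc_terms that contain every distinct term of query_terms."""
+    query = set(query_terms)
+    candidates = []
+    for u, term in enumerate(doc_terms):
+        if term not in query:
+            continue
+        needed = set(query)
+        for v in range(u, len(doc_terms)):
+            needed.discard(doc_terms[v])
+            if not needed:
+                candidates.append((u, v))
+                break
+    # an interval is minimal if no candidate that starts later ends by the same spot
+    covers = [(u, v) for (u, v) in candidates
+              if not any(u2 > u and v >= v2 for (u2, v2) in candidates)]
+    return sum(1.0 / (v - u + 1) for (u, v) in covers)
+</pre>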
 <p>Now that we have described how to compute Doc Rank, Relevance, and Proximity
 for each item in a group, we now describe how to get these three values
-for the whole group.
+for the whole group. Since both Relevance and Proximity as we have defined
+them are normalized for document length, it is reasonable to take
+a statistic such as the median or average value to compute the Proximity
+or Relevance for the group. An average has the drawback that a given
+site might be able to skew the statistic and spam the value for a group.
+Since neither Relevance nor Proximity makes
+use of a notion of page importance, a straight median can also be spammed --
+a single domain with lots of pages could skew the median. To address these
+issues, Yioop treats each domain within the group as its
+own subgroup and computes an average proximity value and relevance value for
+that subgroup; it then takes the median of all the subgroup
+values to get the group proximity value and relevance value.
+One thing to notice about groups is that they
+are query dependent: which links to a page have all the query terms depends
+on the query terms. So in coming up with a document rank for a group of
+items we will have introduced a query dependence into our notion of
+document rank. Yioop's scheduling algorithm, which uses company-level
+domains, already makes an attempt at preventing Doc Rank from being easily
+manipulated. So taking a weighted sum of the Doc Ranks of a group seems
+reasonable. Yioop uses three different weights: a weight of 2
+if an item is the summary of a domain name page, a weight of 1 for
+any other summary page item, and a weight of 1/2 for a link item. The
+justification for the slightly lower weight for links is that some
+links have already contributed to the given url being crawled, whereas
+some have not, so the weight of 1/2 was arbitrarily chosen to adjust for
+this.
+</p>
+<p>This completes our description of the Yioop scoring mechanism
+in the conjunctive query case. Given a url `u`, let `[u]` denote
+the set of all items in an index that might be grouped with `u`.
+For a query `q`, many items in `[u]` might not contain all the terms
+in `q` and so, by Yioop's scoring mechanism, do not contribute to the score of
+this result. Let `mbox(Dom)([u])` denote the set of distinct domain names of
+urls in `[u]`. For a url `u'` and domain name `d`, write `u' in d` if
+the domain name of `u'` is `d`. Let `mbox(type)(i)` denote the type of an
+item `i`, one of <i>domain</i>, <i>page</i>, or <i>link</i>, and
+let `mbox(wt)(mbox(type)(i))` denote the weight for that type.
+Using these notations, we can summarize how the scores of a group `[u]` are
+calculated from the scores of its items with the following equations:
+</p>
+<p>
+\begin{eqnarray}
+\mbox{Rel}(q, [u]) &=& \mbox{Median}_{d \in\mbox{Dom}([u]) }(
+    \mbox{Avg}_{i \in d}(\mbox{Rel}(q,i))).\\
+\mbox{Prox}(q, [u]) &=& \mbox{Median}_{d \in\mbox{Dom}([u]) }(
+    \mbox{Avg}_{i \in d}(\mbox{Prox}(q,i))).\\
+\mbox{DR}(q, [u]) &=& \sum_{i\in[u]}\mbox{DR}(i)\cdot \mbox{wt}(\mbox{type}(i)).
+\end{eqnarray}
 </p>
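+<p>
+A minimal Python sketch of this grouping step (for illustration only; the
+dictionary representation of items is an assumption made here, not Yioop's
+internal format):</p>
+<pre>
+from statistics import mean, median
+
+# weights per item type, as described above: 2 for a domain name page summary,
+# 1 for any other summary, and 1/2 for a link item
+TYPE_WEIGHTS = {"domain": 2.0, "page": 1.0, "link": 0.5}
+
+def group_scores(items):
+    """items: list of dicts with keys domain, type, rel, prox, and dr.
+    Returns (relevance, proximity, doc rank) for the whole group."""
+    by_domain = {}
+    for item in items:
+        by_domain.setdefault(item["domain"], []).append(item)
+    rel = median(mean(i["rel"] for i in group) for group in by_domain.values())
+    prox = median(mean(i["prox"] for i in group) for group in by_domain.values())
+    dr = sum(TYPE_WEIGHTS[i["type"]] * i["dr"] for i in items)
+    return rel, prox, dr
+</pre>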

+
     <p><a href="#toc">Return to table of contents</a>.</p>
     <h2 id="references">References</h2>
     <dl>
@@ -790,6 +872,15 @@ pp. 280-290. 2003.
     >The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>.
 In: Seventh International World-Wide Web Conference
 (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.</dd>
+
+<dt id="CCT2000">[CCT2000]</dt>
+<dd>Charles L. A. Clarke and Gordon V. Cormack and Elizabeth A. Tudhope.
+<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.1615&amp;rep=rep1&amp;type=pdf"
+>Relevance Ranking for One to Three Term Queries</a>. In:
+Information Processing &amp; Management. Vol. 36. Iss. 2. pp. 291-311. 2000.
+</dd>
+
 <dt id="CCB2009">[CCB2009]</dt>
 <dd>Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher.
 <a href="http://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf"
@@ -843,5 +934,6 @@ In Proceedings of 3th Annual Text Retrieval Conference. 2004.</dd>
     <p><a href="#toc">Return to table of contents</a>.</p>
 </div>
 <script type="text/javascript"
-   src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_HTMLorMML"></script>
+   src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?
+config=TeX-MML-AM_HTMLorMML"></script>